Special Issue "Computational Linguistics for Low-Resource Languages"

A special issue of Information (ISSN 2078-2489). This special issue belongs to the section "Information Applications".

Deadline for manuscript submissions: closed (15 November 2019).

Special Issue Editor

Dr. Claudia Soria
Guest Editor
Institute of Computational Linguistics "Antonio Zampolli" ILC, Italian National Research Council (CNR-ILC), Pisa, Italy
Interests: computational linguistics, language resources, low-resource languages, digital language diversity

Special Issue Information

Dear Colleagues,

After years of neglect, low-resource languages (be they minority, regional, endangered, or heritage languages) have entered the computational linguistics scene, thanks to the increased availability of digital devices, which strengthens the demand for digital usability of these languages. Preservation, revitalisation, and documentation efforts also call for computational methodologies for these languages, often at the request of the speaker communities themselves.

In addition to their applied interest, low-resource languages are a challenging case for computational linguistics per se. By expanding the range of languages traditionally studied by computational linguistics, low-resource languages often serve as a test-bed for validating current methods and techniques. In an era dominated by big data, for instance, the data sparseness of low-resource languages requires alternative approaches. The limited availability of expert human resources, on the other hand, both calls for and calls into question crowdsourcing approaches, and brings into the picture issues of data protection and community involvement.

The goal of this Special Issue is to collect current research in computational linguistics for low-resource languages for a variety of languages, tasks, and applications. We invite submissions of high-quality, original technical and survey papers addressing both theoretical and practical aspects, including their ethical and social implications. We also hope that this Special Issue will not only represent a showcase for promising research but will also contribute to raising awareness about the importance of maintaining linguistic diversity, an effort to which computational linguistics can make an important contribution.

Dr. Claudia Soria
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All papers will be peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Information is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1000 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Low-resource languages
  • Computational linguistics
  • Natural language processing
  • Linguistic diversity

Published Papers (12 papers)

Research

Open Access Article
Enhancing the Performance of Telugu Named Entity Recognition Using Gazetteer Features
Information 2020, 11(2), 82; https://doi.org/10.3390/info11020082 - 02 Feb 2020
Abstract
Named entity recognition (NER) is a fundamental step for many natural language processing tasks and hence enhancing the performance of NER models is always appreciated. With limited resources being available, NER for South-East Asian languages like Telugu is quite a challenging problem. This paper attempts to improve the NER performance for Telugu using gazetteer-related features, which are automatically generated using Wikipedia pages. We make use of these gazetteer features along with other well-known features like contextual, word-level, and corpus features to build NER models. NER models are developed using three well-known classifiers: conditional random field (CRF), support vector machine (SVM), and margin infused relaxed algorithms (MIRA). The gazetteer features are shown to improve the performance, and the MIRA-based NER model fared better than its counterparts SVM and CRF.
(This article belongs to the Special Issue Computational Linguistics for Low-Resource Languages)

Open Access Article
Towards Language Service Creation and Customization for Low-Resource Languages
Information 2020, 11(2), 67; https://doi.org/10.3390/info11020067 - 27 Jan 2020
Cited by 1
Abstract
The most challenging issue with low-resource languages is the difficulty of obtaining enough language resources. In this paper, we propose a language service framework for low-resource languages that enables the automatic creation and customization of new resources from existing ones. To achieve this goal, we first introduce a service-oriented language infrastructure, the Language Grid; it realizes new language services by supporting the sharing and combining of language resources. We then show the applicability of the Language Grid to low-resource languages. Furthermore, we describe how we can now realize the automation and customization of language services. Finally, we illustrate our design concept by detailing a case study of automating and customizing bilingual dictionary induction for low-resource Turkic languages and Indonesian ethnic languages.
(This article belongs to the Special Issue Computational Linguistics for Low-Resource Languages)

Open Access Article
Viability of Neural Networks for Core Technologies for Resource-Scarce Languages
Information 2020, 11(1), 41; https://doi.org/10.3390/info11010041 - 12 Jan 2020
Abstract
In this paper, the viability of neural network implementations of core technologies (the focus of this paper is on text technologies) for ten resource-scarce South African languages is evaluated. Neural networks are increasingly being used in place of other machine learning methods for many natural language processing tasks with good results. However, in the South African context, where most languages are resource-scarce, very little research has been done on neural network implementations of core language technologies. In this paper, we address this gap by evaluating neural network implementations of four core technologies for ten South African languages. The technologies we address are part-of-speech (POS) tagging, named entity recognition (NER), compound analysis, and lemmatization. Neural architectures that performed well on similar tasks in other settings were implemented for each task and the performance was assessed in comparison with currently used machine learning implementations of each technology. The neural network models evaluated perform better than the baselines for compound analysis, are viable and comparable to the baseline on most languages for POS tagging and NER, and are viable, but not on par with the baseline, for Afrikaans lemmatization.
(This article belongs to the Special Issue Computational Linguistics for Low-Resource Languages)

Open Access Article
Punctuation and Parallel Corpus Based Word Embedding Model for Low-Resource Languages
Information 2020, 11(1), 24; https://doi.org/10.3390/info11010024 - 29 Dec 2019
Abstract
To overcome the data sparseness in word embedding trained in low-resource languages, we propose a punctuation and parallel corpus based word embedding model. In particular, we generate the global word-pair co-occurrence matrix with the punctuation-based distance attenuation function, and integrate it with the intermediate word vectors generated from the small-scale bilingual parallel corpus to train word embedding. Experimental results show that compared with several widely used baseline models such as GloVe and Word2vec, our model significantly improves the performance of word embedding for low-resource languages. Trained on the restricted-scale English-Chinese corpus, our model has improved by 0.71 percentage points in the word analogy task, and achieved the best results in all of the word similarity tasks.
(This article belongs to the Special Issue Computational Linguistics for Low-Resource Languages)

Open Access Article
Improving Basic Natural Language Processing Tools for the Ainu Language
Information 2019, 10(11), 329; https://doi.org/10.3390/info10110329 - 24 Oct 2019
Abstract
Ainu is a critically endangered language spoken by the native inhabitants of northern Japan. This paper describes our research aimed at the development of technology for automatic processing of text in Ainu. In particular, we improved the existing tools for normalizing old transcriptions, word segmentation, and part-of-speech tagging. In the experiments we applied two Ainu language dictionaries from different domains (literary and colloquial) and created a new data set by combining them. The experiments revealed that expanding the lexicon had a positive impact on the overall performance of our tools, especially with test data unrelated to any of the training sets used.
(This article belongs to the Special Issue Computational Linguistics for Low-Resource Languages)

Open Access Article
MiNgMatch—A Fast N-gram Model for Word Segmentation of the Ainu Language
Information 2019, 10(10), 317; https://doi.org/10.3390/info10100317 - 16 Oct 2019
Cited by 1
Abstract
Word segmentation is an essential task in automatic language processing for languages where there are no explicit word boundary markers, or where space-delimited orthographic words are too coarse-grained. In this paper we introduce the MiNgMatch Segmenter, a fast word segmentation algorithm that reduces the problem of identifying word boundaries to finding the shortest sequence of lexical n-grams matching the input text. In order to validate our method in a low-resource scenario involving extremely sparse data, we tested it with a small corpus of text in the critically endangered language of the Ainu people living in northern parts of Japan. Furthermore, we performed a series of experiments comparing our algorithm with systems utilizing state-of-the-art lexical n-gram-based language modelling techniques (namely, the Stupid Backoff model and a model with modified Kneser-Ney smoothing), as well as a neural model performing word segmentation as character sequence labelling. The experimental results we obtained demonstrate the high performance of our algorithm, comparable with the other best-performing models. Given its low computational cost and competitive results, we believe that the proposed approach could be extended to other languages, and possibly also to other natural language processing tasks, such as speech recognition.
(This article belongs to the Special Issue Computational Linguistics for Low-Resource Languages)
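
The reduction described in this abstract (segmentation as finding the shortest sequence of lexical n-grams that matches the input) can be illustrated with a small dynamic-programming sketch. This is not the authors' implementation: the function name `segment`, the toy table `TOY_NGRAMS`, and the English-like example strings are all hypothetical, and the table is assumed to map each unspaced n-gram to its correctly spaced form.

```python
from typing import Dict, List, Optional

# Hypothetical toy n-gram table: unspaced string -> spaced form.
# A real system would derive such a table from a segmented corpus.
TOY_NGRAMS: Dict[str, str] = {
    "the": "the",
    "cat": "cat",
    "sat": "sat",
    "thecat": "the cat",
    "catsat": "cat sat",
}


def segment(text: str, ngrams: Dict[str, str]) -> Optional[str]:
    """Cover `text` with the fewest n-gram entries and return the spaced result.

    Returns None when no sequence of entries matches the whole input.
    """
    n = len(text)
    max_len = max((len(key) for key in ngrams), default=0)

    # best[i]: fewest entries needed to cover text[:i] (None if unreachable).
    best: List[Optional[int]] = [None] * (n + 1)
    # back[i]: start index of the last entry in the best cover of text[:i].
    back: List[int] = [-1] * (n + 1)
    best[0] = 0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] is None or text[start:end] not in ngrams:
                continue
            candidate = best[start] + 1
            if best[end] is None or candidate < best[end]:
                best[end] = candidate
                back[end] = start

    if best[n] is None:
        return None

    # Walk the back-pointers from the end to recover the chosen entries.
    pieces: List[str] = []
    pos = n
    while pos > 0:
        start = back[pos]
        pieces.append(ngrams[text[start:pos]])
        pos = start
    return " ".join(reversed(pieces))


print(segment("thecatsat", TOY_NGRAMS))
```

Preferring fewer, longer n-grams means that word boundaries inside a matched n-gram come from attested usage rather than from independent per-word decisions; how the published system scores and breaks ties between competing covers is not stated in the abstract and is not modelled here.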

Open Access Article
A Sustainable and Open Access Knowledge Organization Model to Preserve Cultural Heritage and Language Diversity
Information 2019, 10(10), 303; https://doi.org/10.3390/info10100303 - 28 Sep 2019
Cited by 2
Abstract
This paper proposes a new collaborative and inclusive model for Knowledge Organization Systems (KOS) for sustaining cultural heritage and language diversity. It is based on contributions of end-users as well as scientific and scholarly communities from across borders, languages, nations, continents, and disciplines. It consists of collecting knowledge about all worldwide translations of one original work and sharing that data through a digital and interactive global knowledge map. Collected translations are processed in order to build multilingual parallel corpora for a large number of under-resourced languages as well as to highlight the transnational circulation of knowledge. Building such corpora is vital in preserving and expanding linguistic and traditional diversity. Our first experiment was conducted on the world-famous and well-traveled American novel Adventures of Huckleberry Finn by the American author Mark Twain. This paper reports on 10 parallel corpora that are now sentence-aligned pairs of English with Basque (a European under-resourced language), Bulgarian, Dutch, Finnish, German, Hungarian, Polish, Portuguese, Russian, and Ukrainian, processed out of 30 collected translations.
(This article belongs to the Special Issue Computational Linguistics for Low-Resource Languages)

Open Access Article
Subunits Inference and Lexicon Development Based on Pairwise Comparison of Utterances and Signs
Information 2019, 10(10), 298; https://doi.org/10.3390/info10100298 - 26 Sep 2019
Cited by 1
Abstract
Communication languages convey information through the use of a set of symbols or units. Typically, this unit is the word. When developing language technologies, as words in a language do not have the same prior probability, there may not be sufficient training data to model each word. Furthermore, the training data may not cover all possible words in the language. Due to these data sparsity and word unit coverage issues, language technologies employ modeling of subword units or subunits, which are based on prior linguistic knowledge. For instance, the development of speech technologies such as automatic speech recognition systems presumes that there exists a phonetic dictionary or at least a writing system for the target language. Such knowledge is not available for all languages in the world. In that direction, this article develops a hidden Markov model-based abstract methodology to extract subword units given only pairwise comparison between utterances (or realizations of words in the mode of communication), i.e., whether two utterances correspond to the same word or not. We validate the proposed methodology through investigations on spoken language and sign language. In the case of spoken language, we demonstrate that the proposed methodology can lead to the discovery of a phone set and the development of a phonetic dictionary. In the case of sign language, we demonstrate how hand movement information can be effectively modeled for sign language processing and synthesized back to gain insight about the derived subunits.
(This article belongs to the Special Issue Computational Linguistics for Low-Resource Languages)

Open Access Article
Terminology Translation in Low-Resource Scenarios
Information 2019, 10(9), 273; https://doi.org/10.3390/info10090273 - 30 Aug 2019
Cited by 1
Abstract
Evaluating term translation quality in machine translation (MT), which is usually done by domain experts, is a time-consuming and expensive task. In fact, this is unimaginable in an industrial setting where customised MT systems often need to be updated for many reasons (e.g., availability of new training data, leading MT techniques). To the best of our knowledge, as of yet, there is no publicly available solution to evaluate terminology translation in MT automatically. Hence, there is a genuine need to have a faster and less expensive solution to this problem, which could help end-users to identify term translation problems in MT instantly. This study presents a faster and less expensive strategy for evaluating terminology translation in MT. High correlations of our evaluation results with human judgements demonstrate the effectiveness of the proposed solution. The paper also introduces a classification framework, TermCat, that can automatically classify term translation-related errors and expose specific problems in relation to terminology translation in MT. We carried out our experiments with a low-resource language pair, English–Hindi, and found that our classifier, whose accuracy varies across the translation directions, error classes, the morphological nature of the languages, and MT models, generally performs competently in the terminology translation classification task.
(This article belongs to the Special Issue Computational Linguistics for Low-Resource Languages)

Open Access Article
The Usefulness of Imperfect Speech Data for ASR Development in Low-Resource Languages
Information 2019, 10(9), 268; https://doi.org/10.3390/info10090268 - 28 Aug 2019
Cited by 2
Abstract
When the National Centre for Human Language Technology (NCHLT) Speech corpus was released, it created various opportunities for speech technology development in the 11 official, but critically under-resourced, languages of South Africa. Since then, the substantial improvements in acoustic modeling that deep architectures achieved for well-resourced languages ushered in a new data requirement: their development requires hundreds of hours of speech. A suitable strategy for the enlargement of speech resources for the South African languages is therefore required. The first possibility was to look for data that has already been collected but has not been included in an existing corpus. Additional data was collected during the NCHLT project that was not included in the official corpus, which contains only a curated but limited subset of the data. In this paper, we first analyze the additional resources that could be harvested from the auxiliary NCHLT data. We also measure the effect of this data on acoustic modeling. The analysis incorporates recent factorized time-delay neural networks (TDNN-F). These models significantly reduce phone error rates for all languages. In addition, data augmentation and cross-corpus validation experiments for a number of the datasets illustrate the utility of the auxiliary NCHLT data.
(This article belongs to the Special Issue Computational Linguistics for Low-Resource Languages)

Open Access Article
Crowdsourcing the Paldaruo Speech Corpus of Welsh for Speech Technology
Information 2019, 10(8), 247; https://doi.org/10.3390/info10080247 - 25 Jul 2019
Abstract
Collecting speech data for a low-resource language is challenging when funding and resources are limited. This paper describes the process of designing, creating and using the Paldaruo Speech Corpus for developing speech technology for Welsh. Specifically, this paper focuses on the crowdsourcing of data using an app on smartphones and mobile devices, allowing speakers from across Wales to contribute. We discuss the development of reading prompts: isolated words and full sentences, as well as the metadata collected from contributors. We also provide background on the design of the Paldaruo App as well as the main uses for the corpus and its availability and licensing. The corpus was designed for the development of speech recognition for Welsh and has been used to create a number of other resources. These methods can be extended to other languages, and suggestions for other low-resource languages are discussed.
(This article belongs to the Special Issue Computational Linguistics for Low-Resource Languages)

Open Access Article
Text Filtering through Multi-Pattern Matching: A Case Study of Wu–Manber–Uy on the Language of Uyghur
Information 2019, 10(8), 246; https://doi.org/10.3390/info10080246 - 24 Jul 2019
Cited by 1
Abstract
Given its generality in applications and its high time-efficiency on big data sets, the technique of text filtering through pattern matching has in recent years been attracting increasing attention from the information retrieval and natural language processing (NLP) research communities at large. However, it has yet to be seen how this technique and its algorithms (e.g., Wu–Manber, which is also considered in this paper) can be applied and adopted properly and effectively to Uyghur, a low-resource language that is mostly spoken by the ethnic Uyghur group, with a population of more than eleven million, in Xinjiang, China. We observe that, technically, the challenge is mainly caused by two factors: (1) vowel weakening and (2) mismatching in semantics between affixes and stems. Accordingly, in this paper, we propose Wu–Manber–Uy, an improved variant of Wu–Manber dedicated to the Uyghur language. Wu–Manber–Uy implements a stem deformation-based pattern expansion strategy, specifically for reducing the mismatching of patterns caused by vowel weakening and spelling errors. A two-way strategy that monitors and controls changes in the lexical meaning of stems during word-building is also used in Wu–Manber–Uy. Additional considerations with respect to Word2vec and the dictionary are incorporated into the system for processing Uyghur. The experimental results we have obtained consistently demonstrate the high performance of Wu–Manber–Uy.
(This article belongs to the Special Issue Computational Linguistics for Low-Resource Languages)