Open Access Article
Algorithms 2018, 11(4), 39; https://doi.org/10.3390/a11040039

On Hierarchical Text Language-Identification Algorithms

1 School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
2 Key Multi-lingual Laboratory of Xinjiang, Urumqi 830046, China
3 Department of Computer, Hotan Teachers College, Hotan 848000, China
* Author to whom correspondence should be addressed.
Received: 7 February 2018 / Revised: 23 March 2018 / Accepted: 23 March 2018 / Published: 27 March 2018

Abstract

Text on the Internet is written in many languages and scripts, which can be divided into language groups. Most language identification errors occur between similar languages. To improve the performance of short-text language identification, we propose hierarchical language identification methods with four different levels and compare them experimentally. The algorithms were evaluated on sentences from 97 languages, and the four-stage method reached a macro-averaged F1-score of 0.9799. The experimental results show that performance improved with each added stage: script identification, language group identification, and similar language group identification. Notably, identification accuracy between similar languages improved substantially. We also investigated how foreign content in a language affects language identification.
Keywords: language identification; character N-gram; script identification; language group identification; similar language identification
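
The staged idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only two of the four stages (script identification, then language identification within that script), and the language group and similar language group stages would be further routing layers of the same kind. The function names, the toy training sentences, the trigram size, and the frequency-based score are all assumptions made for illustration.

```python
import unicodedata
from collections import Counter


def script_of(text):
    """Stage 1 (assumed heuristic): guess the dominant script of a text."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # The first word of a Unicode character name is usually the
            # script, e.g. "LATIN" or "CYRILLIC".
            counts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


def char_ngrams(text, n=3):
    """Count character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def train(samples):
    """Build {script: {language: n-gram profile}} from {language: [sentences]}."""
    models = {}
    for lang, sentences in samples.items():
        profile = Counter()
        for sentence in sentences:
            profile.update(char_ngrams(sentence))
        script = script_of(" ".join(sentences))
        models.setdefault(script, {})[lang] = profile
    return models


def identify(text, models):
    """Route by script first, then pick the best n-gram match in that script."""
    candidates = models.get(script_of(text))
    if not candidates:
        return None  # no language was trained for this script
    grams = char_ngrams(text)

    def score(profile):
        # Sum of relative training frequencies of the text's n-grams.
        total = sum(profile.values())
        return sum(count * profile[g] / total for g, count in grams.items())

    return max(candidates, key=lambda lang: score(candidates[lang]))


# Toy training data, invented for this example only.
samples = {
    "en": ["the quick brown fox jumps over the lazy dog",
           "this is a sentence in english"],
    "de": ["der schnelle braune fuchs springt über den faulen hund",
           "das ist ein satz auf deutsch"],
    "ru": ["быстрая коричневая лиса прыгает через ленивую собаку",
           "это предложение на русском языке"],
}
models = train(samples)
```

With this toy data, `identify("это русский текст", models)` is resolved at the script stage alone, because only one trained language uses Cyrillic, while `identify("the dog is lazy", models)` must fall through to the n-gram comparison between the two Latin-script languages. Narrowing the candidate set before the final comparison is the point of the hierarchy.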
This is an open access article distributed under the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Cite This Article

MDPI and ACS Style

Hasimu, M.; Silamu, W. On Hierarchical Text Language-Identification Algorithms. Algorithms 2018, 11, 39.

Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers.

Algorithms, EISSN 1999-4893, published by MDPI AG, Basel, Switzerland.