Open Access Article

On Hierarchical Text Language-Identification Algorithms

1 School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
2 Key Multi-lingual Laboratory of Xinjiang, Urumqi 830046, China
3 Department of Computer, Hotan Teachers College, Hotan 848000, China
* Author to whom correspondence should be addressed.
Algorithms 2018, 11(4), 39; https://doi.org/10.3390/a11040039
Received: 7 February 2018 / Revised: 23 March 2018 / Accepted: 23 March 2018 / Published: 27 March 2018
Text on the Internet is written in many languages and scripts, which can be divided into language groups. Most language identification errors occur between similar languages. To improve the performance of short-text language identification, we propose hierarchical language identification methods at four different levels and compare them experimentally. The algorithms were evaluated on sentences from 97 languages, and four-stage language identification reached a macro-averaged F1-score of 0.9799. The experimental results verified that the performance of the language identification algorithm improved with each added stage: script identification, language group identification, and similar language group identification. Notably, identification accuracy between similar languages improved substantially. We also investigated how foreign-language content in a text affects language identification.
Keywords: language identification; character N-gram; script identification; language group identification; similar language identification
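As a rough illustration of the hierarchical idea summarized above, the following minimal sketch narrows the candidate set by script first and then compares character trigram profiles only among languages that share that script. It is not the authors' implementation: it covers two stages rather than four, it approximates the script by the first word of each character's Unicode name, and it uses cosine similarity over raw trigram counts. The names `detect_script`, `char_ngrams`, `cosine`, and `HierarchicalIdentifier` are hypothetical, and the training sentences are toy data.

```python
import math
import unicodedata
from collections import Counter


def detect_script(text):
    """Stage 1 (heuristic): guess the dominant script of the letters in text.

    Assumption: the first word of a character's Unicode name (for example
    'LATIN' or 'CYRILLIC') is used as a stand-in for its script.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


def char_ngrams(text, n=3):
    """Count character n-grams of the lowercased text, padded with spaces."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class HierarchicalIdentifier:
    """Two-stage identifier: script first, then language within that script."""

    def __init__(self):
        self.profiles = {}  # script -> {language -> trigram Counter}

    def train(self, samples):
        """samples: iterable of (language_code, sentence) pairs."""
        for lang, text in samples:
            script = detect_script(text)
            by_lang = self.profiles.setdefault(script, {})
            by_lang.setdefault(lang, Counter()).update(char_ngrams(text))

    def identify(self, text):
        """Return (script, language); language is None for an unseen script."""
        script = detect_script(text)
        candidates = self.profiles.get(script, {})
        if not candidates:
            return script, None
        grams = char_ngrams(text)
        # Stage 2: only languages written in the detected script compete.
        best = max(candidates, key=lambda lang: cosine(grams, candidates[lang]))
        return script, best


# Toy training data (illustrative only).
identifier = HierarchicalIdentifier()
identifier.train([
    ("en", "the quick brown fox jumps over the lazy dog and the cat is on the mat"),
    ("de", "der schnelle braune fuchs springt über den faulen hund und die katze"),
    ("ru", "привет как дела сегодня мир"),
    ("uk", "привіт як справи сьогодні світ"),
])
print(identifier.identify("the dog is on the mat"))
print(identifier.identify("привет мир"))
```

Restricting the second stage to one script keeps a Cyrillic sentence from ever being compared with Latin-script profiles. Further stages for language groups and similar-language groups could be added by nesting the profile dictionary one level deeper per stage.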

MDPI and ACS Style

Hasimu, M.; Silamu, W. On Hierarchical Text Language-Identification Algorithms. Algorithms 2018, 11, 39.
