On Hierarchical Text Language-Identification Algorithms
Abstract: Text on the Internet is written in many languages and scripts, which can be divided into language groups. Most errors in language identification occur between similar languages. To improve the performance of short-text language identification, we propose hierarchical language identification methods with four different levels and compare them experimentally. The algorithms were evaluated on sentences from 97 languages, and the macro-averaged F1-score of four-stage language identification reached 0.9799. The experimental results verified that performance improved with each added stage: script identification, language group identification, and similar language group identification. Notably, identification accuracy between similar languages improved substantially. We also investigated how foreign content in a language affects language identification.
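The staged narrowing described in the abstract (script first, then progressively smaller groups of candidate languages) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy hierarchy `TREE`, the one-sentence training samples in `SAMPLES`, and the character-trigram overlap score are hypothetical stand-ins and are not the classifiers, language groups, or data used in the paper.

```python
import unicodedata
from collections import Counter

# Toy training text per language (illustrative only, not the paper's corpus).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and runs away",
    "de": "der schnelle braune fuchs springt über den faulen hund und läuft weg",
    "ru": "быстрая коричневая лиса прыгает через ленивую собаку",
}

# Hypothetical hierarchy: script -> language group -> languages.
TREE = {
    "LATIN": {"germanic": ["en", "de"]},
    "CYRILLIC": {"east_slavic": ["ru"]},
}


def trigrams(text):
    """Count character trigrams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}


def identify_script(text):
    """Stage 1: majority vote over the first word of each letter's Unicode name."""
    counts = Counter(
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    )
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


def leaves(node):
    """Return all languages below a node (a list of languages or a dict of children)."""
    if isinstance(node, list):
        return node
    return [lang for child in node.values() for lang in leaves(child)]


def score(text_grams, langs):
    """Best trigram overlap between the text and any language in langs."""
    return max(sum((text_grams & PROFILES[lang]).values()) for lang in langs)


def identify(text):
    """Walk the hierarchy top-down; return the path of decisions, or None."""
    script = identify_script(text)
    node = TREE.get(script)
    if node is None:
        return None
    path = [script]
    grams = trigrams(text)
    # Later stages: at each level keep only the best-scoring group.
    while isinstance(node, dict):
        name = max(node, key=lambda key: score(grams, leaves(node[key])))
        path.append(name)
        node = node[name]
    # Final stage: choose among the remaining similar languages.
    path.append(max(node, key=lambda lang: score(grams, [lang])))
    return path
```

Calling `identify("der hund springt")` returns the full decision path rather than only the final label, which makes it easy to see at which stage a misclassification occurred. Because each stage only compares candidates that survived the previous one, the final decision is made among a small set of similar languages instead of the full label set.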
Cite This Article
Hasimu, M.; Silamu, W. On Hierarchical Text Language-Identification Algorithms. Algorithms 2018, 11, 39.
Hasimu M, Silamu W. On Hierarchical Text Language-Identification Algorithms. Algorithms. 2018; 11(4):39.
Chicago/Turabian Style
Hasimu, Maimaitiyiming; Silamu, Wushour. 2018. "On Hierarchical Text Language-Identification Algorithms." Algorithms 11, no. 4: 39.
Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers.