Open Access Article

The Effect of Preprocessing on Arabic Document Categorization

School of Information Science and Engineering, Central South University, Changsha 410000, China
College of Computer Science and Electrical Engineering, Hunan University, Changsha 410000, China
Author to whom correspondence should be addressed.
Academic Editor: Tom Burr
Algorithms 2016, 9(2), 27
Received: 21 January 2016 / Revised: 30 March 2016 / Accepted: 12 April 2016 / Published: 18 April 2016

Preprocessing is one of the main components of a conventional document categorization (DC) framework. This paper highlights the effect of preprocessing tasks on the efficiency of an Arabic DC system. Three classification techniques are used: naive Bayes (NB), k-nearest neighbor (KNN), and support vector machine (SVM). Experimental analysis on Arabic datasets reveals that preprocessing techniques have a significant impact on classification accuracy, especially given the complicated morphological structure of the Arabic language. Choosing an appropriate combination of preprocessing tasks significantly improves categorization accuracy, depending on the feature size and the classification technique. The findings show that SVM outperformed KNN and NB, achieving a micro-F1 value of 96.74% when normalization and stemming were combined as preprocessing tasks.
Keywords: document categorization; text preprocessing; stemming techniques; classification techniques; Arabic language processing
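
The abstract names normalization and stemming as the preprocessing tasks and micro-F1 as the evaluation measure, but does not spell out the exact rules used. The following is a minimal Python sketch of what such a pipeline might look like. The normalization rules, the affix lists, and the function names (`normalize`, `light_stem`, `preprocess`, `micro_f1`) are illustrative assumptions based on common Arabic text-processing conventions, not the authors' implementation, and the classifiers themselves (NB, KNN, SVM) are left out.

```python
import re

# Assumed normalization rules (common conventions; the abstract does not
# list the rules used in the paper).
DIACRITICS = re.compile(r"[\u064B-\u0652]")           # fathatan .. sukun
TATWEEL = "\u0640"                                    # elongation character
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")   # alef with madda/hamza


def normalize(text: str) -> str:
    """Strip diacritics and tatweel, and unify common letter variants."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    text = ALEF_VARIANTS.sub("\u0627", text)   # any alef variant -> bare alef
    text = text.replace("\u0649", "\u064A")    # alef maksura -> yeh
    text = text.replace("\u0629", "\u0647")    # teh marbuta -> heh
    return text


# Hypothetical affix lists for a very small light stemmer (longest first).
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "يه", "ه", "ي")


def light_stem(token: str, min_len: int = 3) -> str:
    """Remove at most one prefix and one suffix, keeping >= min_len letters."""
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= min_len:
            token = token[len(prefix):]
            break
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_len:
            token = token[:-len(suffix)]
            break
    return token


def preprocess(text: str) -> list[str]:
    """Normalize, split on whitespace, then light-stem each token."""
    return [light_stem(tok) for tok in normalize(text).split()]


def micro_f1(y_true, y_pred, labels) -> float:
    """Micro-averaged F1: pool TP, FP, FN over all labels, then compute F1."""
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

In a single-label setting where every class is included in `labels`, the pooled micro-F1 coincides with plain accuracy. The tokens returned by `preprocess` would then feed a feature extractor and one of the classifiers named in the abstract.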

This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).

MDPI and ACS Style

Ayedh, A.; TAN, G.; Alwesabi, K.; Rajeh, H. The Effect of Preprocessing on Arabic Document Categorization. Algorithms 2016, 9, 27.

Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers.

Algorithms, EISSN 1999-4893, published by MDPI AG, Basel, Switzerland