Text classification is an essential component of many applications, such as spam detection and sentiment analysis. With the growing volume of textual documents and datasets generated through social media and news articles, machine learning methods are increasingly needed for accurate text classification. In this paper, a comprehensive evaluation of multiple supervised learning models, namely logistic regression (LR), decision trees (DT), support vector machine (SVM), AdaBoost (AB), random forest (RF), multinomial naive Bayes (NB), multilayer perceptrons (MLP), and gradient boosting (GB), was conducted to assess the efficiency, robustness, and limitations of these models in classifying textual data. SVM, LR, and MLP generally performed better, with SVM performing best, while DT and AB had markedly lower accuracies than the other models tested. The use of different SVM kernels was explored further, demonstrating the advantage of linear kernels over polynomial, sigmoid, and radial basis function kernels for text classification. The effects of removing stop words on model performance were also investigated; DT performed better with stop words removed, while all other models were relatively unaffected by the presence or absence of stop words.
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.