Twenty Years of Machine-Learning-Based Text Classification: A Systematic Review

Abstract: Machine-learning-based text classification is one of the leading research areas and has a wide range of applications, including spam detection, hate speech identification, review rating summarization, sentiment analysis, and topic modelling. Machine-learning-based studies differ widely in terms of the datasets, training methods, performance evaluation, and comparison methods used. In this paper, we surveyed 224 papers published between 2003 and 2022 that employed machine learning for text classification. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement was used as the guideline for the systematic review process. The differences in the literature are analyzed in terms of six aspects: datasets, machine learning models, best accuracy, performance evaluation metrics, training and testing splitting methods, and comparisons among machine learning models. Furthermore, we highlight the limitations and research gaps in the literature. Although the research works included in the survey perform well in terms of text classification, improvement is required in many areas. We believe that this survey will be useful for researchers in the field of text classification.


Introduction
Machine learning models provide the best alternative to traditional methods in the field of text classification. Text-classification research has been conducted extensively in recent years to improve the performance of machine learning models [1]. Text classification can be done manually, but it is time-consuming and costly. Manual classification is also error-prone and less accurate because of human error and a lack of domain knowledge. There was a major shift in text classification when machine learning models, such as Support Vector Machines (SVM), Naive Bayes (NB), and Random Forest (RF), started to replace manual work, because these models not only reduce the time and cost but are also highly accurate for classification [2]. Since the inception of machine learning models, numerous studies have been conducted to enhance, optimize, and refine the text classification process.
Text classification is done in four stages: (a) pre-processing, (b) text representation, (c) feature selection, and, finally, (d) classification [19]. The first stage is pre-processing, in which the input is cleaned and shaped according to the needs of the classification task and the noise present in the input is removed [20]. Next, the input text is converted into a representation such as bag of words or n-grams. Feature selection is an optional step that involves identifying and keeping only the important features; the size of the input is largely reduced at this stage [21]. Finally, a classifier is trained on the resulting features to assign each document to a class.
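As a concrete illustration of these four stages (not drawn from any surveyed paper), the following minimal scikit-learn sketch chains a bag-of-words representation, an optional chi-squared feature selection step, and an SVM classifier; the toy corpus, labels, and parameter values are purely illustrative.

```python
# Minimal sketch of the four-stage text classification pipeline using scikit-learn.
# The corpus, labels, and parameters are illustrative placeholders.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC

corpus = ["free prize waiting for you", "meeting moved to friday",
          "win cash now", "project report attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

pipeline = Pipeline([
    ("represent", CountVectorizer(lowercase=True, stop_words="english")),  # (a) pre-processing + (b) bag of words
    ("select", SelectKBest(chi2, k=5)),                                    # (c) optional feature selection
    ("classify", LinearSVC()),                                             # (d) classification
])

pipeline.fit(corpus, labels)
print(pipeline.predict(["win a free prize"]))
```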
Machine learning models have been applied to text classification and have produced promising results in various fields, such as finance [22], tourism [23], healthcare [24], and online news analysis [25]. This paper reviews how, when, and where these models have been successful in text classification. We studied the pros and cons of many models separately and determined how they perform on each dataset. We considered papers published from 2003 onward, and the findings will help the research community to improve classification in the future. Furthermore, we present a few gaps noticed in the literature, along with a summary of some interesting future works in the text classification domain.
Previous survey papers on text classification [2,26–31] discussed various aspects of text classification, such as feature extraction techniques, algorithms, evaluation methods, and limitations. They also provided an overview of deep-learning-based text classification models, popular datasets, future research directions, and a comparison of different methodologies. These papers also highlighted the strengths and limitations of traditional text classification methods and suggested directions for future work. Thangaraj et al. [26] examined articles on text classification techniques in Artificial Intelligence (AI) written between 2010 and 2017 and grouped the techniques according to the algorithms involved. The results were visualized as a tree structure to show the relationship between learning procedures and algorithms, and the paper identified the strengths, limitations, and current research trends in text classification.
Mironczuk et al. [31] presented an overview of the state of the art in text classification by identifying and studying key and recent studies and objectives in this discipline. The paper covered six fundamental parts of text classification and analyzed the connected works qualitatively and quantitatively.
Kowsari et al. [2] presented a brief discussion of various text feature extraction techniques, dimensionality reduction methods, existing algorithms, and evaluation methods. The limitations of each technique were also discussed, along with their applications in real-world problems.
Wu et al. [30] examined text categorization models, such as the Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Attention Mechanisms, and others. The paper summarized the shortcomings of traditional text classification methods and introduced the deep-learning-based text classification process.
Minaee et al. [27] conducted a comprehensive review of more than 150 deep-learning-based text classification models and discussed their technical contributions, similarities, and strengths. The paper also provided a summary of more than 40 popular text classification datasets and discussed future research directions.
Bayer et al. [28] surveyed data augmentation methods for textual classification and categorized more than 100 methods into 12 different groupings based on a taxonomy. The paper provides cutting-edge references and highlights promising methods, as well as providing research perspectives for future work.
Li et al. [29] covered state-of-the-art approaches to text classification from 1961 to 2021, with an emphasis on models ranging from classical to deep learning. The paper developed a text classification taxonomy and provided a comparison of different methodologies, outlining their benefits and drawbacks. It also summarized major implications, prospective research objectives, and obstacles in the studied area.
Our work is different from the previously discussed studies in that it focuses on machine-learning-based text classification. We conducted a systematic review of 224 papers published between 2003 and 2022 that employ machine learning for text classification. This paper analyzes the differences in the literature in terms of six aspects: datasets, machine learning models, best accuracy, performance evaluation metrics, training and testing splitting methods, and comparisons among machine learning models. While other studies provide more specific perspectives on text classification techniques, such as deep learning models, data augmentation methods, and a comparison of classical to deep learning models, this article provides a broader and more comprehensive view of the field of machine-learning-based text classification.
The main contributions of this paper are the answers to the following research questions:
1. What is the most frequently used dataset for machine-learning-based text classification?
2. What are the frequencies at which machine learning models are used?
3. What is the maximum accuracy for each dataset?
4. What is the most frequently used performance evaluation metric?
5. What is the most successful train-test split method?
6. How do different machine learning models compare?
The paper is organized as follows: Section 2 describes the survey methodology. Section 3 presents a summary of all of the surveyed papers. Section 4 discusses the solutions proposed by various researchers for some of the common problems in text classification. Section 5 contains our notable observations on text classification. Lastly, the conclusions are presented in Section 6.

Survey Methodology
To conduct this review, we used the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement [32]. Systematic reviews frequently suffer from a lack of shared principles that would allow them to be replicated and scientifically sound. PRISMA is a standard peer-reviewed approach that employs a guideline checklist, which was closely followed in this manuscript, to contribute to the quality assurance and replicability of the review process. A review protocol detailing the article selection criteria, search strategy, data extraction, and data analysis techniques was created.
In this review, only peer-reviewed research papers published between 2003 and 2022 and written in English were considered, and only applications of classification using textual data were included. The research papers were identified through the keywords listed in Table 1 via sources such as IEEE, Science Direct, Springer, ACM Digital Library, MDPI, and Hindawi. Initial screening was done at the title and abstract levels. The full text was then retrieved, and only works related to text-based classification were included in this survey. Figure 1 presents the selection process for the papers in this survey. The entire paper list and other data are attached in the appendix. The numbers of papers considered in this survey from 2003 to 2022 are depicted in Figure 2.

Overview of the Survey Results
To answer the six research questions presented in Section 1, we summarize and categorize the literature according to several factors, which are presented in this section.

Study on the Dataset
In the surveyed papers, 56 distinct datasets were employed, with the three most commonly used being 20Newsgroup, Reuters, and Webkb, each of which was employed in a significant number of studies. Some datasets, such as PAN-12, Tigrinya, and Emotion616, were rarely used. A total of 17.85% of the surveyed datasets used binary classification. Table 2 summarizes the frequency of the top 10 datasets used by the papers in the survey, and Figure 3 displays the frequency of dataset use in the literature.

Study on Machine Learning Models
We found that SVM is the most frequently used machine learning model for text classification, appearing in 118 papers. NB and kNN are the next most popular models for text classification. The maximum accuracy of 98.88% was obtained by SVM on the 20Newsgroup dataset. The frequency of each machine learning model, along with the maximum accuracy obtained, is presented in Table 3. Figure 4 displays the best accuracy level for each dataset, and Figure 5 shows the frequency of machine learning model use in the literature.

Study on Accuracy
Accuracy is one of the most important factors used to evaluate the performance of a machine learning model. A high accuracy value indicates that the model has learned the relationships among the input samples well and is ready to classify future samples. Table 4 lists a few popular datasets and their maximum accuracy levels.

Study on Performance Evaluation
Many performance evaluation metrics can be used to validate the efficiency of a machine learning model. These metrics can also be used to measure the correctness of the training process of a machine learning model. Most papers use multiple performance evaluation metrics to validate the model. Twenty-four unique metrics were used in the surveyed papers, including ROC, the Jaccard Similarity Score, and RMSE. Table 5 displays the frequencies of the various performance metrics.
Accuracy is the standard measure for classification; however, if the dataset is skewed, accuracy can be misleading [54]. For an imbalanced dataset, F1 is the best metric to use [55]. If deep learning models are used and precision seems to be dropping, macro-F1 can be used as an alternative [56]. There are 42 unique combinations of metrics in the surveyed papers. Table 6 presents the top 10 combinations and their respective frequencies.
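The following small example (with invented labels, not drawn from any surveyed paper) illustrates why accuracy alone can mislead on a skewed dataset: a classifier that always predicts the majority class still reaches 95% accuracy, while F1 on the minority class exposes the failure.

```python
# Illustration of accuracy vs. F1 on a skewed dataset; the labels are toy values.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0] * 95 + [1] * 5          # 95% of samples belong to the majority class
y_pred = [0] * 100                   # a classifier that always predicts the majority class

print(accuracy_score(y_true, y_pred))              # 0.95, which looks good
print(f1_score(y_true, y_pred, pos_label=1))       # 0.0, exposing the failure on the minority class
print(f1_score(y_true, y_pred, average="macro"))   # macro-F1 averages the per-class F1 scores
```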

Study on Train-Test Splits
Each supervised machine learning model has two stages: training and testing. The model is first given the training set to learn the relationships among the input samples. Once training has been completed, a new set, known as the testing set, is fed into the classifier. This time, the classifier uses the previously learned knowledge to predict the classes of the samples in the testing set, and these predictions are validated. The performance evaluation metrics discussed in the previous subsection are applied at this stage. Figure 6 displays the frequency of the train-test splits. 10-Fold cross-validation is the most widely used train-test split method (count = 45), followed by 5-Fold (count = 29).
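A minimal sketch of 10-fold cross-validation, the most widely used split in the surveyed papers, is shown below; the 20Newsgroup subset, TF-IDF features, and linear SVM are illustrative choices rather than the setup of any particular study.

```python
# Sketch of 10-fold cross-validation on a small 20Newsgroup subset (illustrative setup).
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
model = make_pipeline(TfidfVectorizer(), LinearSVC())

scores = cross_val_score(model, data.data, data.target, cv=10, scoring="accuracy")
print(scores.mean(), scores.std())   # mean accuracy and its variability across the 10 folds
```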

Study on Machine Learning Algorithms
In this subsection, we present a comparison of different machine learning algorithms.

Definitions of Positive and Negative Accuracies:
In a paper, if there is a comparison between two machine learning models X and Y with accuracy levels a1 and a2, and a1 > a2, then X has a positive accuracy compared with Y, and Y has a negative accuracy compared with X.

Support Vector Machine (SVM)
SVM is the most commonly used model in the literature. It is both a linear and a nonlinear classifier that can perform well, especially in multilabel scenarios [57]. The core of SVM is the kernel; choosing and optimizing the correct kernel increases the classification accuracy [58]. Figure 7 shows the number of papers in which SVM performs better than other algorithms (positive accuracies) and the number of papers in which other algorithms perform better than SVM (negative accuracies).

k-Nearest Neighbors (kNN)

kNN is a classifier that predicts the class of an instance based on its nearest neighbors. The value of k determines the number of nearest neighbors to consider and is usually chosen to be an odd number to avoid ties [63]. kNN can also be a good model for removing extreme values [64]. Figure 9 compares kNN with other machine learning models.

Decision Tree (DT)

The Decision Tree classifier constructs a tree-like structure consisting of various branches and classifies samples by passing them through the branches. Both numerical and categorical data can be used in a DT. Figure 10 compares the DT with other algorithms. The advantage of the DT is that it can easily process high-dimensional data; its limitation is that it is not a stable classifier. Table 10 shows the top 5 DT accuracy levels.

Random Forest (RF)

RF is an ensemble of multiple DTs, and the final classification result is decided by a majority vote among all of the DTs. RF can efficiently manage thousands of features. Figure 11 presents an accuracy comparison of RF with other models. One of the main reasons for choosing RF is that it gives good accuracy on nonlinear data; its drawback is that overfitting can easily occur. Table 11 presents the top 5 RF accuracy levels in our survey.

Logistic Regression (LR)

Logistic Regression is a statistical method for performing binary classification; however, LR can also be extended to multiclass classification. LR assumes a linear relationship between the input features and the log-odds of the class. Figure 12 displays an accuracy comparison of LR with other machine learning models. The advantage of LR is that it performs well on smaller datasets; its drawback is that it cannot classify continuous variables. Table 12 displays the top 5 LR accuracy levels in our survey.

Summary of Machine Learning Classifiers

Finally, the performances of the machine learning classifiers (DT, LR, NB, RF, SVM, kNN) on each dataset (20Newsgroup, Amazon Review, Bike Review, Blogger, Chinese Microblog, Counter, Gold, IMDb, PAN-12, Reuters, Spam-1000, Twitter, Webkb) are summarized in Figure 13. The best and most stable performance was achieved by SVM, while RF generally had the worst and most unstable (i.e., largest-variance) performance.
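The kind of side-by-side comparison summarized in Figure 13 can be sketched as follows; the dataset subset, feature representation, and hyperparameters are illustrative assumptions, not the settings used in the surveyed papers.

```python
# Sketch of comparing the six classifiers on the same TF-IDF features via cross-validation.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

data = fetch_20newsgroups(subset="train", categories=["sci.med", "talk.politics.misc"])
models = {
    "SVM": LinearSVC(),
    "NB": MultinomialNB(),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "DT": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(n_estimators=100),
    "LR": LogisticRegression(max_iter=1000),
}
for name, clf in models.items():
    pipe = make_pipeline(TfidfVectorizer(), clf)
    acc = cross_val_score(pipe, data.data, data.target, cv=5).mean()
    print(f"{name}: {acc:.3f}")
```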

Deep-Learning-Based Models for Text Classification
Most machine-learning-based models for text classification rely on bag of words or term frequencies. However, there are several problems with these approaches, such as similarity (in terms of semantics), scalability, and ambiguity. The similarity problem arises when two words with the same meaning are represented in two different ways; e.g., the words 'biscuit' and 'cookie' often have the same meaning, but their representations differ if the bag of words model is used. Similarly, the bag of words model represents each unique word by its own vector, which causes the number of unique vectors to grow rapidly with the vocabulary; thus, it is not scalable. The final and most important problem is ambiguity, which is caused by ignoring the order of words; e.g., "The food is ready" and "Is the food ready" are represented in the same way because they contain the same unique words. The above-mentioned problems can be addressed using deep-learning-based text representation methods, such as word embedding. Word embedding considers the semantics of words, so different words with the same meaning are represented by the same or similar vectors. In word embedding, each word is mapped to an N-dimensional vector. Table 13 shows some examples of the most popular text representations across the years.
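The ambiguity problem can be demonstrated directly: because the bag of words model discards word order, the statement and the question in this small (purely illustrative) example receive identical vectors.

```python
# Demonstration of the word-order (ambiguity) problem in the bag of words model.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The food is ready", "Is the food ready"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # ['food' 'is' 'ready' 'the']
print(X.toarray())                          # both rows are [1 1 1 1]: the two texts are indistinguishable
```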
Recently, deep learning models have been used to achieve very good results on various text classification problems [27,30]. Deep learning models, such as ANN, CNN, RNN, and LSTM, seem to achieve better accuracy than classical machine learning algorithms on multiple NLP subproblems, such as part-of-speech (POS) tagging [71]. There are many advantages to using deep learning over classical machine learning, such as better handling of noisy data, higher accuracy, and better identification of the relationships between the input and output features. The drawbacks of deep-learning-based classification are overfitting and time consumption. Word2vec [72] is one of the most popular text representation methods; it predicts the probability of a word distribution based on neighboring words. It consists of two architectures known as the continuous bag of words (CBOW) and the skip-gram model. CBOW predicts the center word based on its neighboring words, whereas skip-gram predicts the neighboring words based on the center word. Figure 18 shows the architecture of the word2vec model. Equation (1) displays how the probability of a word is estimated based on its neighboring words, and Equation (2) shows the representation of word2vec.
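A minimal sketch of training both word2vec architectures with the gensim library (assuming gensim 4.x) is given below; the toy corpus and hyperparameters are illustrative and far too small for meaningful embeddings.

```python
# Sketch of CBOW vs. skip-gram training with gensim; corpus and parameters are illustrative.
from gensim.models import Word2Vec

sentences = [["the", "food", "is", "ready"],
             ["the", "biscuit", "tastes", "good"],
             ["the", "cookie", "tastes", "good"]]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)       # CBOW: context -> center word
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)   # skip-gram: center word -> context

print(skipgram.wv["cookie"][:5])                    # first 5 dimensions of the 50-dimensional embedding
print(skipgram.wv.similarity("cookie", "biscuit"))  # cosine similarity between the two word vectors
```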
Doc2vec [73] is an extension of word2vec in which semantic relationships can be captured across larger units of text (paragraphs, documents). Each document is represented as a unique vector. Equation (3) shows the representation of Doc2Vec, and Figure 19 displays the architecture of the Doc2Vec model.
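An analogous sketch for Doc2Vec with gensim (again with an illustrative toy corpus) is shown below; each training document receives its own vector, and vectors for unseen documents are obtained by inference.

```python
# Sketch of Doc2Vec training and inference with gensim; corpus and parameters are illustrative.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [["the", "food", "is", "ready"],
        ["the", "cookie", "tastes", "good"],
        ["delivery", "was", "fast", "and", "cheap"]]
tagged = [TaggedDocument(words, [i]) for i, words in enumerate(docs)]

model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)
print(model.dv[0][:5])                                               # vector of the first training document
print(model.infer_vector(["the", "biscuit", "tastes", "good"])[:5])  # vector inferred for an unseen document
```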

Transformers
The RNN model outperforms other machine learning models in extracting relationships across sequential text; however, the computational cost of running an RNN is very high. This problem can be overcome by using parallel processing. Transformers use pre-trained models and process the words of a sequence in parallel, thus reducing the computational complexity. Deep learning methods provide various ways to overcome the problems of machine learning models. For example, in [75], the inverse document frequency was used to prevent the semantic problem. Creating multiple vectors, one for each unique meaning of a word, is another solution, as proposed by [76]. A bidirectional language model is used to adapt pre-trained models and their knowledge to classify multi-meaning input texts, as proposed by [77]. Online texts have been tagged using fuzzy logic in [78]. Deep-learning-based models involve the development of complex nested and deep architectures, which naturally increases the computation time. In [79], novel work was proposed to decrease the execution time. Deep learning is also used for other problems, such as hierarchical classification, big data classification [80], and malware analysis [81].

Problems in Text Classification
This section explains some of the common problems faced by researchers in the field of text classification, along with a few solutions proposed by the papers in the survey. Table 14 presents the most frequently used considerations for improving the accuracy of text classification using machine learning algorithms.

Table 14. Considerations for improving the accuracy of text classification using traditional machine learning.

Increased Accuracy
Traditional machine learning algorithms can be used for text classification. Although they achieve good accuracy, there is still much room for improvement. In this subsection, we present a few research works on improving accuracy.
In [82], a few improvements in the pre-processing stage are identified by considering the frequency of features, the initial letter, paragraphs, question marks, and full stops. The frequency of features can be assessed using term weighting or embedding. Reference [57] shows that term weighting, which iterates over each word, has an advantage over embedding techniques. However, some deep learning models, such as that presented in [83], improve the accuracy of classification by using embedding. Some optimizations can be done to avoid considering all features; instead, the classifier can jump from one location to another [84].
Many works concentrate on integrating deep learning and machine learning models to increase the accuracy of text classification. Article [48] explains the use of NLP to increase the accuracy. In [54], resampling is performed along with NLP to increase the accuracy. When the number of features is large, feature fusion can be used to combine multiple features into one or more prominent features; feature fusion can improve classification performance, as stated by [85]. As each machine learning model has its advantages and disadvantages, an ensemble approach can be implemented to increase the accuracy of a classification task [33,85,86]. Feedback systems can also improve performance [87]. Sometimes, the accuracy of the classifier can be increased in external ways; for example, domain experts can be involved in providing feedback during the training process [88]. A few studies have shown [89] that the selection of subfeatures can also have a positive impact on the classification.

Feature Selection
The statistical process of reducing the number of input features in the classification is known as feature selection. Many research papers have shown that feature selection can increase the performance of a classification [90]. Many criteria [70] are used to determine whether a feature should be selected for training, such as measuring its significance for the class, finding the overlaps between classes, and identifying unwanted features. An interesting paper [91] focused on selecting highly discriminative features (those which are present in only one class) in the training set. Moreover, the integration of multiple feature selection techniques can produce decent results [92].
Many research works have concentrated on the addition of new features by, for example, considering missing features [60], side information [93,94], assigning weights [95], using semantic relationships [96,97], and utilizing the structure of documents [67,98]. Performing subsampling [99] can also benefit the classification performance.
One of the problems with feature selection is redundancy; that is, the same feature may occur more than once in different forms. Although stemming can eliminate a few redundant features, it is difficult to remove all of them, because there may be duplicates in the form of synonyms. Topic modeling is one of the methods proposed by [51] to remove duplicates. Stemming in the English language may be very easy; however, other languages, such as Arabic, require custom stemming algorithms [80]. Heuristic optimization methods have also been applied to improve feature selection [100]. Other methods, such as redundant feature mapping [101] and word co-occurrences [102], can also help to improve classification performance.

Feature Drift
For some kinds of classification, such as spam and review classification, there is no constant list of features that determines the class of a sample. Features change over time; for example, a feature that currently determines the class of a sample may no longer do so in the near future. Thus, to obtain good performance over the longer term, feature drift should be considered. Incremental learning, as discussed in [103], is one of the methods that can address feature drift. The information value of each feature over time should also be determined, so that it is easy to find out which features are important over a given period [104].
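One way to realize incremental learning [103] is sketched below under illustrative assumptions: a vocabulary-free hashing vectorizer is paired with a classifier that supports partial fitting, so the model can be updated batch by batch as new text, and therefore new features, arrive over time.

```python
# Sketch of incremental (online) learning for feature drift; the text stream is invented.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18)   # no fixed vocabulary, so new words never break the model
clf = SGDClassifier(loss="log_loss")

# Illustrative stream of (texts, labels) batches arriving over time
batches = [(["win a free prize now", "meeting at noon"], [1, 0]),
           (["claim your crypto reward", "lunch rescheduled"], [1, 0])]

for texts, labels in batches:
    X = vectorizer.transform(texts)
    clf.partial_fit(X, labels, classes=[0, 1])     # update the model without retraining from scratch

print(clf.predict(vectorizer.transform(["free crypto reward"])))
```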

Representation of Features
There are many ways in which features can be represented, such as the bag of words and semantic models. Choosing the representation best suited to the problem must be addressed before performing the classification process. With the bag of words model, the order of words is lost. A few research works have optimized the bag of words model using methods such as including semantic information [105] and integrating fuzzy concepts [41]. Representing features in the form of word2vec allows more relationships to be detected among them [106]. Different weighting schemes can be embedded to obtain good results, such as merging TF-IDF and word2vec [107]. When the number of features is large, some works [108] show good performance by representing the relationships among features instead of considering all features independently.
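One possible way to merge TF-IDF with word2vec [107] is sketched below under illustrative assumptions: each document is represented as the IDF-weighted average of its word embeddings. This is only one of several weighting schemes reported in the literature.

```python
# Sketch of IDF-weighted averaging of word2vec embeddings; corpus and parameters are illustrative.
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the food is ready", "the cookie tastes good", "the biscuit tastes good"]
tokens = [d.split() for d in docs]

w2v = Word2Vec(tokens, vector_size=50, min_count=1)
tfidf = TfidfVectorizer().fit(docs)
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def doc_vector(words):
    # IDF-weighted mean of the word vectors present in the embedding vocabulary
    vecs = [w2v.wv[w] * idf.get(w, 1.0) for w in words if w in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.vstack([doc_vector(t) for t in tokens])
print(X.shape)   # (3, 50): one 50-dimensional vector per document
```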

Overfitting
Overfitting occurs when a model fits the training data too closely. A high level of accuracy during training and a low level of accuracy during testing indicate that the model is overfitted. An optimal number of samples should be selected for training and testing purposes. Reference [109] introduced a new metric known as the Rate of Overfitting (RO), which is used to determine the numbers of samples in the training and testing sets needed to avoid overfitting.

Short Text
Short text is one of the major difficulties in text classification, because it provides only limited features. Since the number of features is small, the classifier struggles to learn all of the relationships among classes. Reference [110] used an end-to-end learning hybrid network with multiple timescales. Other methods, such as character encoding [44], feature expansion [111], and rich feature generation [112], can also improve the performance of a classifier.
Sometimes, adding extra features from external sources (such as Wikipedia [113]) or additional datasets can increase the vocabulary size. To add extra features, topic modeling can be used to determine the classes of the extra features [66]. Transfer learning is one of the proven methods for handling short text: a model that has already been trained on one dataset can be reused for short-text datasets. Reference [114] showed an improvement in short text classification by using a transfer learning methodology.

Imbalanced Data
When there is an unequal distribution of data samples among the classes, the classifier learns very little about the minority class. Imbalanced data are therefore one of the challenges for a machine learning model. In a previous study [115], the authors used sampling techniques, such as SMOTE-ENN, to overcome imbalanced data. The authors of [56] discussed multi-task learning, which can address the imbalanced data problem. Assigning weights to members of the minority class [116] can also give good classification performance. Resampling and instance weighting are other methods proposed by [117] to handle imbalanced datasets.
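A sketch of resampling an imbalanced text dataset with SMOTE-ENN using the imbalanced-learn library is given below; the corpus and class ratio are invented for illustration.

```python
# Sketch of SMOTE-ENN resampling on an invented, heavily imbalanced review corpus.
from collections import Counter
from imblearn.combine import SMOTEENN
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = (["the product works great and arrived on time"] * 30 +
          ["terrible quality, broke instantly", "awful, do not buy",
           "stopped working after a day", "very disappointed with this",
           "worst purchase i have made", "completely useless item"])
labels = [0] * 30 + [1] * 6                      # heavily skewed toward class 0

X = TfidfVectorizer().fit_transform(corpus)
X_res, y_res = SMOTEENN(random_state=42).fit_resample(X, labels)

print(Counter(labels), "->", Counter(y_res))     # class counts before and after resampling
```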

Misclassification
Misclassification means that a classifier wrongly classifies a sample. A high false positive or false negative rate means that the classifier is misclassifying samples. A few optimization techniques are applied on the evaluation side [118], such as combining precision with F1 to identify misclassification. In [119], deep learning models and machine learning models were fused, and this fusion reduces the rate of misclassification. Making use of a virtual category can also prevent a classifier from misclassifying, as described in [120]. A blocking mechanism was implemented in the research done by [121], whereby the classification is done iteratively, and at each iteration, the misclassified samples are prevented from propagating to the next iteration.
Through the use of clustering and classification, most of the misclassified samples can be avoided by grouping positive samples with positive classes and negative samples with negative classes [122].
Identifying patterns in the pre-processing step can significantly prevent samples from being wrongly classified. Regular expressions are used to extract the patterns and help the classifier to reduce misclassification [123].

Lack of Labeled Data
Labeled data are the best way to train a classifier. However, it is difficult to find a perfectly labeled dataset. Manual labeling is done by a domain expert, and this requires significant time and cost. Consequently, machine learning has been used to replace this manual work. Much research targets the automatic labeling of data using various methodologies, such as active learning [124], detailed pre-processing [125], and interactive visualization [126].
When labeled data are not available, supervised classification becomes difficult. Reference [55] shows how semi-supervised classification can be done with good accuracy. In most cases, an unsupervised algorithm, such as clustering [127], can be combined with a classifier to achieve greater accuracy [59].
Topic modeling is an efficient method to tackle the labeling issue. Topic modeling algorithms, such as LDA, are efficient for labeling an unlabeled dataset [128].
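A small illustrative sketch of this idea with scikit-learn's LDA implementation is shown below: the dominant topic of each document is used as a provisional (pseudo) label. The documents and number of topics are assumptions made for demonstration.

```python
# Sketch of using LDA topics as provisional labels for an unlabeled corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the match ended in a late goal",
        "the striker scored twice in the final",
        "the new phone has a faster processor",
        "battery life on this laptop is excellent"]

X = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_distribution = lda.fit_transform(X)

pseudo_labels = topic_distribution.argmax(axis=1)   # dominant topic used as a provisional class label
print(pseudo_labels)
```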

High Dimensional Data
High dimensional data not only reduce the speed of the classifier but also badly degrade its performance. A few optimization techniques, such as the light-weighting protocol [129], semantic concept extraction [130], and ensembling features [131], are used to reduce the number of input dimensions. Feature selection is most often used to reduce the number of dimensions. Reference [132] used a filter-based feature selection method to reduce the number of features.
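A minimal sketch of a filter-based feature selection step is shown below: a chi-squared filter keeps only the features most associated with the class labels, which reduces dimensionality before the classifier is trained. The dataset and the choice of k are illustrative.

```python
# Sketch of filter-based feature selection with a chi-squared filter (illustrative setup).
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
X = TfidfVectorizer(stop_words="english").fit_transform(data.data)

print("features before selection:", X.shape[1])
X_reduced = SelectKBest(chi2, k=1000).fit_transform(X, data.target)
print("features after selection:", X_reduced.shape[1])
```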
Clustering can also be used before the classification begins, as mentioned in [133]. A good weighting scheme can be added to the clustering or classification step to further increase the performance [134].

Long Text
The classification of long text presents many difficulties, such as redundancy and a mix of unwanted content. To solve this problem, the authors of [135] used feature fusion and identified the most important or related information in the text corpus. All of the previously mentioned problems and solutions are summarized in Figure 21.

Discussion
In this section, we discuss our thoughts on future directions in the field of text classification. We list our observations as follows:

• A good correlation factor can be found between a pre-classified dataset and a classified dataset. This can enable a transfer learning approach that can easily classify an unmapped instance;
• Efficient feature extraction that incorporates textual algorithms (such as sentiment analysis and NLP) can focus on finding important terms (e.g., smileys in social text classification);
• Implementing a GAN model for generating dummy text can convert a short-text input into a normal-sized input.

Research Gaps
We have identified a few gaps in our review. These are listed below.
• Many research studies have focused on self-generated datasets. However, many datasets exist for a given domain, and these datasets differ in terms of the format and structure of the data. Thus, a multi-model classification should be developed to address this issue;
• A publicly available database that contains the federated-based classifiers of the top datasets should be created. This would significantly help future researchers to develop high-quality and fast outcomes and would also enable researchers to compare their local results with community results;
• The use of active learning can improve the performance of classification by using only a few labeled inputs. The majority of the papers skip the use of active learning; thus, future research works can focus on including active learning in the classification;
• Text representation still requires improvement. Research works should focus on labeling or segmenting the features; for example, a pronoun may refer to a noun in the previous sentence, so a good labeling scheme should be developed;
• The majority of the research included in this survey focused on ranking features based on their frequencies, with highly frequent features ranked the highest. However, this may not generalize. This limitation can be overcome with the help of domain experts, who can rank less frequent words based on their importance.

Recommendations
We present recommendations for improving the performance of text classification methods, as follows:
• Increased Accuracy: Embedding methods can be improved by incorporating graph-based embedding approaches [83]. The SVM classifier can significantly increase the accuracy through improvements to its kernels [136];
• Feature Selection: A future study has been proposed by [90] to analyze the feature space of non-English corpora. The extraction of features can also be done at the word level or the sentence level [99];
• Misclassification: To reduce the rate of misclassification, fusion models [119] can be improved by assigning different weights to each model and by using recent fusion models, such as hierarchical deep genetic networks and transfer-learning-based deep models. Creating a hybrid classification model that mixes both instance selection and feature selection could also be explored in the future [137];
• Feature Drift: Future work will determine whether a feature will remain important in the upcoming period [103], thus allowing very old features to be removed to improve the classification;
• Long Text: To improve the speed of classification of long-text information, the authors of [135] proposed the use of parallel computing;
• Lack of Labeled Data: Implementing a machine learning model for labeling multi-class samples can be considered a good future direction [124]. The pre-processing stage, which is done before the use of any machine learning model, can also be optimized to extract hidden labels [125].

Strengths and Weaknesses
We employed the PRISMA approach in this review and attempted to find as many suitable studies as possible. Through active conversations, we widened the search terms and databases and resolved any conflicts. Despite our intention to give our analysis an international component, we elected to limit our search to two databases recognized for their quality and commitment to research in order to ensure the rigor and quality of the papers included in our assessment. We prioritized the quality of the articles chosen over the scope of the study, although this resulted in a selection of just 224 research publications. Given this result, one wonders whether, by integrating additional databases, more research from a larger range of nations could have been included.

Conclusions
Text classification is a foundation for many popular research areas, such as sentiment analysis, web searching and summarizing, and spam detection. This paper comprehensively reviews articles on text classification. We selected papers from six publishers produced between 2003 and 2022 and presented an analysis of six aspects: dataset frequency, machine learning model frequency, best performance on each dataset, evaluation metric frequency, train-test splitting frequency, and a comparison among machine learning models. In this survey, we investigated 224 papers and conducted a comprehensive comparison. We found that SVM (59%), NB (46%), and kNN (33%) are the most commonly used machine learning models in the field of text classification. Additionally, 10-Fold cross-validation is the most commonly used method for validating the learning process of a classifier (22.50%), and accuracy is the most frequently used metric for measuring the performance of a machine learning model (28%). SVM seems to perform better in many scenarios, while DT gives the worst results most of the time. Furthermore, we presented a summary of how machine learning is used to tackle various problems in the domain and also provided possible future directions for text classification. This systematic review serves as groundwork for researchers in the field of machine-learning-based text classification to further extend and optimize the models.

Funding: This research received no external funding.

Figure 1 .
Figure 1. The criteria used for selecting papers in this survey.

Figure 2 .
Figure 2. Number of papers surveyed in each year.

Figure 3 .
Figure 3. Frequency of dataset use in the literature.

Figure 4 .
Figure 4. Best accuracy level for each dataset.

Figure 5 .
Figure 5. Frequency of machine learning model use in the literature.

Figure 6 .
Figure 6. Frequency of Train-Test splits used by various papers.

Figure 7 .
Figure 7. Accuracy differences of SVM compared with other models. The main advantages of SVM can be summarized as follows: (a) It is more accurate than other classifiers; (b) It works well with nonlinear distributions; (c) The chances of overfitting are very low. The limitations of SVM are: (a) Choosing the correct kernel is challenging; (b) It has a long training time; (c) It occupies more memory.

Figure 8 .
Figure 8. Accuracy differences between NB and other models.

Figure 9 .
Figure 9. Accuracy differences between kNN and other models. The main advantage of kNN is the simplicity of its implementation, and the limitations of kNN are: (a) It needs more storage space; (b) It is highly sensitive to errors.

Figure 10 .
Figure 10. Accuracy differences between the DT and other models.

Figure 11 .
Figure 11. Accuracy differences between RF and other models.

Figure 12 .
Figure 12. Accuracy differences between LR and other models.

Figure 13 .
Figure 13. Summary of machine learning classifier performances achieved on various datasets.
Figures 14-17 show an accuracy comparison of ANN, CNN, RNN, and LSTM with other models. Word embedding is the most fundamental step in deep-learning-based text classification.

Figure 14 .
Figure 14. Accuracy differences between ANN and other models.

Figure 15 .
Figure 15. Accuracy differences between CNN and other models.

Figure 16 .
Figure 16. Accuracy differences between RNN and other models.

Figure 17 .
Figure 17. Accuracy differences between LSTM and other models.

Table 13 .
Popular Text Representation Methods.

Figure 19 .
Figure 19. The Doc2Vec Model.

FastText

FastText [74] improves the performance of the word2vec model by considering subwords. The final vector of a word w is the sum of the vectors of all of the subwords of w. FastText can create a new vector when the target word is not present in the training set. The architecture of FastText is shown in Figure 20.

Figure 21 .
Figure 21. Summary of the main problems and solutions considered for improving text classification.

Table 3 .
Frequency of model use in various papers in this area of literature across various year ranges.

Table 4 .
Maximum accuracy levels obtained from top datasets.

Table 5 .
Various performance metrics used by the survey papers.

Table 6 .
Top 10 combinations of performance metrics used in the survey.

Table 7 displays the top 5 SVM accuracy levels in our survey.

Table 7 .
Top Performance Levels of the SVM Classifier.

Naive Bayes (NB)

NB is a probabilistic classification model based on the Bayes theorem. Ninety-two papers in our survey used NB to perform the classification. There are many variations of NB, such as multinomial NB, Bernoulli NB, and Gaussian NB. For text classification, multinomial NB is widely used. Figure 8 compares NB with other algorithms.
The main advantage of NB is that it does not require much training data [61]. The drawbacks of this algorithm are: (a) It is not suited for small datasets [61]; (b) If the features are not independent, NB is not the best choice for classification.

Table 8 presents the top 5 NB accuracy levels from our survey.

Table 8 .
Top Performances of the NB Classifier.

Table 9 displays the top 5 kNN accuracy levels in our survey.

Table 9 .
Top Performances of the kNN Classifier.

Table 10 .
Top Performances of the DT Classifier.

Table 11 .
Top Performances of the RF Classifier.

Table 12 .
Top Performances of the LR Classifier.