A Comparative Analysis of Machine Learning Techniques for Cyberbullying Detection on Twitter

Abstract: The advent of social media, particularly Twitter, raises many issues due to a misunderstanding regarding the concept of freedom of speech. One of these issues is cyberbullying, which is a critical global issue that affects both individual victims and societies. Many attempts have been introduced in the literature to intervene in, prevent, or mitigate cyberbullying; however, because these attempts rely on the victims’ interactions, they are not practical. Therefore, detection of cyberbullying without the involvement of the victims is necessary. In this study, we attempted to explore this issue by compiling a global dataset of 37,373 unique tweets from Twitter. Moreover, seven machine learning classifiers were used, namely, Logistic Regression (LR), Light Gradient Boosting Machine (LGBM), Stochastic Gradient Descent (SGD), Random Forest (RF), AdaBoost (ADB), Naive Bayes (NB), and Support Vector Machine (SVM). Each of these algorithms was evaluated using accuracy, precision, recall, and F1 score as the performance metrics to determine the classifiers’ recognition rates applied to the global dataset. The experimental results show the superiority of LR, which achieved a median accuracy of around 90.57%. Among the classifiers, logistic regression achieved the best F1 score (0.928), SGD achieved the best precision (0.968), and SVM achieved the best recall (1.00).


Introduction
Due to the significant development of Internet 2.0 technology, social media sites such as Twitter and Facebook have become popular and play a significant role in transforming human life [1,2]. In particular, social media networks have become part of daily activities such as education, business, entertainment, and e-government. According to [3], the number of monthly active social media users worldwide was projected to exceed 3.02 billion by 2021. This number will account for approximately one-third of the Earth's population. Moreover, among the numerous existing social networks, Twitter is a critical platform and a vital data source for researchers. Twitter is a popular public microblogging network operating in real-time, in which news often appears before it appears in official sources. Characterized by its short message limit (now 280 characters) and unfiltered feed, Twitter use has rapidly increased, with an average of 500 million tweets posted daily, particularly during events [3]. Currently, social media is an integral element of daily life. Undoubtedly, however, young people's usage of technology, including social media, may expose them to many behavioral and psychological risks. One of these risks is cyberbullying, which is an influential social attack occurring on social media platforms. In addition, cyberbullying has been
(1) We conducted an extensive review of quality papers to determine the machine learning (ML) methods widely used in the detection of cyberbullying in social media (SM) platforms.
(2) We evaluated the classifiers investigated in this work and tested their usability and accuracy on a sizeable generic dataset.
(3) We developed an automated detection model by incorporating feature extraction in the classifiers to enhance the classifiers' efficiency on the sizeable generic dataset.
(4) We compared the performance of seven ML classifiers that are commonly used in the detection of cyberbullying.
We also used the Term Frequency-Inverse Document Frequency (TF-IDF) and Word2Vec models for feature extraction. This comparative analysis helped to clarify the limitations and advantages of ML in text classification models.
Accordingly, we formulated and aimed to answer the following research questions in this work:
(1) What types of existing machine learning techniques/methods are being used extensively to detect cyberbullying in social media platforms?
(2) How can an automatic cyberbullying detection model be developed with high accuracy and less processing time?
(3) How can feature extraction be used to enhance the detection process?
The proposed approach detects cyberbullying by extracting tweets, classifying the tweets using text analysis techniques based on predefined keywords, and then classifying the tweets as offensive or non-offensive. Therefore, the outcomes of the current evaluation will help other researchers to choose a suitable and sufficient classifier for the datasets of global cyberbullying tweets collected from [12,13], because improvements are necessary to further increase the classification accuracy.
This paper is structured as follows. Section 2 is dedicated to the background and related work wherein the examined classifiers will be described. Section 3 provides an overview of the methodology adopted for the proposed research and a description of the dataset utilized for the experiment. Results are discussed in Section 4, and conclusions and future work are provided in Section 5.

Background and Related Work
For several years, researchers have worked intensively on cyberbullying detection to find a way to control or reduce cyberbullying on social media platforms. Cyberbullying is troubling, as victims cannot cope with the emotional burden of violent, intimidating, degrading, and hostile messages. To reduce its harmful effects, the cyberbullying phenomenon needs to be studied in terms of detection, prevention, and mitigation.
Presently, there is a range of global initiatives aimed at preventing cyberbullying and improving the safety of internet users, including children [14,15]. In the literature, there are many studies on preventing cyberbullying through what are called intervention and prevention approaches. Such approaches originate from the psychology and education fields. However, these approaches are globally rare. Besides, cyberbullying victims often refuse to speak with a parent [16], teacher [17], or other adults [18]. They spend much time online [19], tend to seek anonymous help [20], and post requests for information and assistance on the Internet [21]. Therefore, an effective way of delivering cyberbullying solutions is through the Internet. Web-based approaches can also be used whenever and wherever the patient prefers [22]. For instance, the University of Turku, Finland, has established an anti-cyberbullying program called Kiva [9]; other examples are an Anti-Harassment campaign in France [10] and an anti-cyberbullying initiative by the Belgian government [11].
Ideally, these prevention and intervention approaches should: (1) increase awareness of potential cyberbullying threats through individualized intensive intervention strategies based on the victims' needs [23][24][25][26]; (2) provide health education and teach emotional self-management skills [27]; (3) increase awareness of victims in both reactive measures (e.g., deleting, blocking and ignoring messages), and preventive measures (e.g., increased awareness and security) [28]; (4) provide practical strategies and resources that allow victims to cope with experienced stress and negative emotions [28]; (5) aim to reduce traditional bullying as well [29], since victims are often involved in both forms of bullying [30][31][32]; and (6) include empathy training, Internet labelling and healthy Internet behavior [33,34]. Thus far, there has been difficulty in preventing cyberbullying. Most parents and teachers rely on children's awareness of the causes and impacts of cyberbullying. Some parents think that peer-mentoring is an effective way to prevent cyberbullying, particularly in the teenage years, when peers have a more significant impact than the family and school. Therefore, more specific approaches or online resources need to be developed to help the victims [24]. For example, Stauffer et al. [35] provided a prevention caveat stating that bullying prevention programs produce minimal change in student behavior [25].
Similarly, the authors in [36] suggest that schools should take the following measures in formulating their cyberbullying prevention program: (1) define cyberbullying; (2) have strong policies in place; (3) train staff, students, and parents on the policy and on how to identify cyberbullying when they see it; and (4) use internet filtering technologies to ensure compliance. Past research has indicated that social reinforcement may be a dominant protective factor in mitigating the adverse effects of cyberbullying [37,38]. To get the reinforcement required to minimize the adverse effects of cyberbullying, victims must seek help. However, some reports show that cyberbullying victims are unable to report bullying cases and prefer to remain silent [6,39]. Some teenagers rarely seek assistance from their teachers or school advisors [40,41].
Given the above limitations of prevention approaches, there is a clear need to detect and filter cyberbullying on social media. Thus, this section is dedicated to inspecting cyberbullying detection techniques. As per the literature review, there are two main directions in detecting cyberbullying: Natural Language Processing and Machine Learning, as explained in the following sub-sections.

Natural Language Processing (NLP) in Cyberbullying Detection
One direction in this field is to detect offensive content using Natural Language Processing (NLP). The most explanatory method for presenting what happens within a Natural Language Processing system is the "levels of language" approach [42]. These levels are used by people to extract meaning from text or spoken language. This levelling reflects the fact that language processing relies mainly on formal models or representations of knowledge related to these levels [42,43]. Moreover, language processing applications distinguish themselves from data processing systems by using knowledge of the language. The analysis of natural language processing has the following levels:
Dinakar et al. [44], for example, used a common-sense knowledge base with associated reasoning techniques. Kontostathis et al. [45] recognized cyberbullying content based on Formspring.me data, using query words used in cyberbullying cases. Xu et al. [46] use several natural language processing methods to detect signs of bullying (a new term relating to online references that could be bullying instances themselves or online references relating to off-line bullying cases). They use sentiment analysis features to identify bullying roles and Latent Dirichlet Allocation to identify subjects/themes. The authors in [46] intended to lay the basis for several tasks related to identifying bullying and called on other researchers to enhance these specific techniques. Yin et al. [47], Reynolds et al. [48], and Dinakar et al. [44] were among the earliest researchers working on NLP-based cyberbullying detection; they investigated the predictive strength of n-grams, part-of-speech information (e.g., first- and second-person pronouns), and sentiment information based on profanity lexicons for this task (with and without TF-IDF weighting). Similar features were also used for detecting events related to cyberbullying and fine-grained categories of text in [49].
To conclude, some of the common word representation techniques that have been used and shown to improve classification accuracy [50] are Term Frequency (TF) [51], Term Frequency-Inverse Document Frequency (TF-IDF) [52], Global Vectors for Word Representation (GloVe) [53], and Word2Vec [54]. One of the main limitations of NLP is the lack of contextual expert knowledge. For instance, there are many dubious claims about the detection of sarcasm, but how would one detect sarcasm in a short post like "Great game!" written in response to a defeat? Therefore, it is not about linguistics; it is about possessing knowledge relevant to the conversation.

Machine Learning in Cyberbullying Detection
Machine learning-based detection built on cyberbullying keywords is another direction of cyberbullying detection, which has been used widely by several researchers. Machine learning (ML) is a branch of artificial intelligence that gives systems the capability to learn and improve automatically from experience without being explicitly programmed; its algorithms are often categorized as supervised, semi-supervised, or unsupervised [55]. Supervised algorithms use a set of training instances to build a model that generates the desired prediction (i.e., based on annotated/labeled data). In contrast, unsupervised algorithms do not rely on labeled data and are mainly utilized for clustering problems [55,56].
Raisi and Huang [57] proposed a model for identifying offensive comments on social networks through filtering or informing those involved. They have used comments with offensive words from Twitter and Ask.fm to train this model. Other authors [58,59] built communication systems based on smart agents that provide supportive emotional input to victims suffering from cyberbullying. Reynolds [48] suggested a method for detecting cyberbullying in the social network "Formspring," focused on detecting aggressive trends in user messages, by analyzing offensive words; moreover, it uses a rating level of the threat identified. Similarly, J48 decision trees obtained an accuracy of 81.7%.
Authors in [60] describe an online application implementation for school staff and parents in Japan, with a duty to detect inadequate content on non-official secondary websites. The goal is to report cyberbullying cases to federal authorities; they used SVMs in this work and obtained 79.9% accuracy. Rybnicek [61] has proposed a Facebook framework to protect underage users from cyberbullying and sex-teasing. The system seeks to evaluate the content of photographs and videos and the user's actions to monitor behavioral changes. A list of offensive words was made in [62] using 3915 posted messages monitored from the Formspring.me web site. The accuracy obtained in this study was only 58.5% [62].
Another study [47] suggests a method for identifying and classifying cyberbullying acts as harassment, flaming, terrorism, and racism. The author uses a fuzzy classification rule; therefore, the results are inferior in terms of accuracy (around 40%), but using a set of rules, improved the classifier efficiency by up to 90%.
In [63], authors have developed a cyberbullying detection model based on Sentiment analysis in Hindi-English code-mixed language. The authors carried out their experiments based on Instagram and YouTube platforms. The authors use a hybrid model based on top performers of eight baseline classifiers, which perform better with an accuracy of 80.26% and an f1-score of 82.96%.
Galán-García et al. [64] applied supervised machine learning to a real case of cyberbullying detection on Twitter. The study uses two different feature extraction techniques with various machine learning algorithms, and the Sequential Minimal Optimization (SMO) classifier obtained the highest accuracy (68.47%) among them. In [65], the authors proposed a cyberbullying detection approach for the Instagram social network. The experiments were based on analysis of image contents and users' comments. The results show that using multiple features can improve the classification accuracy of linear SVM, where the accuracy of SVM rose from 0.72 to 0.78 when image categories were used as an additional feature. Nahar et al. [66] propose creating a weighted directed graph model for cyberbullying that can be used to calculate each user's predator and victim scores, while using a weighted TF-IDF scheme with textual features (second-person pronouns and foul words) to improve online bullying detection.
Salminen et al. [67] suggest a hate content detection approach for multiple social media networks. The authors use a total of 197,566 comments from four platforms: YouTube, Reddit, Wikipedia, and Twitter, with 80% of the comments labelled non-hateful and the remaining 20% labelled hateful. The experiments were conducted using several machine learning algorithms, testing each feature separately to evaluate its accuracy for feature selection. In addition to machine learning classifiers, Dadvar et al. [68] suggested a strategy combining three types of features: roles typical to cyberbullying, content-based, and user-based. The results showed better performance with the combined use of all features. Van Hee et al. [69] developed a corpus of Dutch social media messages and annotated it with different categories of cyberbullying, such as threats and insults. The authors also added comprehensive details on the participants involved in bullying (victim, cyber predator, and bystander identification). Zhao et al. [70] extended a list of insult words to create bullying features based on word embeddings and obtained an f-measure of 0.78 with an SVM classifier. In addition, the novel features were derived from a dictionary of standard terms used by neurotics in social networks. The authors in [71] used a neural network based on the Word2Vec embedding model to represent textual health data with semantic context. Moreover, unique domain ontologies are incorporated into the Word2Vec model. These ontologies provide additional details to a neural network model that recognizes the semantic sense of uncommon words. New semantic information utilizing the Bi-LSTM model is employed to precisely distinguish unstructured and structured health data. A different work used the C4.5 decision tree classifier with the TF-IDF weighting method to detect and classify hoax news on Twitter. N-grams were also utilized to extract features for the suggested C4.5 classifier [72].
In [73], authors have suggested a novel model that incorporates the most relevant documents, reviews, and tweets from social media and news articles. In addition, they integrated a topic2vec with Word2Vec and created a word embedding model representing each word in a document with a semantic meaning and a low-dimensional vector. The authors also used ML to classify the data using the models as mentioned earlier. Table 1 summarizes and shows the comparison results of the related studies.
As cyberbullying detection is considered a classification problem (i.e., categorizing an instance as offensive or non-offensive), several supervised learning algorithms have been employed in this study to further evaluate their classification accuracy and performance in detecting cyberbullying in SM, in particular on Twitter. The classifiers adopted in the current study are as follows:

Logistic Regression
Logistic regression is one of the well-known techniques that machine learning has adopted from the field of statistics [74]. Logistic regression is an algorithm that constructs a separating hyperplane between two classes of data using the logistic function [75]. The logistic regression algorithm takes features (inputs) and produces a prediction according to the probability that the input belongs to a class. For instance, if the probability is ≥ 0.5, the instance is classified as the positive class; otherwise, it is assigned to the other class (negative class) [76], as given in Equation (1). In [77][78][79][80][81], logistic regression was used in the implementation of predictive cyberbullying models.
$$ y = \begin{cases} 1 \ (\text{positive class}), & \text{if } h_\theta(x) \ge 0.5 \\ 0 \ (\text{negative class}), & \text{if } h_\theta(x) < 0.5 \end{cases} \tag{1} $$
As stated in [82], LR works well for the binary classification problem and functions better as data size increases. LR iteratively updates the set of parameters and attempts to minimize the error function [82].
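The decision rule above can be illustrated with a minimal Python sketch. It assumes that h_θ(x) is the logistic (sigmoid) function applied to the dot product of a weight vector and the features; the function names and example weights are hypothetical and are not taken from the study's implementation.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps a real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(theta: list[float], x: list[float]) -> float:
    """h_theta(x): probability of the positive class for feature vector x."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def predict_label(theta: list[float], x: list[float], threshold: float = 0.5) -> int:
    """Return 1 (positive class) if h_theta(x) >= threshold, otherwise 0."""
    return 1 if predict_proba(theta, x) >= threshold else 0


# Hypothetical weights and features, for illustration only.
label = predict_label([1.0, -2.0], [1.0, 1.0])
```

The 0.5 threshold is exposed as a parameter here only to make the rule explicit; the text uses 0.5 throughout.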

Light Gradient Boosting Machine
LightGBM is a powerful boosting algorithm in machine learning; it is a gradient boosting framework that uses a tree-based learning algorithm [83]. It has been reported to perform better than XGBoost and CatBoost [84]. Gradient-based One-Side Sampling (GOSS) is used in LightGBM to select the observations used to compute the split. The primary advantage of LightGBM is its modification of the training algorithm, which significantly speeds up the process [85] and in many cases leads to a more efficient model [85,86]. LightGBM has been used in many classification fields, such as online behavior detection [87] and anomaly detection in big accounting data [88].
However, LightGBM was not commonly used in the area of cyberbullying detection. Thus, in this study, we attempt to explore LightGBM in cyberbullying detection to evaluate its classification accuracy.

Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is an optimization algorithm used to find the parameter values (coefficients) of a function (f) that minimize a cost function [89]. In contrast to batch gradient descent, which computes the gradient over the whole training set before each update, SGD performs a parameter update for each training example x^(i) and label y^(i), as given in Equation (2).
$$ \theta = \theta - \eta \cdot \nabla_\theta J\left(\theta; x^{(i)}, y^{(i)}\right) \tag{2} $$
Here, η is the learning rate and J is the cost function.
SGD was used in building cyberbullying prediction models in social networking platforms in [90][91][92]. The authors in [82] claim that SGD performs faster than NB and LR, but that its error is not as low as that of LR.
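As an illustration of the per-example update, the following sketch trains a logistic model with SGD in plain Python. The choice of log-loss as the cost function, the learning rate, the number of epochs, and the toy data are all assumptions made for the example; the study's actual configuration is not stated in the text.

```python
import math
import random


def sgd_logistic(X, y, lr=0.1, epochs=200, seed=0):
    """Fit logistic-regression weights with one update per training example."""
    rng = random.Random(seed)
    theta = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a new random order each epoch
        for i in order:
            z = sum(t * xi for t, xi in zip(theta, X[i]))
            h = 1.0 / (1.0 + math.exp(-z))
            # Gradient of the log-loss for a single example is (h - y) * x.
            theta = [t - lr * (h - y[i]) * xi for t, xi in zip(theta, X[i])]
    return theta


# Toy data (hypothetical): first feature is a constant bias term.
X_toy = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
y_toy = [0, 0, 1, 1]
theta_toy = sgd_logistic(X_toy, y_toy)
```

Because each update uses a single example, the cost per step is independent of the dataset size, which is consistent with the speed advantage reported in [82].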

Random Forest
Random Forest (RF) classifier is an ensemble algorithm [93] that fits multiple decision tree classifiers on different sub-samples of the data and uses averaging to enhance predictive accuracy and control overfitting [94]. Ensemble algorithms combine more than one algorithm of the same or different kinds for classifying data [95][96][97][98][99]. RF was commonly used in the literature for the development of cyberbullying prediction models; examples are the studies conducted by [97][98][99]. RF consists of several trees, each built on randomly selected variables of the data. The construction of the RF takes place in the following four simplified steps.
(1) In the training data, N is the number of examples (cases), and M is the number of attributes in the classifier.
(2) Selecting random attributes produces a set of arbitrary decision trees. For each tree, a training set is selected by sampling N times, with replacement, from all N existing instances. The remaining instances are used to estimate the tree's error by predicting their classes.
(3) At each node of each tree, m random variables are chosen on which to base the decision at that node. The best split in the training set is determined using these m attributes. Each tree is grown fully and not pruned, as may be done in the development of a regular tree classifier.
(4) This architecture produces a large number of trees, and these decision trees vote for the most common class. Such procedures are called RFs. RF builds a model consisting of a group of tree-structured classifiers, where each tree votes for the most popular class [93]. The class with the most votes is selected as the output.
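The sampling and voting parts of these steps can be sketched in Python as follows. This is a partial illustration only: growing the individual trees is omitted, sampling with replacement is assumed for the per-tree training set, and all function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement; the rest are out-of-bag."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


def random_feature_subset(n_features, m, rng):
    """Pick m distinct attribute indices to consider at a tree node."""
    return sorted(rng.sample(range(n_features), m))


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)
in_bag, oob = bootstrap_indices(10, rng)      # training and error-estimation sets
features = random_feature_subset(8, 3, rng)   # m = 3 of M = 8 attributes
winner = majority_vote([1, 0, 1, 1, 0])       # votes from five hypothetical trees
```

The out-of-bag indices correspond to the remaining instances that the text uses to estimate each tree's error.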

AdaBoost
Adaptive boosting (AdaBoost) is an ensemble learning method and a prevalent boosting technique that was initially developed to make binary classifiers more effective [100,101]. It uses an iterative approach to learn from the errors of weak classifiers and transform them into strong ones. Each training observation is initially assigned an equal weight. AdaBoost then trains several weak models and assigns higher weights to the observations that were misclassified. As the decision boundaries obtained over several iterations by the weak models are combined, the accuracy on the misclassified observations improves, and thus the accuracy of the overall model is enhanced [102]. An example of an AdaBoost classifier implementation is shown in Figure 1 for a dataset that has two features and two classes: weak learner #2 improves on the mistakes made by weak learner #1, and the accuracy on the misclassified observations is further improved when the two weak classifiers are combined (strong learner). Moreover, AdaBoost has been used in cyberbullying detection by some researchers, such as [103] and [63], as well as in [104], where an accuracy of 76.39% was obtained with AdaBoost utilizing unigrams, comments, profile, and media information as features.
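The reweighting idea can be made concrete with a single boosting round. The sketch below assumes the classic discrete AdaBoost formulation with labels in {-1, +1}; the text does not specify which AdaBoost variant was used, so this is an illustrative assumption, and the function name is hypothetical.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One round of discrete AdaBoost: learner weight and updated sample weights."""
    # Weighted error rate of the current weak learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / sum(weights)
    err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0) and division by zero
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified samples (t * p == -1) are scaled up, correct ones scaled down.
    new_w = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]


# Four equally weighted observations; the weak learner misclassifies the last one.
alpha, new_weights = adaboost_round([0.25] * 4, [1, 1, -1, -1], [1, 1, -1, 1])
```

After this round the single misclassified observation carries half of the total weight, so the next weak learner is pushed to classify it correctly.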

Multinomial Naive Bayes
Multinomial Naive Bayes (Multinomial NB) is widely used for document/text classification problems. In the cyberbullying detection field, NB has been the most commonly used classifier for implementing cyberbullying prediction models, such as in [78] and [64,105,106]. NB classifiers are built by applying Bayes' theorem with an independence assumption among features. This model assumes that the text is generated by a parametric model and makes use of training data to determine Bayes-optimal estimates of the model parameters. With those estimates, it categorizes the test data [107]. NB classifiers can accommodate an arbitrary number of independent continuous or categorical features. Assuming the features are independent, the task of estimating a high-dimensional density is reduced to estimating one-dimensional kernel densities. The NB algorithm is a learning algorithm based on the use of Bayes' theorem with strong (naive) independence assumptions. NB is discussed in detail in [108].
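A compact sketch of a multinomial NB text classifier is given below. It assumes add-one (Laplace) smoothing and pre-tokenized documents; the toy documents, labels, and function names are hypothetical and serve only to illustrate the independence assumption, under which per-word log-likelihoods are simply summed.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log-priors and smoothed per-class word log-likelihoods."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    model = {"log_prior": {}, "log_lik": {}, "vocab": vocab}
    for c in class_docs:
        model["log_prior"][c] = math.log(class_docs[c] / len(labels))
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        model["log_lik"][c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return model


def predict_nb(model, tokens):
    """Pick the class with the highest posterior score; unseen words are skipped."""
    scores = {}
    for c, prior in model["log_prior"].items():
        scores[c] = prior + sum(
            model["log_lik"][c][w] for w in tokens if w in model["vocab"]
        )
    return max(scores, key=scores.get)


# Toy corpus (hypothetical): 1 = offensive, 0 = non-offensive.
docs = [["you", "are", "stupid"], ["stupid", "loser"],
        ["have", "a", "nice", "day"], ["nice", "game"]]
labels = [1, 1, 0, 0]
nb_model = train_multinomial_nb(docs, labels)
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied.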

Support Vector Machine Classifier
Support Vector Machine (SVM) is a supervised machine learning classifier widely utilized in text classification [61]. SVM maps the original feature space into a higher-dimensional space defined by a user-specified kernel and then seeks support vectors that maximize the distance (margin) between two categories. SVM initially approximates a hyperplane separating the two categories. It then selects the samples from both categories that are nearest to the hyperplane, referred to as support vectors [109].
SVM seeks to efficiently distinguish the two categories (e.g., positive and negative). If the dataset is separable by nonlinear boundaries, specific kernels are implemented in the SVM to turn the function space appropriately. Soft margin is utilized to prevent overfitting by giving less weighting to classification errors along the decision boundaries for a dataset that is not easily separable [101]. In this research, we utilize SVM with a linear kernel for the basis function. Figure 2 shows the SVM classifier implementation for a dataset with two features and two categories where all samples for the training are depicted as circles or stars. Support vectors (referred to as stars) are for each of the two categories from the training samples, meaning that they are nearest to the hyperplane among the other training samples. Two results of the training were misclassified because they were on the wrong side of the hyperplane.
SVM was used to construct cyberbullying prediction models in [104] and was found to be effective and efficient. However, the work in [61] reported that the accuracy decreased when the data size increased, suggesting that SVM may not be ideal for dealing with the frequent language ambiguities typical of cyberbullying.
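To make the margin and soft-margin ideas concrete, the sketch below evaluates the standard soft-margin objective of a linear SVM for a given hyperplane. It assumes labels in {-1, +1} and a penalty parameter C; the weights and points are hypothetical, and no training procedure is shown.

```python
def decision_function(w, b, x):
    """Signed score of x relative to the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def soft_margin_objective(w, b, X, y, C=1.0):
    """0.5 * ||w||^2 + C * sum of hinge losses max(0, 1 - y * f(x))."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    slack = sum(
        max(0.0, 1.0 - yi * decision_function(w, b, xi)) for xi, yi in zip(X, y)
    )
    return margin_term + C * slack


# Two points lie outside the margin; the third is inside it and incurs slack 0.5.
X_svm = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
y_svm = [1, -1, 1]
objective = soft_margin_objective([1.0, 0.0], 0.0, X_svm, y_svm)
```

A smaller C gives less weight to the slack term, which corresponds to the softer margin described in the text for data that are not easily separable.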

Materials and Methods
This section describes the dataset used for cyberbullying detection on Twitter, its visualization and the proposed methodology for conducting sentiment analysis on the dataset selected, as well as discussing the evaluation metrics of each classifier used.

Dataset
Detecting cyberbullying in social media through cyberbullying keywords and using machine learning for detection pose both theoretical and practical challenges. From a practical perspective, researchers are still attempting to detect and classify offensive content based on the learning model. However, classification accuracy and the choice of the right model remain critical challenges in constructing an effective and efficient cyberbullying detection model. In this study, we used a global dataset of 37,373 tweets to evaluate seven classifiers that are commonly used in cyberbullying content detection. Our dataset is taken from two sources [8,45] and has been divided into two parts. The first part contains 70% of the tweets and is used for training, and the other part contains the remaining 30% and is used for prediction. The evaluation of each classifier is conducted based on the performance metrics discussed in Section 4. Figure 3 illustrates the proposed model of cyberbullying detection, which has four phases: the preprocessing phase, the feature extraction phase, the classification phase, and the evaluation phase. Each phase is discussed in detail in this section.
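A 70/30 split of this kind can be sketched as follows. Shuffling before the split and the fixed random seed are assumptions made for reproducibility of the example; the text does not state how the tweets were assigned to the two parts.

```python
import random


def split_train_test(items, train_ratio=0.7, seed=42):
    """Shuffle a copy of items and split it into training and test parts."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(round(train_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Placeholder integers stand in for tweets in this illustration.
train_part, test_part = split_train_test(list(range(100)))
```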

Pre-Processing
The preprocessing step is essential in cyberbullying detection. It consists of both cleaning of texts (e.g., removal of stop words and punctuation marks) and spam content removal [112]. In the proposed model, it has been applied to remove unwanted noise from the text. For example, stop words, special characters, and repeated words were removed. Then, stemming was applied to reduce the remaining words to their original roots. As a result of this preprocessing, a dataset containing clean tweets is produced, on which the proposed model is run to make predictions.
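The cleaning steps listed above might look like the following sketch. Several details are assumptions made for illustration: the stop-word list is a tiny hypothetical subset, "repeated words" is interpreted as consecutive duplicates, and the suffix-stripping stemmer is a crude stand-in for a real stemming algorithm, which the text does not name.

```python
import re

# Illustrative subset only; a real stop-word list would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "and", "of", "you"}


def naive_stem(word):
    """Strip a few common suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tweet):
    """Lowercase, drop special characters, stop words, and repeats, then stem."""
    text = re.sub(r"[^a-z\s]", " ", tweet.lower())
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    # Remove consecutive repetitions of the same word.
    deduped = [t for i, t in enumerate(tokens) if i == 0 or t != tokens[i - 1]]
    return [naive_stem(t) for t in deduped]


clean = preprocess("You are SO SO annoying!!! Stop posting, loser...")
```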

Feature Extraction
Feature extraction is a critical step for text classification in cyberbullying. In the proposed model, we have used TF-IDF and Word2Vec techniques for feature extraction. TF-IDF is a combination of TF and IDF (term frequency-inverse document frequency), and this algorithm is based on word statistics for text feature extraction. This model considers only the expressions of words that are the same in all texts [72]. TF-IDF is one of the most commonly used feature extraction techniques in text detection [16]. Word2Vec is a two-layer neural net that "vectorizes" words to process text. Its input is a corpus of text, and its output is a set of vectors: feature vectors representing the words in that corpus [49]. The Word2Vec method uses two hidden layers of shallow neural networks, continuous bag-of-words (CBOW), and the Skip-gram model to construct a high-dimensional vector for each word [15]. The Skip-gram model is based on a corpus of words w and their contexts c. The aim is to maximize the likelihood:
$$ \arg\max_{\theta} \prod_{w \in T} \left[ \prod_{c \in c(w)} p(c \mid w; \theta) \right] \tag{3} $$
where T refers to the text, c(w) is the set of contexts of the word w, and θ is the parameter of p(c | w; θ). Figure 4 illustrates the Word2Vec model architecture, where the CBOW model attempts to find a word based on its context terms, while Skip-gram attempts to find terms that could fall in the vicinity of each word. The Word2Vec technique implements both training models. The basic idea behind the two training models is that either a word is utilized to predict its context or, the other way around, the context is used to predict the current word.
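The difference between the two training models can be seen in how training examples are generated from a token sequence. The sketch below only builds the (input, target) examples; it does not train any embedding. The symmetric context window and the function names are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each word is used to predict every word in its window."""
    pairs = []
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((word, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: the words in the window are used together to predict the center word."""
    examples = []
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, word))
    return examples


tokens = ["you", "are", "a", "loser"]  # hypothetical tokenized tweet
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
```

With a window of 1, the word "are" yields the Skip-gram pairs ("are", "you") and ("are", "a"), while CBOW yields the single example (["you", "a"], "are").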
With TF-IDF, each word is weighted by its relative frequency instead of merely being counted, which would overemphasize frequent words. The TF-IDF features inform the model if a word appears more often in a statement than it typically does in the entire text corpus. Prior work has found TF-IDF features useful for cyberbullying detection in SM [113]. As with bag-of-words (BOW), the TF-IDF vocabulary is constructed during model training and then reused for test prediction. Both BOW and TF-IDF are considered to be simple, proven methods for classifying text [114]. The weight that TF-IDF assigns to a term t in a document d is given in Equation (4).
$$ W(d, t) = TF(d, t) \times \log\left(\frac{N}{df(t)}\right) \tag{4} $$
In this case, TF(d, t) is the frequency of the term t in the document d, N is the number of documents, and df(t) is the number of documents in the corpus containing the word t. In Equation (4), the first term enhances the recall, while the second term enhances the word embedding accuracy [52].
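A small worked computation may help here. The sketch assumes the common form in which the weight of a term is its raw count in the document multiplied by log(N / df(t)), using the N and df(t) defined above; library implementations often add smoothing or normalization, so their values can differ. The toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    n_docs = len(docs)
    # df[t]: number of documents that contain term t at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


toy_docs = [["you", "loser"], ["you", "rock"]]
toy_weights = tfidf(toy_docs)
```

In this toy corpus "you" occurs in both documents, so its weight is zero, whereas "loser" occurs in only one and receives a weight of log(2): the term that distinguishes the documents is the one that is emphasized.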

Classification Techniques
In this study, various classifiers have been used to classify whether a tweet is cyberbullying or non-cyberbullying. The classifier models constructed are LR, LGBM, SGD, RF, AdaBoost, Naive Bayes, and SVM. These classifiers have been discussed in Section 2, and the evaluation of their performance is carried out in Section 4.

Results and Discussion
This section presents the results of the experiments and discusses their significance. First, each classifier's performance results have been listed and discussed in Table 2, where it shows the evaluations of each classifier in terms of precision, recall, and F1 score, respectively. Secondly, the training time complexity of each algorithm is illustrated in Table 3. These will be discussed in detail in the following sections.

Evaluation Metrics
The effectiveness of the proposed model was examined in this study by utilizing several evaluation measures to assess how successfully the model can differentiate cyberbullying from non-cyberbullying. In this study, seven machine learning algorithms were constructed, namely, LR, LGBM, SGD, RF, AdaBoost, Naive Bayes, and SVM. It is essential to review the standard assessment metrics used in the research community to understand the performance of competing models. The most widely used criteria for evaluating cyberbullying classifiers on SM platforms (e.g., Twitter) are as follows:
• Accuracy calculates the ratio of correctly classified cases to all cases, and it has been utilized to evaluate cyberbullying prediction models in [60,65,79]. It can be calculated as follows:
$$\text{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn}$$
where tp denotes true positives, tn true negatives, fp false positives, and fn false negatives.


• Precision calculates the proportion of relevant tweets among the true positive (tp) and false positive (fp) tweets belonging to a specific group.
• Recall calculates the ratio of retrieved relevant tweets over the total number of relevant tweets.
• F-Measure provides a way to combine precision and recall into a single measure that captures both properties.
The three evaluation measures listed above have been utilized to evaluate cyberbullying prediction models in [67,79,98,104]. They are calculated as follows:
$$\text{Precision} = \frac{tp}{tp + fp}$$
$$\text{Recall} = \frac{tp}{tp + fn}$$
$$\text{F-Measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
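The four metrics can be computed directly from the confusion-matrix counts. The sketch below assumes the standard definitions of accuracy, precision, recall, and F-measure; the function name and the counts are hypothetical and do not correspond to any result reported in this study.

```python
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for a test set of 100 tweets.
metrics = classification_metrics(tp=60, tn=25, fp=10, fn=5)
print(metrics)
```

The guards against empty denominators return 0.0 by convention, which is a design choice of this sketch rather than part of the definitions.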

Performance Result of Classifiers
The proposed model utilizes the seven selected ML classifiers with two different feature extraction techniques. These techniques were set empirically to achieve higher accuracy. For instance, LR achieved the best accuracy and F1 score on our dataset, with a classification accuracy of 90.57% and an F1 score of 0.9280. The performance of the LR, SGD, and LGBM classifiers differs only slightly: SGD achieved an accuracy of 90.6%, but its F1 score was lower than that of LR, while the LGBM classifier achieved an accuracy of 90.55% and an F1 score of 0.9271. Overall, LR performs better than the other classifiers, as shown in Table 2.
Moreover, RF and AdaBoost achieved almost the same accuracy, but in terms of F1 score, RF performs better than AdaBoost. Multinomial NB achieved a low accuracy (81.39%) and a low precision (0.7952); however, its excellent recall levels out the low precision, giving a good F-measure score of 0.8754, as illustrated in Table 2.
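How strongly the recall compensates for the low precision can be checked by solving the F-measure definition for recall. The sketch below assumes the standard harmonic-mean form F = 2PR / (P + R), which rearranges to R = FP / (2P - F), and uses the rounded precision and F-measure values quoted above for Multinomial NB. The result is therefore only an approximation of the recall reported in Table 2, and the function name is illustrative.

```python
def implied_recall(precision: float, f1: float) -> float:
    """Solve F = 2PR / (P + R) for R, giving R = F * P / (2P - F)."""
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("No valid recall exists for this precision/F1 pair.")
    return f1 * precision / denominator


# Rounded values quoted in the text for Multinomial NB.
precision_nb = 0.7952
f1_nb = 0.8754

recall_nb = implied_recall(precision_nb, f1_nb)
print(round(recall_nb, 3))
```

Under these assumptions the implied recall is roughly 0.97, which is consistent with the observation that a high recall offsets the low precision.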
Finally, SVM achieved the lowest accuracy and precision on our dataset, as shown in Figure 5. Nevertheless, it achieved the best recall among the classifiers implemented in this research. Some studies have looked at the automatic detection of cyberbullying incidents; for example, an effect analysis based on lexicon and SVM was found to be effective in detecting cyberbullying. However, the accuracy decreased when the data size increased, suggesting that SVM may not be ideal for dealing with the common language ambiguities typical of cyberbullying [61]. This indicates that the low accuracy achieved by SVM is likely due to the large dataset used in this research.
F-measure is one of the most effective evaluation metrics. In this research, the seven classifiers' performances were computed using the F-measure metric, as shown in Figure 6. Furthermore, the performances of all ML classifiers are enhanced by producing additional data utilizing data synthesizing techniques. Multinomial NB assumes that every feature is independent, but this is not true in real situations [115]. Therefore, it does not outperform LR in our research either. As stated in [116], LR performs well for binary classification problems and works better as data size increases. LR updates several parameters iteratively and tries to eliminate the error. By comparison, SGD uses a single sample and a similar approximation to update the parameters. Therefore, SGD performs almost as well as LR, but its error is not reduced as much as in LR [92]. Consequently, it is not surprising that LR also outperforms the other classifiers in our study.
Table 3 shows the time complexity of the best and the worst algorithms in terms of training and prediction time. The results in Table 3 indicate that Multinomial NB achieved the best training time (0.014 s) and RF the worst (2.5287 s).
Meanwhile, LR outperforms all the classifiers implemented in this research in terms of prediction time. However, SGD and Multinomial NB differ only slightly from LR, as shown in Table 3.

Conclusions
Cyberbullying has become a severe problem in modern societies. This paper proposed a cyberbully detection model whereby several classifiers based on TF-IDF and Word2Vec feature extraction have been used. Furthermore, various methods of text classification based on machine learning were investigated. The experiments were conducted on a global Twitter dataset. The experimental results indicate that LR achieved the best accuracy and F1 score in our dataset, where the classification accuracy and F1 score are 90.57% and 0.9280, respectively.
The performance of the LR, SGD, and LGBM classifiers differs only slightly: SGD achieved an accuracy of 90.6%, but its F1 score was lower than that of LR, while the LGBM classifier achieved an accuracy of 90.55% and an F1 score of 0.9271. This means that LR performs better than the other classifiers. Moreover, during the experiments, it was observed that LR performs better as data size increases and obtains the best prediction time compared to the other classifiers used in this study. SGD performs almost as well as LR, but its error is not reduced as much as in LR.
Feature extraction is a critical aspect of machine learning for enhancing detection accuracy. In this paper, we did not investigate many feature extraction techniques. Thus, one improvement is to incorporate and test different feature extraction techniques to improve the detection rate of both the LR and SGD classifiers. We are also working on building a real-time cyberbullying detection platform, which will be useful for instantly detecting and preventing cyberbullying. Another research direction is cyberbullying detection in various languages, mainly in an Arabic context.

Funding: This work was supported by Prince Sultan University, Riyadh, Saudi Arabia.

Conflicts of Interest:
The authors declare no conflict of interest.