Cyberbullying Detection: Hybrid Models Based on Machine Learning and Natural Language Processing Techniques

Abstract: The rise in web and social media interactions has resulted in the effortless proliferation of offensive language and hate speech. Such online harassment, insults, and attacks are commonly termed cyberbullying. The sheer volume of user-generated content has made it challenging to identify such illicit content. Machine learning has wide applications in text classification, and researchers are shifting towards using deep neural networks for detecting cyberbullying due to the several advantages they have over traditional machine learning algorithms. This paper proposes a novel neural network framework with parameter optimization and an algorithmic comparative study of eleven classification methods: four traditional machine learning and seven shallow neural network models, evaluated on two real-world cyberbullying datasets. In addition, this paper examines the effect of feature extraction and word-embedding techniques from natural language processing on algorithmic performance. Key observations from this study show that bidirectional neural networks and attention models provide high classification results. Logistic Regression was observed to be the best among the traditional machine learning classifiers used. Term Frequency-Inverse Document Frequency (TF-IDF) demonstrates consistently high accuracies with traditional machine learning techniques. Global Vectors (GloVe) perform better with neural network models. Bi-GRU and Bi-LSTM worked best amongst the neural networks used. The extensive experiments performed on the two datasets establish the importance of this work by comparing eleven classification methods and seven feature extraction techniques. Our proposed shallow neural networks outperform existing state-of-the-art approaches for cyberbullying detection, with accuracy and F1-scores as high as ~95% and ~98%, respectively.


Introduction
Social media is an interactive tool that brings people together to share information. The primary function of Online Social Networks (OSNs) is to allow people to communicate virtually by using the internet. However, such technologies have also resulted in several additional social issues, one of them being 'cyberbullying.' Although bullying has existed in society before these technologies, the perceived protection of online interfaces has resulted in increased cyberbullying. Cyberbullying is commonly defined as an intentionally violent or aggressive behaviour using electronic media carried out by an individual or a group targeting a victim online [1]. This action involves repeatedly insulting, harassing, or verbally attacking a target online [2]. Malicious social media users use sexist remarks, offensive language, hate speech, toxic comments, and abusive language to target victims. Such content torments social media users, adversely affecting their mental health.

This study compares results by using four evaluation metrics, namely accuracy, precision, recall, and F1-score. The models employed in this study outperform several state-of-the-art cyberbullying detection mechanisms. Section 4.3 of this paper, Baseline Comparison, discusses the performance of models found in the literature along with the results derived from this study.
The contributions of this paper are as follows:
1. We propose a novel architecture for cyberbullying detection that employs a bidirectional GRU by using GloVe for text representation. The proposed mechanism outperforms the existing baselines that employed Logistic Regression, CNN, and Bidirectional Encoder Representations from Transformers (BERT).
2. We also propose a novel CNN-BiLSTM framework for the task, which yields results comparable to the existing baselines.
3. We provide a comparative study on the classification performance of four traditional machine learning and seven neural-network-based algorithms.
4. We experiment with several feature extraction techniques and determine best-suited approaches for feature extraction and text embedding for both traditional machine learning and neural-network-based methods.
5. We establish the efficacy of shallow neural networks for cyberbullying classification, thus moderating the need for complexly structured deep neural networks.
The rest of the paper is organized as follows: Section 2 explains the existing literature and mechanisms developed for efficient cyberbullying detection. Section 3 describes the methodology of the proposed work, the mathematical background of algorithms used, and the novel framework to accommodate all neural network algorithms. Section 4 lists the experimental results obtained and compares their performance; finally, Section 5 concludes the contributions and future prospects.

Related Work
Cyberbullying is classified as generalized abuse, largely towards the appearance, interests, intelligence, or previous posts of the recipients. Hate speech is differentiated from cyberbullying in being defined as abuse directed specifically towards a unique, non-controllable attribute of a group of people, such as race, sexuality, and gender identity. Davidson et al. [5] identified it as derogatory language intended to humiliate or insult the members of the targeted group. In certain instances, people use terms that do not belong to hate speech but are offensive to specific groups. Warner and Hirschberg [15] note that some African Americans use the term 'nigga' in their day-to-day language, and terms such as 'hoe' and 'bitch' are frequently used on social media. Such terms do not fall within the boundaries of hate speech but are offensive to specific societal sections; hence, they are categorized as offensive speech.
There is a considerable availability of real-world datasets in the present scenario such as the Twitter dataset [10], the Chinese Sina Weibo dataset [13], and the Kaggle dataset [16] for cyberbullying, hate speech, and offensive language detection. These labeled datasets allow the use of machine learning and deep learning algorithms by aiding supervised and semi-supervised training. Ample availability of annotated training data aids in building efficient supervised frameworks. However, the task of online cyberbullying detection holds certain limitations that have to be addressed. The drawbacks lie in the low availability of positively labeled cyberbullying posts because the datasets available are highly imbalanced. Wulczyn et al. [17] crowdsourced and aggregated a vast corpus of annotated Wikipedia articles with over 100K items extracted from talk pages. Another annotated dataset was presented by Warner and Hirschberg [15], collecting commonly used hate-speech terms from Twitter. They identified that hate speech targets specific groups of a particular ethnicity, race, caste, or creed and found that a correlation exists between hate speech and stereotypical words.
Classification frameworks of cyberbullying posts primarily utilize Natural Language Processing (NLP) methods [18]. Term Frequency-Inverse Document Frequency (TF-IDF) is an established method for extracting textual features from the data [19]. It measures a word's importance in a given document by using its frequency and inverse frequency count. Several NLP techniques such as TF-IDF, Vector Space Model (VSM), Linear Discriminant Analysis (LDA), and Latent Semantic Analysis (LSA) have been designed for such feature extraction [20][21][22]. A Naïve Bayes approach was implemented by Kwok and Wang [23] on a Twitter dataset comprising racist and non-racist comments, which demonstrated average classification performance when using the unigram bag-of-words model for feature extraction. A combined approach utilizing linguistics, n-grams, syntactic, and distributed syntactic features was designed by Nobata et al. [7] for detecting online hate speech. Yin et al. [24] proposed a supervised approach with TF-IDF for cyberbullying detection that uses content, context, and sentiment as textual features. Warner and Hirschberg [15] performed hate speech classification by using Support Vector Machines (SVM) linked with word sense disambiguation and using a lexicon of stereotypical words as features. A systematic review of existing research in this domain is compiled by Tokunaga [25], discussing the cyberbullying typologies, detection frameworks, and potential directions.
With the advent of deep learning algorithms for cyberbullying detection, Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) approaches have primarily been employed. Themeli et al. [26] performed hate speech detection by employing traditional machine learning and deep neural network models. Their experimental results demonstrate higher performance of Bag of Words (BoW) over GloVe and N-gram graphs when combined with Logistic regression and a three-layered neural network. Bu and Cho proposed a novel ensemble framework that uses two deep learning models for knowledge transfer: a CNN for capturing character-level syntactic features of the text and a Long-term Recurrent Convolutional Network (LRCN) for extracting semantic features [2]. Agrawal and Awekar [27] experimented on deep learning models by domain transferring the knowledge by using CNN, LSTM, Bi-LSTM, and Bi-LSTM with attention using random, GloVe, and Sentiment-Specific Word Embedding (SSWE). Aroyehun and Gelbukh [28] established the efficacy of deep neural networks by using seven different combinations of CNN, LSTM, and Bi-LSTM models and compared the results with traditional machine learning algorithms for aggression detection, incorporating data augmentation, and pseudo-labeling for the same. Mishra et al. [3] proposed a novel method of Twitter user profiling for cyberbullying detection by using authors' community-based data in addition to textual information. Rawat et al. [9] also relied on user information for abusive content detection by employing web scraping and exploratory data analysis to analyze the characteristics of users involved in spreading hate speech by combining traditional machine learning algorithms, sentiment analysis, and topic modeling for malicious user detection. The offensive tweet detection model by Aglionby et al. 
[29] proposes a multi-layer RNN and Gradient Boosted Decision Tree (GBDT) classifier framework with a self-attention mechanism that enhances text classification. Chen et al. [30] analyzed embedding methods for words and sequences, experimenting with word-level and sentence-level embedding techniques. Chu et al. also explored deep learning models with word embeddings for abuse detection [31] by developing an RNN with LSTM and two CNNs with word- and character-level embeddings. Anand and Eswari [32] developed an LSTM and a CNN network and analyzed their performance in the presence and absence of GloVe embeddings for cyberbullying detection. Badjatiya et al. [11] explored the performance of CNN and LSTM with various text embedding models, observing GBDT combined with LSTM as their best performing model. Pavlopoulos et al. [33] established that their proposed RNN with an attention mechanism outperforms Logistic Regression, a Multi-Layer Perceptron (MLP), and a vanilla CNN model for cyberbullying detection. Banerjee et al. [34] proposed a simple convolutional network with GloVe embeddings that performs better than an RNN with GloVe and several traditional machine learning techniques. A Bi-LSTM network with an attention mechanism was proposed by Agarwal et al. [35] in order to classify cyberbullying posts, using undersampling and class weighting to mitigate class imbalance in the dataset.

Methodology
Online cyberbullying detection frameworks primarily rely on traditional machine learning algorithms. However, these algorithms are at a disadvantage due to their inability to yield high accuracies on vast volumes of data for supervised classification. Existing studies are advancing towards neural networks, which overcome this limitation and provide better results and more robust mechanisms. This section covers various popular traditional machine learning and shallow neural network approaches. We discuss the proposed methodology of our classification frameworks and the architectures of all the proposed networks.

Preprocessing and Feature Extraction
Algorithms for text classification cannot process raw data due to their inability to understand high-level human language directly. The text undergoes conversion into vector notation in order to be processed by classification algorithms. Prior to this step, raw textual data undergoes several preprocessing steps, often referred to as data cleaning. Figure 1 illustrates the workflow of the proposed methodology. Input data are preprocessed by removing empty rows, punctuation, special characters, numerical values, and stopwords, and by lowercasing, tokenization, and stemming. In order to create vector notations of the input text, we experiment with several methods. For traditional machine learning models, we use four methods: Count Vectorization, TF-IDF word unigram, TF-IDF word bigram and trigram, and TF-IDF character bigram and trigram. For the proposed shallow neural networks, we use GloVe, FastText, and Paragram as the embedding representations. We employ a stratified 5-fold cross-validation technique that splits the dataset into five sets of training and testing data, preserving the class distribution of the original dataset in each split. This technique is employed to obtain statistically grounded results by averaging the results of runs on the individual splits.

Count Vectorization: This is a simple statistical method for generating embedded vectors of input text [36]. We use the frequency of occurrence of a term in a document in order to generate its embedding vector. A matrix is created for the entire document set, where each row corresponds to a document and each column to a word. The cells contain the occurrence frequency of a term in a document. We use this matrix as the feature representation for training the traditional machine learning algorithms.
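As an illustration, the count matrix described above can be built in a few lines of plain Python (the `count_vectorize` helper below is a hypothetical sketch for exposition, not the implementation used in this work):

```python
from collections import Counter

def count_vectorize(documents):
    """Build a document-term count matrix: one row per document, one column
    per vocabulary word (sorted alphabetically); cell values are the
    occurrence frequencies described in the text."""
    vocab = sorted({term for doc in documents for term in doc.split()})
    matrix = []
    for doc in documents:
        counts = Counter(doc.split())
        matrix.append([counts.get(term, 0) for term in vocab])
    return vocab, matrix
```

In practice, a library routine such as scikit-learn's `CountVectorizer` fills the same role with tokenization and sparse storage built in.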
TF-IDF: Term Frequency-Inverse Document Frequency (TF-IDF) [37] is a statistical approach that uses the occurrences of words as a measure for extracting textual features. A word's importance is directly proportional to its frequency in the document and inversely proportional to its frequency in the entire document set. For a term w_i in a document x_j, where its occurrence count in x_j is n_{i,j}, we calculate the term frequency TF_{i,j}, given by Equation (1):

TF_{i,j} = n_{i,j} / ∑_k n_{k,j} (1)
Here, ∑_k n_{k,j} denotes the total number of term occurrences in the document x_j. Next, we compute the inverse document frequency (IDF) by taking the logarithm of the total number of documents divided by the number of documents containing the term w_i, as in Equation (2):

IDF_i = log ( |D| / |{j : w_i ∈ x_j}| ) (2)
Here, |D| denotes the total number of documents, and |{j : w_i ∈ x_j}| is the number of documents containing the term w_i. Once we obtain the individual TF and IDF values, we compute the required TF-IDF of the term w_i, given by Equation (3):

TF-IDF_{i,j} = TF_{i,j} × IDF_i (3)
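These three equations can be sketched directly in Python (a minimal illustration under our own naming, not the paper's code; production pipelines typically add smoothing to the IDF term):

```python
import math
from collections import Counter

def tf_idf(documents):
    """TF-IDF per Equations (1)-(3): TF is a term's count divided by the
    document's total term count; IDF is log(|D| / document frequency)."""
    docs = [doc.split() for doc in documents]
    n_docs = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({t: (c / total) * math.log(n_docs / df[t])
                       for t, c in counts.items()})
    return scores
```

Note that a term appearing in every document receives an IDF of zero, which is exactly the down-weighting of uninformative words the method is designed for.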
GloVe: Global Vectors (GloVe) [38] for word representation is an unsupervised technique for deriving word embeddings from text input. We utilize an A × A term-based co-occurrence matrix (where A is the vocabulary size) to obtain representations. The co-occurrence matrix is used to examine the semantic relationship between terms. For instance, high cosine similarity is demonstrated between words such as 'queen' and 'king' or 'mother' and 'woman.' The technique learns from a large Wikipedia and Gigaword corpus in an unsupervised fashion. For a word i with its vector representation w_i, the objective function is denoted by Equation (4):

J = ∑_{i,k} f(X_{ik}) (w_i^T w̃_k + b_i + b̃_k − log X_{ik})^2 (4)

where i and k are words with a similar context, X_{ik} is their co-occurrence count, and P_{ik} is the probability of them occurring together. We use these co-occurrence probabilities as features, capturing both the statistics and the context of the words.

FastText: Introduced by Facebook's AI Research lab (FAIR), FastText [39,40] is a skip-gram-based model [41] for enhanced word representations. The effectiveness of this technique lies in its consideration of the morphology of words in a language. Other embedding techniques denote each word as a distinct vector. In contrast, FastText is specially designed to handle words of the same root by using character n-grams. It thus contains the subword information for every word by dividing each word, wrapped in the boundary markers '<' and '>', into a bag of its n-gram combinations. For example, for the word 'language' with n set to 3, the bag of character n-grams contains '<la,' 'lan,' 'ang,' 'ngu,' 'gua,' 'uag,' 'age,' and 'ge>.' FastText enables understanding the context of unknown words by breaking them into smaller forms and matching the similarities with those within its training corpus.
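The subword decomposition FastText relies on can be illustrated with a short helper (our own sketch, not the fastText library):

```python
def char_ngrams(word, n=3):
    """Character n-grams in the FastText style: the word is wrapped in the
    boundary markers '<' and '>' before n-grams are extracted."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]
```

For example, `char_ngrams("language")` yields '<la', 'lan', 'ang', 'ngu', 'gua', 'uag', 'age', and 'ge>'; the boundary markers let the model distinguish prefixes and suffixes from word-internal substrings.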
Paragram: Paragram [42] is another word representation technique designed to capture better contextual similarities. It uses the ParaPhrase DataBase (PPDB) and performs finetuning [43], counter-fitting [44], and attract-repel [45] in order to inject synonym and antonym features as a vectorization constraint. The technique is comparatively robust due to better contextual understanding.

Traditional Machine Learning Approaches
We employed four popular machine learning approaches for cyberbullying detection: XGBoost, Naïve Bayes, SVM, and Logistic Regression. After the preprocessing and feature extraction phases, the vectorized text was input to these classifiers in order to evaluate their performance.
XGBoost: Extreme Gradient Boosting (XGBoost) [46] is a Gradient Boosting Decision Tree (GBDT) enhancement. The algorithm employs multiple decision trees with low accuracy (weak learners) and combines them to provide higher accuracy. The trees are built in stages, and the residuals (errors) of the models from previous stages inform the next stage of trees by using gradient descent; this is standard gradient boosting. Building on this principle, XGBoost incorporates several other optimizations, such as a parallelized implementation, tree pruning, hardware optimization, regularization, sparsity awareness, cross-validation, and a weighted quantile sketch, to render its performance more efficient and effective.
The algorithm advances in the direction of the tree that minimizes the objective function. For a dataset D = {(x_i, y_i) | |D| = n, x_i ∈ R^m, y_i ∈ R} with n samples and m features, where x_i denotes a sample and y_i denotes its category, the predictions are calculated by using Equation (5):

ŷ_i = ∑_n f_n(x_i) (5)

where f_n(x_i) is the prediction of the n-th tree, fitted to the residual error between the true and predicted classes. The solution of the above objective function is calculated by using maximum likelihood estimation, as discussed by Chen and Guestrin [47].

Naïve Bayes: The Naïve Bayes (NB) algorithm [48] is used for probabilistic classification. It is widely used in various practical applications due to its efficiency in reducing computational costs. It is a scalable algorithm applicable to large datasets that also yields high classification accuracies. It principally assumes that the features are conditionally independent of one another given the class. The probability of a document D pertaining to a class C is given by Equation (6):

P(C | D) = P(D | C) P(C) / P(D) (6)
In order to predict that a data point x with features a_1, . . . , a_d belongs to a particular category, the prediction θ(x) is given by Equation (7):

θ(x) = argmax_c P(c) ∏_{i=1}^{d} P(a_i | c) (7)
SVM: Support Vector Machine (SVM) [49] is a supervised algorithm that uses the separation margin between data points of different classes as the classification criterion. The original m-dimensional feature space is reduced to a user-defined dimensional space. Support vectors are then determined to optimize the margin distance among data points of different categories. The algorithm automatically determines these support vectors as the points found nearest to the separating margins (hyperplanes). Equation (8) defines the linear SVM optimization in its dual form:

max_α ∑_{i=1}^{l} α_i − (1/2) ∑_{i=1}^{l} ∑_{j=1}^{l} α_i α_j y_i y_j (x_i · x_j), subject to 0 ≤ α_i ≤ C, (8)

where α_i denotes the term weight, and C controls the trade-off between the model's error and the margin. In order to predict that a data point x belongs to a particular category, the prediction θ(x) is given by Equation (10):

θ(x) = sign(w* · x + b*) (10)

where w* = ∑_{i=1}^{l} α_i y_i x_i.

Logistic Regression: Logistic Regression (LR) [50] is another statistical algorithm, one that predicts probabilities rather than classes. The logistic function is used to form a hyperplane in order to classify data points into the given classes. Textual features are input to the algorithm to generate predictions about a data point belonging to a particular class. The function is given by Equation (11):

h_θ(x) = 1 / (1 + e^{−θ^T x}) (11)

where the positive class is determined by h_θ(x) ≥ 0.5 (y = 1), and the negative class is determined by h_θ(x) < 0.5 (y = 0).
Here, θ is the parameter (weight) on an input variable x parameterizing the space of linear functions mapping from X to Y.
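The decision rule of Equation (11) amounts to thresholding the sigmoid of a weighted sum, which can be written in a few lines (a toy sketch with hypothetical weights, not a trained model):

```python
import math

def sigmoid(z):
    """Logistic function of Equation (11)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    """Predict the positive class when h_theta(x) >= 0.5, which is
    equivalent to theta . x >= 0."""
    h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return 1 if h >= 0.5 else 0
```

Training consists of fitting `theta` to the vectorized text features, after which classification is a single dot product per document.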

Neural Network Approaches
The elimination of manual feature extraction has made neural networks extremely popular in the research community. Neurons within the network are responsible for automatically extracting the essential features that help to differentiate content belonging to different classes. The need for neural networks arises from large dataset sizes that most traditional machine learning algorithms fail to accommodate. Additionally, neural networks offer robustness and higher classification results. We compare the following architectures of popular neural networks for cyberbullying classification: CNN, LSTM, Bi-LSTM, GRU, Bi-GRU, CNN-BiLSTM, and Attention-BiLSTM. To execute our approach, we design a novel framework accommodating each of these models, depicted in Figures 2-5. The methodology and architectures of these models are discussed below.

CNN: After demonstrating extreme efficiency in image classification tasks, convolutional neural networks [51] have been widely adopted for text classification. We develop a Text-CNN with a single hidden layer and employ it for cyberbullying detection. A convolution operation ∗ on functions f and g is performed by reversing and shifting one of these functions, as described by Equation (12):

(f ∗ g)(t) = ∑_τ f(τ) g(t − τ) (12)
A representation of the proposed architecture incorporating a CNN is shown in Figure 2. The model takes in preprocessed 30-dimensional text input and performs the respective embedding by using GloVe, FastText, or Paragram with 300-dimensional word vectors. The embedded text passes through a one-dimensional convolutional layer whose kernel moves over the input with a filter size of three and a stride of one. The information is then passed through a max-pooling layer that outputs the maximum value from each resulting data matrix. A ReLU activation function is applied, after which the output is fed to a series of fully connected layers. A dense layer of dimension 50 sends the information through a dropout layer with probability 0.25 to a final dense layer. The dimension of this final dense layer is set to two, equal to the number of classes. Predictions are generated by using a softmax function that outputs the labels for each item in the dataset.
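The convolution-plus-pooling step at the heart of this architecture can be sketched without any framework (a didactic one-channel version; deep learning libraries implement the cross-correlation form shown here, and the real layer learns many such filters):

```python
def conv1d(seq, kernel, stride=1):
    """Valid 1-D convolution of a sequence with a single filter
    (filter size 3 and stride 1 in the proposed model)."""
    k = len(kernel)
    return [sum(s * w for s, w in zip(seq[i:i + k], kernel))
            for i in range(0, len(seq) - k + 1, stride)]

def relu_maxpool(feature_map):
    """ReLU applied elementwise, then the global maximum taken; the order
    is interchangeable here because ReLU is monotone."""
    return max(max(v, 0.0) for v in feature_map)
```

In the full model, each word position contributes a 300-dimensional embedding rather than a scalar, and the pooled activations from many filters feed the dense layers.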
LSTM: Long Short-Term Memory networks (LSTMs) [52] are a special type of Recurrent Neural Network (RNN) that is more advantageous than vanilla RNNs in terms of information retention. LSTMs overcome the vanishing gradient problem encountered in traditional RNNs [53]. LSTMs are highly preferred for tasks such as text classification and predictive modeling due to their extensive memory capacity. Such a network selectively decides which information is necessary to transfer to further neurons and which data can be forgotten or omitted. These networks employ backpropagation and a gated mechanism. A basic LSTM network consists of an input gate (i_t), an output gate (o_t), and a forget gate (f_t), represented by Equations (13)-(15):

i_t = σ(W_i x_t + U_i h_{t−1} + b_i) (13)
f_t = σ(W_f x_t + U_f h_{t−1} + b_f) (14)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o) (15)
Here, x t denotes an input text, h is used to represent the state of the input where h t is called current state, and h t−1 denotes the previous state. W and b are the weights and bias for each gate, respectively. Here, σ denotes the activation function used, which is ReLU in the case of the proposed model.
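One step of this gated recurrence can be written out for the scalar case (a toy sketch with our own parameter layout; the standard sigmoid gate activation is shown, whereas the proposed model uses ReLU):

```python
import math

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One scalar LSTM step: input (i), forget (f) and output (o) gates,
    each with input weight W, recurrent weight U and bias b
    (dicts keyed 'i', 'f', 'o', 'c')."""
    i = _sigmoid(W['i'] * x_t + U['i'] * h_prev + b['i'])   # input gate
    f = _sigmoid(W['f'] * x_t + U['f'] * h_prev + b['f'])   # forget gate
    o = _sigmoid(W['o'] * x_t + U['o'] * h_prev + b['o'])   # output gate
    c_tilde = math.tanh(W['c'] * x_t + U['c'] * h_prev + b['c'])
    c_t = f * c_prev + i * c_tilde        # selectively forget / admit
    h_t = o * math.tanh(c_t)              # current state
    return h_t, c_t
```

The forget gate scaling `c_prev` is exactly the "decide what to omit" mechanism described above; in practice all quantities are vectors and the multiplications are matrix products.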
Bi-LSTM: Bidirectional LSTM (Bi-LSTM) [54] is a robust mechanism used to enhance backpropagation in LSTM networks. While the information in an LSTM travels unidirectionally, a Bi-LSTM allows data to move in both the forward and backward directions, processing inputs both serially and in reverse. Architecturally, it simply combines two LSTMs running in opposite directions. This allows the network to carry information from past to future by using the forward layer and from future to past by using the backward LSTM layer. For a given sequence of inputs x_{t−1}, x_t, x_{t+1}, . . . , x_n, the output of the forward layer →h is calculated, whereas for the reversed sequence x_n, x_{n−1}, x_{n−2}, . . . , x_{t−1}, the output ←h is calculated through the backward layer, where h = o_t ∗ tanh(C_t) and C_t is a vector produced by the activation function. The output of the Bi-LSTM network is denoted by Equation (16): Y_T = (y_{t−1}, y_t, . . . , y_{t+n}), where y_t = σ(→h, ←h) and σ is a concatenation operation.

GRU: Gated Recurrent Units (GRUs) [55] are also a type of RNN with a gated mechanism, designed to deal with the vanishing and exploding gradient problems. They provide higher testing accuracies than traditional RNNs because of their ability to remember long-term dependencies. GRUs are a more straightforward and dynamic version of LSTM networks, specifically designed for updating or resetting the information in their memory cells. The network contains an update gate that combines the input and forget gates present in LSTMs. Additionally, there is a reset gate for refreshing the memory contents. GRUs are lightweight and have fewer parameters than LSTMs. For an input vector x_t at time t with bias vector b and parameter matrices W and U, the update gate and reset gate are given by Equations (17) and (18):

z_t = σ_g(W_z x_t + U_z h_{t−1} + b_z) (17)
r_t = σ_g(W_r x_t + U_r h_{t−1} + b_r) (18)

and h_t is defined by Equation (19), with ⊙ as the Hadamard product:

h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ φ_h(W_h x_t + U_h (r_t ⊙ h_{t−1}) + b_h) (19)
σ_g and φ_h are the sigmoid activation function and the hyperbolic tangent, respectively.

Bi-GRU: A bidirectional GRU [54] is a dual-layered structure similar to a Bi-LSTM, with forward and backward neural networks. The idea of this structure is to transfer the entire contextual information from the input to the output layer. As in a bidirectional LSTM, in a Bi-GRU the input information travels through one neural network in the forward direction and another in the backward direction. The outputs from the forward and backward layers are fused to provide the final output. An architectural representation for classification using LSTM, Bi-LSTM, GRU, and Bi-GRU is displayed in Figure 3. In order to use a specific model at a time, the outputs from the embedding matrix are fed to the chosen neural network model and then sent to the series of fully connected layers. The parameters of the fully connected layers stay the same as in the CNN for all models proposed in this work.
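The bidirectional fusion shared by Bi-LSTM and Bi-GRU can be expressed generically: run any recurrence forward and backward over the sequence and pair the per-step outputs (a minimal sketch of the fusion in Equation (16), not a trained network):

```python
def bidirectional(seq, step, h0=0.0):
    """Apply the recurrence h_t = step(x_t, h_prev) in both directions and
    pair up (forward, backward) outputs for each time step."""
    fwd, h = [], h0
    for x in seq:                      # forward pass
        h = step(x, h)
        fwd.append(h)
    bwd, h = [], h0
    for x in reversed(seq):            # backward pass
        h = step(x, h)
        bwd.append(h)
    bwd.reverse()                      # realign with forward time order
    return list(zip(fwd, bwd))
```

Plugging in an LSTM or GRU cell as `step` yields the Bi-LSTM or Bi-GRU output sequence, with concatenation here represented as a pair.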
CNN-BiLSTM: We propose a combination network constituting a convolutional and a BiLSTM layer, illustrated in Figure 4. The input text, after undergoing embedding, is initially fed to a BiLSTM layer of size 100. Features from this layer further undergo convolution operation by using a one-dimensional convolutional layer with a ReLU activation. Outputs are processed under max-pooling operation and passed on to the series of fully connected layers. This combination allows us to use the retentive power of LSTMs and feature extraction capability of CNNs, thus forming a more robust classifier.
Attention-BiLSTM: Another model that we propose combines a Bi-LSTM network with a hierarchical attention model. As suggested by the name, the attention model [56] pays special attention to words possessing higher importance in the document. In the proposed architecture, represented in Figure 5, information processed through the Bi-LSTM network is passed through an attention layer with multiple neurons and then to the fully connected layers. The mechanism encodes only selective valuable information by understanding the context, enhancing the final output. This allows the model to run successfully on sufficiently large input texts. We adopt the self-attention mechanism proposed by Vaswani et al. [56]. The model assigns non-zero weights to all input items. We employ the scaled dot product as the similarity function. The attention value for a given query Q is calculated by using key-value pairs as the source, obtaining the similarities between each key K and the query. This is mathematically represented by Equation (21):

score(Q, K) = QK^T (21)

A softmax function is used to normalize the weights in order to calculate the final attention, provided by Equation (22), where d_k is the key dimension:

Attention(Q, K, V) = softmax(QK^T / √d_k) V (22)
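Scaled dot-product attention can be made concrete in plain Python (a numerically minimal sketch of one attention head; matrices are lists of row vectors, and the usual max-subtraction trick stabilizes the softmax):

```python
import math

def scaled_dot_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # similarity of the query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]          # softmax weights
        # weighted sum of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

When all keys score equally, the softmax weights are uniform and the output is simply the average of the value vectors; higher-scoring keys pull the output towards their values.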

Implementation Details
The implementation of all experiments is carried out by using Python 3 on Google Colab with 13.53 GB of RAM. For the preprocessing tasks, we employed the RegexTokenizer for word tokenization, the Porter Stemmer for stemming, and the WordNet Lemmatizer. As discussed above, the embedding dimension is set to 300. The meta-parameters and hyperparameters of the shallow neural networks, such as the input dimension, size of the dense layers, dropout, and activation function, are chosen after extensive experimentation to yield the best possible results. A dense layer of 50 units provided the lowest loss value and highest accuracies in all experiments. The subsequent dense layer is of size two, matching the number of classes in the datasets. A dropout value of 0.25 reduced overfitting considerably. The meta-parameters specific to each neural network, such as the kernel size and stride in the CNN and the sizes of the LSTM, Bi-LSTM, GRU, and Bi-GRU networks, are also decided through trial-and-error experimentation over their individual ranges. We employed the Adam optimizer and binary cross-entropy loss for the models. Training is executed with a batch size of 64 instances for five epochs each.

Experimental Result Analysis
In this section, we describe the datasets used and report the experimental results. The results are evaluated by using four metrics: accuracy, precision, recall, and F1-score. We graphically compare the performance of all algorithms used on two real-world datasets. The baseline comparison with the existing literature is provided in Section 4.3.
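For binary labels, the four metrics reduce to simple counts of true/false positives and negatives (a self-contained sketch of the definitions, equivalent to what libraries such as scikit-learn compute):

```python
def metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 (the harmonic mean of precision
    and recall) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```

Because both datasets are highly imbalanced, precision, recall, and F1 on the positive (cyberbullying) class are more informative than raw accuracy.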

Datasets
Wikipedia Attack Dataset: This corpus was crowdsourced by Wulczyn et al. [17] in 2017 using Wikipedia articles. The dataset consists of discussion comments in the English language extracted from the 'talk pages' of Wikipedia. The comments are extracted by accessing the revision history of Wikipedia pages to obtain all interactions, including removed comments. The collected corpus is cleaned by removing HTML content and keeping plain text only. The dataset is stripped of all bot messages, and only human-made comments are retained. The dataset version that we use consists of 115,864 user comments, with 13,590 cyberbullying and 102,274 non-cyberbullying comments. Figure 6 illustrates the word clouds of the attack and non-attack classes for this dataset.

Wikipedia Web Toxicity Dataset: Another corpus of Wikipedia comments proposed by Wulczyn et al. [17], collected from the 'article talk namespace.' It is a binary-labeled corpus of 159,689 comments, containing 15,365 toxic and 144,324 non-toxic comments. The scraping procedure is the same as for the Wikipedia Attack dataset, using the revision history of article pages. The discussion comments are collected, and administrative and bot comments are removed to constitute the final corpus. Figure 7 illustrates the word clouds of the toxic and non-toxic classes for this dataset.

Result Analysis
We evaluate and compare the results of all classifiers used on the two datasets. Table 1 illustrates the experimental results of the traditional machine learning algorithms with the four feature extraction approaches. The results for XGBoost, Naïve Bayes, SVM, and Logistic Regression are graphically represented in Figures 8 and 9. Table 2 illustrates the results of the proposed shallow neural networks, and their performance comparison is shown in Figures 10 and 11. While accuracy is the simplest and most intuitive metric of model performance, it is not suitable for unbalanced datasets. Precision and recall, as well as the F1-score (the harmonic mean of precision and recall), have also been reported, and these scores are over 90% in the majority of cases.

For the Wikipedia Attack dataset, we observed that SVM achieves the highest F1-score of 98.12% using TF-IDF word unigram, followed by XGBoost providing a 97.14% and Logistic Regression a 97.13% F1-score with Count Vectorization and TF-IDF character bigram and trigram, respectively. XGBoost and Logistic Regression appear to perform better than Naïve Bayes on this dataset, despite Naïve Bayes achieving the highest accuracy of 95.44% when using Count Vectorization. SVM demonstrates lower performance on this dataset, especially with TF-IDF character bigram and trigram.
For the Wikipedia Web Toxicity dataset, the highest F1-score of 98.77% is achieved by SVM with TF-IDF word unigram features, followed by SVM with TF-IDF character bigrams and trigrams at 98.32%, and Naïve Bayes with Count Vectorization at 98.14%. Overall, Logistic Regression delivers high results with all feature extraction techniques, followed by SVM, then XGBoost, and then Naïve Bayes. However, compared with the proposed shallow neural networks, the traditional machine learning methods show lower and less consistent scores.

Turning to the shallow neural networks, the majority of results are above 95% across all evaluation measures, and higher than those of the traditional machine learning models (see Section 4.3, Baseline Comparison). For the Wikipedia Attack dataset, Bi-GRU with GloVe embeddings is the best-performing model, with a 98.56% F1-score and 96.98% accuracy. GloVe embeddings yielded higher and more consistent classification results than Paragram and FastText; although the other two embedding methods also performed well, GloVe was the clear winner. Among the proposed shallow neural networks, the Bi-LSTM and Bi-GRU models performed better than the rest. For the Wikipedia Web Toxicity dataset, Bi-LSTM models demonstrated strong performance: the highest F1-score is 98.69% with GloVe embeddings, followed by 98.65% with Attention-BiLSTM using FastText, and 98.61% with CNN using Paragram. On this dataset, the best-performing models are CNN-BiLSTM, Attention-BiLSTM, and Bi-GRU. Since the results above 95% are quite similar, the proposed framework evidently allows all of these models to perform classification with high accuracy. Summarizing, all common metrics indicate good performance.
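To make the Bi-GRU idea concrete, here is a minimal NumPy sketch of one bidirectional GRU pass over a sequence of embedding vectors. All dimensions and weights are illustrative toys, the gate equations follow the standard GRU formulation (implementations differ in minor conventions), and in practice the embedding vectors would be looked up from a pretrained GloVe matrix and the model trained in a deep learning framework:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, W, U, b):
    """One GRU step. W, U, b each hold the update (z), reset (r),
    and candidate (h~) parameters, in that order."""
    Wz, Wr, Wh = W
    Uz, Ur, Uh = U
    bz, br, bh = b
    z = sigmoid(x @ Wz + h @ Uz + bz)              # update gate
    r = sigmoid(x @ Wr + h @ Ur + br)              # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh + bh)  # candidate state
    return (1 - z) * h + z * h_tilde               # interpolated new state

def run_gru(seq, hidden, params):
    h = np.zeros(hidden)
    for x in seq:
        h = gru_step(x, h, *params)
    return h

emb_dim, hidden, seq_len = 8, 4, 5
make = lambda: ([rng.standard_normal((emb_dim, hidden)) * 0.1 for _ in range(3)],
                [rng.standard_normal((hidden, hidden)) * 0.1 for _ in range(3)],
                [np.zeros(hidden) for _ in range(3)])
fwd_params, bwd_params = make(), make()

# Token vectors, as would be looked up from a GloVe embedding matrix
seq = rng.standard_normal((seq_len, emb_dim))

# Bidirectional: one pass forward, one over the reversed sequence,
# then concatenate the two final states into the feature vector
# that a dense classification layer would consume.
h_fwd = run_gru(seq, hidden, fwd_params)
h_bwd = run_gru(seq[::-1], hidden, bwd_params)
features = np.concatenate([h_fwd, h_bwd])          # shape (2 * hidden,)
```

The concatenated state is what lets the bidirectional variants capture context on both sides of an abusive phrase, a plausible reason for their edge over unidirectional RNNs in these results.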
Accuracy is well over 90% for the proposed shallow neural networks and just over 80% for all traditional machine learning models across both datasets. The neural network approaches demonstrate better performance than the traditional machine learning algorithms. With only a single hidden layer, each neural network architecture provides high performance. The CNN-BiLSTM combination framework is equally capable as the other proposed shallow networks. Additionally, the parametric settings of the network (neuron count, size of fully connected dense layers, and dropout probabilities), which were decided upon through experimentation, yield optimum results. The proposed architecture is observed to work well with all the neural networks utilized. We summarize the key observations as follows:
1. Neural networks demonstrated higher performance than state-of-the-art traditional machine learning algorithms due to their robustness and capability to handle large datasets.
2. Count Vectorization, despite being an old statistical technique, manages to consistently provide good results.
3. Across all preprocessing steps, Logistic Regression displayed the highest average performance among the machine learning techniques used, followed by SVM, XGBoost, and Naïve Bayes, in that order.
4. GloVe embeddings produced the largest number of high scores, although FastText and Paragram achieved broadly similar results.
5. F1-measures convey high performance across all neural network models. From the accuracy scores, we conclude that RNN-based networks such as GRU, Bi-GRU, and Bi-LSTM offered the highest performance, with the attention mechanism achieving results close behind.
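The four evaluation metrics behind these observations can be sketched directly from a binary confusion matrix (stdlib only; the toy predictions are illustrative):

```python
# Accuracy, precision, recall, and F1 from binary labels (1 = attack).
# F1 is the harmonic mean of precision and recall, which is why it is
# the preferred summary metric on the imbalanced datasets used here.

def metrics(y_true, y_pred):
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Toy example: 3 positive and 5 negative ground-truth labels
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
acc, prec, rec, f1 = metrics(y_true, y_pred)
```

Note how, unlike accuracy, F1 is unaffected by the count of true negatives, which is what makes it robust to the large non-cyberbullying majority class.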

Baseline Comparison
In order to validate the efficacy of our work, we performed a baseline comparison with recent state-of-the-art techniques for cyberbullying detection. The comparison of results on the two datasets is provided in Table 3. The techniques used for comparison were re-implemented in accordance with the environment settings mentioned in the existing studies. As observed in Table 3, the results obtained by our proposed methods outperform the existing approaches on these datasets. On the Wikipedia Attack dataset, our proposed model, Bi-GRU with GloVe embeddings, achieves 96.98% accuracy and a 98.56% F1-score, higher than the existing methods. On the Wikipedia Web Toxicity dataset, Bi-GRU with GloVe embeddings outperforms the existing baselines with 96.01% accuracy and a 98.63% F1-score. The achieved results are ~2-3% higher than the state-of-the-art methods in terms of F1-measure, indicating that the proposed framework is also capable of handling class imbalance in the datasets. The evaluation metrics detailed in Table 3 validate the efficiency of our proposed methods. In addition, most of our proposed models outperform the existing state-of-the-art, as observable in Table 2. Notably, the proposed single-layer neural networks display higher classification efficiency than the existing deep CNN and BERT models. For the Wikipedia Attack dataset, the precision, recall, and F1-measures achieved by all of our shallow neural networks are higher than those of the existing methods [27,35,57,58]. For the Wikipedia Web Toxicity dataset, all results achieved using the neural network methods are considerably higher than the existing ones.

Conclusions and Future Prospects
With the expansion of the online space, cyberbullying has emerged as a ubiquitous problem with dire consequences for people and society. This research investigates several dimensions of cyberbullying detection. We explored eleven classification techniques, spanning traditional machine learning and shallow neural networks, along with seven types of feature extraction and embedding techniques. The results are established through experiments on two real-world datasets. We propose a novel neural network framework, establishing optimum network settings and dense and dropout layer sizes. The framework accommodates various classifiers and achieves high results overall, outperforming several baselines. We provided a comparative study discussing the performance of all the methods utilized, with results compared across four evaluation metrics to establish the soundness of this study. The usefulness of this work lies in identifying robust mechanisms for online cyberbullying detection. Additionally, the proposal of shallow neural networks reduces the need for complex deep neural networks, thus economizing resources. We observe that neural networks substantially outperform traditional machine learning algorithms, and we establish that bidirectional neural networks perform better in all scenarios. The attention mechanism is also observed to perform exceptionally well. Traditional machine learning algorithms such as SVM, Naïve Bayes, XGBoost, and Logistic Regression provide lower results than the shallow neural networks. Overall, we suggest using bidirectional RNNs and attention-based models for further advances in cyberbullying detection. This study paves the way towards developing better mechanisms to fight this online ailment.