A Review on Deep-Learning-Based Cyberbullying Detection

Bullying is described as an undesirable behavior by others that harms an individual physically, mentally, or socially. Cyberbullying is a virtual form (e


Introduction
Bullying that occurs through the Internet is called cyberbullying, or cyber harassment [1]. Different forms of cyberbullying can be observed today, for example, writing indecent textual content and sharing inappropriate visual content such as memes. Social media platforms such as Facebook, Instagram, and Twitter have made it easier for us to create content and to interact and connect with others. However, the unfiltered exchange of messages and the lack of protection for private information can lead to bullying on these platforms [2]. Cyberbullying can appear in any form, including flaming, vitriolic comments, offensive emails, humiliating pictures, mean remarks in comments, and harassment through posts on blogs or social media. Bullying may have severe consequences such as depression, which may even lead people to commit suicide [3,4].
Detecting cyberbullying is an important step toward stopping this threat. Detection is difficult, however, because of the lack of identifiable parameters and the absence of a quantifiable standard. The content involved is short, noisy, and unstructured, with incorrect spelling and symbols. Users sometimes intentionally obfuscate words or phrases (e.g., b***h, a**) to deceive automatic detection [5]. Researchers have used traditional machine learning (ML) algorithms to identify cyberbullying in both text and image formats, and the majority of the existing solutions are based on supervised learning methods [6]. Because bullying expressions are subjective, traditional ML models perform worse at detecting cyber harassment than deep learning (DL)-based approaches [7]. Deep Neural Networks such as the Recurrent Neural Network (RNN), Gated Recurrent Unit (GRU) [8], Long Short-Term Memory (LSTM) [9], Bi-LSTM [10], and several other DL models can be used to detect this problem.
Using DL-based models rather than traditional models to detect cyberbullying has several benefits. Several studies [11][12][13][14] have shown that DL algorithms outperform traditional ML algorithms when the data size is large. Extracting features manually for text and image classification is a tedious and error-prone task. Traditional ML models are often not a reasonable choice for feature extraction, whereas DL-based models perform the task automatically in the hidden layers. Extracting features intelligently is nonetheless essential when detecting cyberbullying from text and images [15][16][17][18]. In addition, understanding the context of the text or images increases the chance of achieving better accuracy [19][20][21][22][23]. When domain knowledge is minimal, the performance of ML algorithms is prone to deteriorating over time on complex problems [24].
Furthermore, conventional ML models suffer from limited adaptability and transferability. For instance, if we train a model on a YouTube dataset and reuse it on a Twitter dataset, a traditional ML model will not provide the desired results. DL models outperform ML models when we encounter complex linguistic expressions such as those found in harassment and cyberbullying [25].
Figure 1 shows a typical cyberbullying detection pipeline, covering the steps from social media data input to cyberbullying detection. In this pipeline, the input dataset can contain either text data or image data collected from social media. Cyberbullying content in image data can be extracted using two methods: optical character recognition (OCR) and image similarity. The raw text data, on the other hand, are sent to data preprocessing to improve the data quality. Various text preprocessing steps, including data cleaning, tokenization, stemming, lemmatization, and stop word removal, are used to reduce dimensionality. After preprocessing, feature extraction transforms the raw data into numerical features, which are more meaningful to a machine learning model. Next, the outcome is sent to a deep-learning-based cyberbullying detection module. Finally, the content is classified as bully or non-bully. Consider the case where we need to identify cyberbullying on a social media site such as Twitter. A text-based model would analyze the text of tweets to spot any words or phrases that suggest harassment, aggression, or discrimination. Tweets containing the words "kill yourself", "ugly", or "stupid", for instance, might be labeled as cyberbullying. Similarly, consider identifying cyberbullying on an Instagram-like photo-sharing app. Images with offensive gestures, hate symbols, or violent scenes are just a few examples of the visual cues that an image-based model can use to analyze the content of images and detect cyberbullying.
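The text preprocessing stage of the pipeline described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation used in any reviewed study: the stop-word list and the function names (`clean`, `tokenize`, `remove_stop_words`, `preprocess`) are our own assumptions, and stemming and lemmatization are omitted because they normally rely on an external NLP library.

```python
import re

# Tiny illustrative stop-word list (an assumption; real systems use much larger lists).
STOP_WORDS = {"a", "an", "the", "is", "are", "you", "so", "to", "and"}


def clean(text):
    """Lowercase the text, strip URLs and @mentions, and drop non-letter symbols."""
    text = text.lower()
    text = re.sub(r"https?://\S+|@\w+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split cleaned text into whitespace-delimited tokens."""
    return text.split()


def remove_stop_words(tokens):
    """Discard very frequent words that carry little signal."""
    return [t for t in tokens if t not in STOP_WORDS]


def preprocess(text):
    """Cleaning -> tokenization -> stop-word removal."""
    return remove_stop_words(tokenize(clean(text)))


print(preprocess("@user You are SO stupid!!! http://t.co/xyz"))  # ['stupid']
```

The resulting token list is what a feature extraction step (for example, one of the embedding techniques in Section 4) would then convert into numeric form.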
In this paper, we have reviewed the research works that focus on automated cyberbullying detection using either DL models [26][27][28][29] or supervised learning techniques [30,31]. A few papers focused on DL models, but none concentrated on the frameworks and application areas for detecting cyberbullying. Prior survey papers do not present an overall ecosystem of cyberbullying detection methods that conveys the comprehensive structure of DL-based solutions. These studies do not list publicly available datasets for cyberbullying detection, and only a few survey papers address the open issues and challenges [26,27]. The absence of a globally acknowledged definition of cyberbullying is one of the major issues in the studied literature on automated cyberbullying detection. In this study, we first develop a clear taxonomy for our DL-based cyberbullying ecosystem. Figure 2 shows a graphical presentation of the detailed taxonomy. The ecosystem broadly encompasses data representation techniques, DL models, and DL frameworks. To predict cyberbullying behavior, we collect datasets from the Internet in which the content can be text or images. Machine learning algorithms are typically fed vectorized numeric data, but natural language and images are non-numeric. To make the data compatible with machine learning algorithms, we use data representation techniques to present them in numeric form. To convert text into numeric form, we generally use two types of word-embedding techniques: pretrained and non-pretrained. Pretrained word-embedding techniques include, but are not limited to, Word2Vec [32,33], GloVe [34], ELMo [35], fastText [36], and BERT [37], whereas one-hot encoding and TF-IDF [38] are non-pretrained. To convert image data to numeric data, we use effective, Graph, or ANN-based methods. After converting the non-numeric data into numeric data, we deliberately choose a suitable deep-learning algorithm. If the problem requires a generative model, then we might choose the Boltzmann Machine (BM) [39], Deep Belief Network (DBN) [40], Deep Autoencoder (DAE) [41], or Generative Adversarial Network (GAN) [42]. If the problem demands a discriminative model, a Convolutional Neural Network (CNN) [43] or Recurrent Neural Network (RNN) [44] might be chosen. However, the problem may call for a hybrid model if the dataset is multi-modal (e.g., image, text, speech) or if fusing multiple techniques may enhance the accuracy. Several popular deep learning frameworks (e.g., TensorFlow, Torch, Theano) have been introduced to simplify building and training Deep Neural Networks by providing pre-built libraries and abstractions. Note that Sections 4-7 present the contents of this cyberbullying ecosystem in detail.
Although numerous studies have been conducted on cyberbullying, only a limited number of survey papers on DL-based cyberbullying detection have been found in the literature. We reviewed existing surveys that cover various aspects of cyberbullying. In this paper, we present several applications related to cyberbullying detection, mainly in social media, YouTube, Wikipedia, and Q/A discussion forums, using RNN- and CNN-based techniques. Users communicate with each other through these virtual platforms, and perpetrators exhibit their malicious behavior through digital devices. We also present the datasets that are used in various cyberbullying detection applications, which have different modalities such as text, photographs, collages, and memes. Finally, we discuss the challenges and open issues of detecting cyberbullying, which may be thought-provoking for future researchers. Since cyberbullying is strongly tied to human psychology, how users respond to it is an interesting question because of its multi-modal nature, i.e., image, emotion, culture, language, etc. The motivation of this review paper lies in scrutinizing the shortcomings of state-of-the-art approaches to the automatic detection of cyberbullying. In addition, we have identified the gaps in the existing literature and have covered the latest improvements in the above aspects. We conduct a complete review of the existing problems, the shortcomings of traditional representations and ML models, contemporary frameworks, available datasets, and the scope of future work. In summary, this paper makes the following salient contributions:

• We present a DL-based cyberbullying defense ecosystem with the help of a taxonomy. We also discuss data representation, models and frameworks for DL techniques.
• We compare several RNN, CNN, attention, and their fusion-based cyberbullying detection studies in the existing literature.
• We analyze several text and image datasets extracted from social media and virtual platforms related to cyberbullying detection.
• We identify the challenges and open issues related to cyberbullying.
The organization of the paper is presented graphically in Figure 3. In Section 2, we briefly present the existing surveys related to our work. Section 3 describes the methodology of the review. Section 4 discusses the data representation techniques. Sections 5 and 7 present DL-based models and frameworks, respectively. Sections 6 and 8 present applications of DL models in cyberbullying and several popular datasets regarding cyberbullying, respectively. Section 9 presents the challenges and open issues of DL models in cyberbullying.

Related Works
This section briefly discusses a few notable review papers on machine-learning-based cyberbullying detection. We also compare our work with these existing works to show its novelty. The survey papers are presented in order of their year of publication.
Haidar et al. [30] first detected cyberbullying in Arabic. They also offered a brief background on cyberbullying, related technologies, and an exhaustive survey on multilingual cyberbullying detection techniques. They finally proposed a plan to address the problem of Arabic cyberbullying.
Salawu et al. [26] presented a systematic review on cyberbullying detection approaches. They divided the existing approaches into four categories based on their substantial literature review: supervised learning, lexicon-based, rule-based, and mixed-initiative approaches. Supervised learning-based techniques commonly use classifiers such as SVM and naive Bayes to create predictive models for cyberbullying detection. Lexicon-based techniques identify cyberbullying using word lists and the presence of words within the lists. Rule-based approaches compare text to predetermined rules to identify bullying. Mixed-initiative approaches combine human-based reasoning with one or more of the above-mentioned approaches to identify bullying. The authors discovered two significant obstacles in cyberbullying detection research: the shortage of labeled datasets and academics' failure to take a holistic approach to cyberbullying while creating detection systems. Their study effectively presents the current state of cyberbullying detection research with traditional ML techniques.
Rosa et al. [27] analyzed the existing research on automatic cyberbullying detection in depth. Their findings revealed that cyberbullying is frequently misinterpreted in the literature, resulting in erroneous systems with limited real-world utility. Furthermore, there is no standard methodology for evaluating these systems, and the natural imbalance of datasets continues to be an issue. They identified the future trend of research on the issue toward a position more consistent with the phenomenon's description and depiction, allowing future systems to be more practical and focused.
Al-Garadi et al. [31] studied existing publications on detecting aggressive behavior using ML approaches. They summarized and identified the critical factors for detecting cyberbullying through ML techniques, especially supervised learning. For this purpose, they considered evaluation measures such as accuracy, precision, recall, F-measure, and the area under the curve for assessing models of cyberbullying behavior.
Elsafoury et al. [29] reveal some challenges and constraints of cyberbullying detection. Their paper presents a systematic literature review on automated cyberbullying detection that covers all the steps in the ML pipeline. They also demonstrate that utilizing slang-based word embedding improves the detection of cyberbullying.
Kim et al. [28] give a thorough analysis of the past ten years of computational research concentrating on developing ML models for cyberbullying detection. Using a saturated corpus of 56 papers, they examined how humans are involved and considered, directly or indirectly, in building these detection algorithms. The authors focused on the congruence of current algorithms with theories of cyberbullying. They then examined whether and how current algorithms have incorporated humans. Finally, they shed light on how academics have envisioned using current detection algorithms. Their evaluation reveals essential gaps in this research area due to the lack of human-centeredness in algorithm creation.
A comparison of automated cyberbullying detection methods, including data annotation, preprocessing, and feature engineering, is presented in the study by Al-Harigy et al. [45]. Emoji use in cyberbullying detection and the application of self-supervised learning to annotation are also covered. Due to the detrimental effects of cyberbullying, particularly on social media where anonymity can foster hate speech and cyberbullying, the paper emphasizes the need for efficient cyberbullying detection.
We have summarized the above-mentioned studies in Table 1, where the existing surveys of machine-learning-based cyberbullying detection are compared with various features of our work. We have also compared these studies with ours according to their methodology of conducting the systematic review, as illustrated in Table 2. The limitations of the existing surveys in the area of detecting cyberbullying using deep learning are shown in Table 1. To the best of our knowledge, there is no existing survey of deep-learning-based cyberbullying detection, because the majority of survey papers in the field are outdated. The existing surveys contain very little discussion of the applicability of deep learning models to this problem and, as shown in Table 1, the majority of them did not discuss the strengths and weaknesses of the models in the context of classifying cyberbullying. Since it was not the primary focus of the existing studies, the taxonomy of deep-learning-based cyberbullying classification is not covered in any existing survey. A taxonomy helps in organizing and adding clarity to complex ideas by categorizing them into practical categories. When complex concepts are broken down into smaller, more manageable parts, it is easier to understand and communicate ideas. A thorough discussion of taxonomy is crucial for that purpose. The majority of the existing survey papers omitted image-based data representation techniques, although each paper briefly discussed text-based data representation techniques. Since it is now necessary to detect cyberbullying from images, we discuss them in our paper. Selecting an appropriate framework from the wide choice available is also crucial for implementing the models robustly when classifying cyberbullying. In contrast to the existing studies, which lack a discussion of frameworks, our study explicitly states the applicability of various frameworks based on the problem. Another crucial factor is the accessibility of the datasets mentioned in the studies, without which it would be challenging for researchers to assess the viability of their research hypotheses. We also discuss cultural diversity, data representation, multimedia and multilingual content, and the impact on mental health as part of the discussion of challenges and future trends. The majority of existing studies did not go into detail about these issues. We therefore include them in our study, because it is essential to fully understand the difficulties and potential future trends before beginning any work.
Table 2 depicts the comparison of methodology with the existing surveys. From the keywords of each existing survey, it is clear that none has focused on deep-learning-based cyberbullying detection, which is a necessity nowadays, as deep learning models surpass traditional machine learning models. Additionally, the most recent of these surveys dates from 2020, whereas the current year is 2023. A three-year gap between survey papers is significant. Note that deep-learning-based ideas for detecting cyberbullying have emerged during this period.

Methodology
We are particularly interested in relevant English-language articles from reputed journals and conferences, as well as patents, published between January 2017 and January 2023 and indexed in academic databases (e.g., IEEE Xplore, ScienceDirect, ACM Digital Library, Wiley, Springer Link, Taylor & Francis, MDPI, etc.).
Figure 4 shows that we conducted a comprehensive search for related articles on Google Scholar, using various combinations of initial keywords such as "cyberbullying" and "deep learning", "cyberbullying" and "detection", "cyberharassment" and "deep learning", and "social media" and "cyberbullying". After screening 1331 article titles, we removed duplicate content and subsequently excluded low-tier journals and conferences. In the third round, we excluded articles that did not align with our research content, and finally, we shortened our list further by excluding contributions that were deemed insignificant.
We have selected 63 relevant articles for inclusion in this paper, as they closely align with the focus of our study. We have exclusively included primary research in our review. To further enhance our search, we have conducted an additional search using keywords such as "deepfake" and "cyberbullying", focusing on the subfields of title, abstract, and keywords, spanning the period of January 2017 to January 2023.

Data Representation Techniques
In many situations, we use an independent representation of words or images as input to the DL network. If these representations capture the words or images well, the predictive performance is expected to improve. Choosing a good representation technique is therefore important, since it affects the overall performance of the DL model. In this section, we present the major data representation techniques for the modalities (i.e., text and meme) through which prominent cyberbullying attacks occur. Note that data representation techniques are shown as the left-most branch of our taxonomy in Figure 2.
In the following sections, we first discuss different word-embedding techniques to represent text data: One-hot encoding, TF-IDF, Word2Vec, GloVe, ELMo, fastText and BERT.

One-Hot Encoding
One-hot encoding is a technique for converting categorical input (i.e., words) into numeric vectors so that ML algorithms can use it. The majority of ML algorithms cannot deal with categorical data directly. This technique transforms a categorical variable into a set of binary indicator variables, that is, a binary vector whose entries are 0 or 1, with a single 1 marking the category. The number of columns in this method is equal to the number of classes in the category. This approach is useful for converting data so that they can be utilized for ML. However, it has been criticized because it simply adds more columns: the dataset becomes massive, and an algorithm that must handle so many columns may lose accuracy.
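As a minimal sketch of the idea, the following example builds a vocabulary from a toy token list and maps each token to its one-hot vector. The helper names `build_vocab` and `one_hot` are hypothetical and chosen only for illustration.

```python
def build_vocab(tokens):
    """Map each distinct token to a column index (sorted for reproducibility)."""
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}


def one_hot(token, vocab):
    """Return a binary vector with a single 1 in the token's column."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec


tokens = ["you", "are", "ugly", "you"]
vocab = build_vocab(tokens)  # {'are': 0, 'ugly': 1, 'you': 2}
for tok in tokens:
    print(tok, one_hot(tok, vocab))
```

The vector length equals the vocabulary size, which illustrates why the representation grows quickly for realistic corpora.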

TF-IDF
Term Frequency-Inverse Document Frequency (TF-IDF) [38] is used to determine how relevant a term is in a document, with word relevance referring to the amount of information the term provides about its context. Term frequency (TF) is a metric that quantifies how frequently a term appears in a document. If a term appears more frequently in a text than other terms, it is considered more relevant to the content. The inverse document frequency (IDF) score is calculated by dividing the total number of documents in the collection by the number of documents that contain the term, usually followed by taking the logarithm of this ratio. The IDF reduces the weight of terms that appear often across a collection of documents. Overall, TF-IDF, which is the product of the TF and IDF scores, is used to identify the most significant and informative words in a text. In our context, we found a few cyberbullying detection works using TF-IDF [46,47].
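The computation can be sketched in a few lines of Python. This is one common variant (relative term frequency and an unsmoothed natural-log IDF), chosen here as an assumption; libraries often apply smoothing or normalization, so their scores may differ. The toy documents are invented for illustration.

```python
import math


def tf(term, doc):
    """Relative frequency of the term within one tokenized document."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Log of (number of documents / number of documents containing the term).

    Assumes the term occurs in at least one document.
    """
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)


def tf_idf(term, doc, docs):
    """TF-IDF score: the product of the TF and IDF scores."""
    return tf(term, doc) * idf(term, docs)


docs = [
    ["you", "are", "so", "ugly"],
    ["you", "are", "kind"],
    ["so", "kind", "of", "you"],
]
print(tf_idf("you", docs[0], docs))   # 0.0, since "you" occurs in every document
print(tf_idf("ugly", docs[0], docs))  # positive, since "ugly" occurs in one document only
```

A word that appears in every document receives a score of zero, whereas a word confined to one document is weighted up.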

Word2Vec
Word2Vec [32,33] is a method for reconstructing the linguistic contexts of words. The method uses a neural network with two layers. A large text corpus is used as input, and the result is a vector space with hundreds of dimensions. Each unique word in the corpus is assigned a corresponding vector in this space. Word vectors are arranged in such a way that words with similar contexts or nearly identical meanings are clustered together in the space. Word2Vec is a computationally fast approach for learning word embeddings from raw text. It offers two separate methods: the Continuous Bag-of-Words (CBOW) model, which predicts a target word from its surrounding context words, and the Skip-Gram model, which predicts the context words from the target word. The architecture of these two methods is shown in Figure 5.
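To illustrate the difference between the two methods, the sketch below generates the training examples that each would see for a short sentence. It only constructs the (input, output) pairs; the neural network that learns the vectors is not shown. The function names and the window size are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=1):
    """Skip-Gram: one (target, context_word) pair per neighbor within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=1):
    """CBOW: one (context_words, target) example per position."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs


sentence = ["that", "shirt", "is", "ugly"]
print(skipgram_pairs(sentence))
print(cbow_pairs(sentence))
```

Skip-Gram produces one example per target-context pair, while CBOW groups all context words of a position into a single example.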

GloVe
GloVe [34], which stands for Global Vectors for Word Representation, is an unsupervised ML technique. It was created at Stanford to construct word embeddings by aggregating a corpus's global word-to-word co-occurrence matrix. The resulting embeddings reveal intriguing linear substructures of the word vector space.
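The global co-occurrence statistics that GloVe starts from can be sketched as follows. This example only counts how often two words appear within a fixed window of each other; it uses plain, unweighted counts as a simplifying assumption and does not perform the subsequent fitting of word vectors to these counts. The function name and toy corpus are hypothetical.

```python
from collections import Counter


def cooccurrence(corpus, window=2):
    """Count symmetric word-word co-occurrences within a sliding window."""
    counts = Counter()
    for sentence in corpus:
        for i, word in enumerate(sentence):
            # Look back at most `window` positions so each pair is visited once.
            for j in range(max(0, i - window), i):
                counts[(word, sentence[j])] += 1
                counts[(sentence[j], word)] += 1
    return counts


corpus = [["you", "are", "ugly"], ["you", "are", "kind"]]
counts = cooccurrence(corpus)
print(counts[("you", "are")])    # 2
print(counts[("ugly", "kind")])  # 0, the two words never share a window
```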

ELMo
The acronym ELMo [35] stands for Embeddings from Language Models. This word-embedding approach represents a sequence of words as a corresponding sequence of vectors. Character-level tokens are used as inputs to a bi-directional LSTM, which constructs word-level embeddings. Because each vector depends on the whole input sequence, the same word can receive different embeddings in different contexts.

FastText
The Facebook research team created FastText [36] as a library. It has two uses: efficient learning of word representations and sentence classification. The method supports both supervised and unsupervised representations of words and sentences. For example, if a Facebook user posts a status update about purchasing a bike, after a few moments they may see an ad related to bikes. Facebook uses the text data to serve better ads by using FastText. Figure 6 shows the word embedding for 3-gram in FastText.

BERT
Bidirectional Encoder Representations from Transformers (BERT) [37] is based on the transformer architecture. BERT is pre-trained on a vast corpus of unlabeled text, including Wikipedia (2500 million words) and BooksCorpus (800 million words). The success of BERT mainly lies in this pre-training step, which uses a large amount of text. The BERT model gathers information from both the left and right sides of a word's context in a sentence. Figure 7 shows an example of bi-directionality. Predicting the nature of a word from only the words to its left or only the words to its right can be ambiguous; by using both sides of the term, BERT can infer its meaning more precisely. BERT has two variants: BERT base and BERT large. BERT base has 12 layers of transformer blocks, 12 attention heads, and 110 million parameters. BERT large, on the other hand, has 24 transformer layers, 16 attention heads, and 340 million parameters.
Figure 8 shows the architecture of BERT base and BERT large. Figures 9 and 10 show the input representation of the BERT model and the output embedding of BERT base, respectively. BERT has been pre-trained on two natural language tasks. The first is Masked Language Modeling (MLM), which learns relationships between words. The second is Next Sentence Prediction (NSP), which is necessary to comprehend how sentences relate to one another. There are some variations of BERT that are also used for the cyberbullying detection problem.
RoBERTa (Robustly Optimized BERT Pretraining Approach): The authors modified the BERT pre-training procedure to (1) train the model longer, with bigger batches, over more data; (2) remove the next sentence prediction objective; (3) train on longer sequences; and (4) dynamically change the masking pattern over the training data. The authors used a novel dataset, CC-NEWS, and suggested that if more data are used during pre-training, downstream tasks can be improved further. Yani et al. [49] utilized RoBERTa to detect cyberbullying on the popular social media platform Twitter. After experimental analysis, they obtained an accuracy score of 86.9% and an F1 score of 77.5%.
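The masking step behind MLM, and the dynamic masking mentioned above, can be sketched as follows. The sketch assumes the commonly described BERT recipe (select about 15% of tokens; replace 80% of those with a mask token, 10% with a random token, and leave 10% unchanged); these ratios, the function name `mask_tokens`, and the toy vocabulary are not taken from the text above and serve only as an illustration.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Corrupt a token sequence for masked language modeling.

    Returns (corrupted, labels); labels[i] holds the original token where a
    prediction is required and None elsewhere.
    """
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)               # replace with the mask token
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # replace with a random token
            else:
                corrupted.append(tok)                # keep the original token
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels


tokens = "nobody likes you because you are so ugly".split()
vocab = sorted(set(tokens))
rng = random.Random(0)
# Dynamic masking: draw a fresh mask each time the sentence is seen.
for epoch in range(3):
    print(epoch, mask_tokens(tokens, vocab, rng)[0])
```

With static masking, the corrupted sequence would be generated once during data preparation and reused in every epoch; calling the function anew each epoch yields a different pattern each time.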
ALBERT (A Lite BERT) [50]: Improving the model performance by enlarging the model is not always possible due to GPU/TPU memory limitations and longer training times. To mitigate the issue, the authors introduced two parameter-reduction techniques that lower the memory consumption and increase the training speed of BERT. A number of studies show that ALBERT performs better than BERT on the GLUE, RACE, and SQuAD benchmarks. Tripathy et al. [51] used an ALBERT-based fine-tuning model for cyberbullying detection, as it does not require large amounts of data for fine-tuning. The experimental results show that their proposed method outperformed the CNN + Word2Vec, CNN + GRU, and BERT implementations, achieving an F1 score of 95%.
ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) [52]: BERT corrupts the input by replacing some tokens with MASK and trains a model to reconstruct the original tokens. The authors of ELECTRA instead corrupted the input by replacing some tokens with plausible alternatives sampled from a small generator network and trained the model to identify which tokens had been replaced, which improves the model performance significantly.
DistilBERT (Distilled BERT) [53]: DistilBERT is a smaller, pre-trained, general-purpose language representation model that is a faster variant of the BERT model. It can be fine-tuned with good performance on a wide range of tasks and is designed for low-resource environments. It achieves performance similar to the original BERT model while using fewer resources. The approach leveraged knowledge distillation during the pre-training phase and reduced the size of general BERT by 40%, yet it retained 97% of its language understanding capability. The model is faster, smaller, and lighter to pre-train.
Several studies [54][55][56] used DistilBERT for cyberbullying detection in social networks. Their experimental results show promising performance when DistilBERT is used as the word-embedding technique as well as the fine-tuned classifier.
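The idea of knowledge distillation mentioned above can be illustrated with a generic soft-target loss: the student is trained to match the teacher's temperature-softened output distribution. This is a minimal sketch of the general technique under our own assumptions (function names, temperature value, toy logits); it is not the exact training objective of DistilBERT, which the text above does not specify.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))


# Hypothetical logits for the classes (bully, non-bully).
teacher = [2.0, 0.5]
print(distillation_loss([2.0, 0.5], teacher))  # student agrees with the teacher
print(distillation_loss([0.5, 2.0], teacher))  # student disagrees: larger loss
```

A higher temperature flattens both distributions, so the student also learns from the relative probabilities the teacher assigns to the less likely classes.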
MobileBERT [57]: BERT suffers from a heavy model size and high latency, so it cannot be deployed on resource-limited devices such as mobile phones. Sun et al. developed MobileBERT to compress and accelerate the BERT model; it is task-agnostic and can be applied to various downstream NLP tasks by simple fine-tuning. The model is carefully designed to balance self-attention and feed-forward networks. To train MobileBERT, the authors first trained a specially designed teacher model and then transferred the knowledge from the teacher to MobileBERT, which is 4.3× smaller and 5.5× faster than the general BERT model.
On a variety of natural language understanding tasks, BERT has produced outstanding results. In numerous studies, BERT outperformed conventional machine learning algorithms, achieved state-of-the-art performance, and demonstrated promising results in the detection of cyberbullying. Studies [22,[58][59][60] show that BERT achieved high accuracy and F1 scores when classifying cyberbullying in various types of online content, such as tweets and comments. The performance of BERT may vary depending on the dataset and task. The following state-of-the-art works illustrate the extensive recent research on using BERT for cyberbullying detection.
Using a pre-trained BERT model along with deep learning (DL) models, Mazari et al. [58] proposed a multi-aspect hate speech detection approach based on multi-label text classification. The DL models are created by stacking Bidirectional Long Short-Term Memory (Bi-LSTM) and/or Bidirectional Gated Recurrent Unit (Bi-GRU) layers on GloVe and FastText word embeddings. The proposed approach, which detects hate speech on social media, achieved a ROC-AUC score of 98.63%. Coban et al. [59] used BERT to classify Turkish-language Facebook activity content. After a thorough experimental analysis, they reported BERT as the best classifier for the problem, because it produced the best results among the conventional machine learning and deep learning methods compared, with a macro F1 score of 92.8.
Paul et al. [22] applied BERT to three real-world datasets: Formspring, Twitter, and Wikipedia. According to their experimental findings, BERT performs significantly better in terms of F1 scores than baseline models such as CNN, RNN + LSTM, and Bi-LSTM with attention. In this study, they addressed the cyberbullying detection and classification problem with state-of-the-art performance on three widely used datasets. To better represent the semantics of words, Feng et al. [61] suggested a BHF model that makes use of BERT and a Hierarchical Attention Network (HAN). According to their experimental findings, the combination of BERT and HAN (known as BHF) provides a more precise semantic representation of each word, leading to higher accuracy scores.
Ishaq et al. [62] developed a domain-specific BERT model for identifying hate speech posted online. The suggested method introduces "HateSpeechBERT (HS-BERT)", a domain-specific language representation model based on BERT and pre-trained on substantial datasets of hate speech. By comparing its performance against the general-domain BERT through extrinsic and intrinsic evaluations, they demonstrate that HS-BERT provides state-of-the-art results when compared to other models. To assess the effectiveness of BERT in detecting cyberbullying in a social media context, Mozafari et al. [60] used BERT to categorize cyberbullying in two social media datasets. In terms of precision, recall, and F1 scores, their experimental findings are encouraging when compared to prior research in this area. Overall, these studies demonstrate the effectiveness of using BERT, alone and in combination with other models, for cyberbullying detection, and they highlight the potential for future research in this area.
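Since the studies above report precision, recall, and F1 scores, the following sketch shows how these measures are computed for a binary bully/non-bully classifier. The labels in the example are invented purely for illustration and do not correspond to any reported result.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive (bully) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical labels: 1 = bully, 0 = non-bully.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                             # 0.75
print(precision_recall_f1(y_true, y_pred))  # each value is 2/3
```

Because cyberbullying datasets are typically imbalanced, accuracy alone can look high even when many bullying posts are missed, which is one reason F1 is widely reported.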
A summary of different word-embedding techniques used in cyberbullying detection is shown in Table 3. We have compared these techniques based on context-sensitive, traditional ML-based, RNN-based, transformer-based, and transfer-learning-based models. We can observe that several word-embedding methods depend on the context. Currently, the state-of-the-art word embedding is BERT [37], which provides satisfactory outcomes during model building in cyberbullying-related problems. In our study, we have noticed that BERT, the only transformer-based method, is the most promising word-embedding technique for dealing with text-based cyberbullying problems. We have also observed that all the ML-based word-embedding techniques are pre-trained. BERT is also pre-trained using unlabeled data collected from the English Wikipedia and BooksCorpus, which contain 2500 million and 800 million words, respectively. One-hot [74] and TF-IDF approaches are utilized in a few studies on the identification of cyberbullying. However, these techniques perform worse since they tend to be context insensitive. The studies mostly made use of Word2Vec, GloVe, and ELMo. However, BERT has lately seen sharp growth and outstanding outcomes in its use as a word-embedding approach.

Efficacy of Various Embeddings for Detecting Cyberbullying
In this subsection, we provide a concise overview of the efficacy of different embeddings in identifying cyberbullying behaviors. The ability of Word2Vec to grasp the semantic meaning of words is crucial in detecting cyberbullying. Aldhyani et al. [75] present a comparative analysis of different embeddings along with Word2Vec. For example, the word "ugly" can be used in cyberbullying to insult someone's physical appearance and cause emotional harm. However, it can also be used in a harmless context, such as expressing dislike for a piece of clothing by saying "That shirt is so ugly". Word2Vec embeddings can help a model differentiate harmful from harmless messages in cyberbullying detection. The mapping from a target word to its context words implicitly builds the relationships between words into the vector space, and these relationships can be inferred from the word vectors. GloVe uses global matrix factorization to generate word vectors, which can be particularly useful in examining the context of words. One of the advantages of using GloVe is that it can effectively grasp the associations between words and their co-occurrence patterns, allowing for a more subtle understanding of the meaning behind the text. For example, a model can be trained to categorize a text as cyberbullying or non-cyberbullying based on the presence or absence of certain word clusters identified by GloVe.
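To make the target-to-context mapping concrete, the following minimal sketch generates the (target, context) training pairs that a skip-gram model of the Word2Vec family learns from. The function name `skipgram_pairs` and the window size are illustrative assumptions, not taken from the reviewed studies, and the training of the vectors themselves is omitted.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbours.

    `window` is the number of positions considered on each side of the
    target word (an illustrative choice, not a recommended value).
    """
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


if __name__ == "__main__":
    sentence = "that shirt is so ugly".split()
    for target, context in skipgram_pairs(sentence, window=1):
        print(target, "->", context)
```

Because "ugly" is paired with different neighbours in insulting and in harmless sentences, the learned vectors reflect the contexts in which a word typically occurs.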
FastText is similar to Word2Vec but uses subword information to generate embeddings. The main advantage of FastText is its capability to handle morphological variations in text, which are common in cyberbullying messages, and thus it can make more precise predictions. Several studies have compared the effectiveness of various word-embedding techniques, as summarized in the following paragraphs.
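The subword idea can be illustrated with a short sketch that extracts character n-grams with boundary markers, in the spirit of FastText. The helper names `char_ngrams` and `ngram_overlap` are hypothetical, and the sketch omits the hashing of n-grams and the training of the vectors themselves.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return character n-grams of `word`, wrapped in '<' and '>' markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


def ngram_overlap(a, b, n_min=3, n_max=6):
    """Jaccard overlap of the n-gram sets of two words (illustrative measure)."""
    ga = set(char_ngrams(a, n_min, n_max))
    gb = set(char_ngrams(b, n_min, n_max))
    return len(ga & gb) / len(ga | gb)


if __name__ == "__main__":
    print(char_ngrams("ugly", 3, 3))  # ['<ug', 'ugl', 'gly', 'ly>']
    # A misspelled variant still shares several subwords with the original.
    print(ngram_overlap("stupid", "stuupid"))
```

Since a word vector is built from its n-gram vectors, a deliberately misspelled or inflected word still receives an embedding close to that of the original word.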
Pericherla et al. [76] conducted a study to evaluate the performance of different word-embedding techniques, including Bag of Words (BoW), TF-IDF, Word2Vec, GloVe, FastText, and several language models (ALBERT, ELECTRA, GPT-2, XL-NET, and RoBERTa), in detecting cyberbullying using a Twitter dataset labeled for sexism and racism. The study found that the majority of language models achieved high F1 scores compared to traditional word embeddings such as BoW and TF-IDF, as well as semantic word embeddings such as Word2Vec, GloVe, and FastText.
Eronen et al. [77] investigated the efficacy of linguistically backed word embeddings in detecting cyberbullying. They trained Word2Vec Skip-Gram embeddings with encoded linguistic information, as well as using dependency structure-based contexts. Their findings suggested that lemmatization can be an effective preprocessing method for increasing detection efficacy with pre-trained word embeddings.
Alhloul et al. [78] conducted a study using lemmatization to extract the roots of each word and utilized the TF-IDF embedding technique. They used a UNICEF dataset of tweets categorized into six classes: age, ethnicity, gender, religion, type of cyberbullying, and non-cyberbullying. Their study reported an accuracy of 97.10% and an F1 score of 97.12% in classifying cyberbullying tweets.
Overall, different embedding techniques can be effective for cyberbullying detection, particularly when they are used with deep learning algorithms. However, as with any machine learning model, their efficacy depends on the data quality and the selection of hyperparameters.
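As a reminder of what TF-IDF computes, the sketch below weights each term by its frequency in a document times the logarithm of the inverse document frequency. It assumes the plain, unsmoothed variant tf × log(N/df); library implementations often use smoothed or normalized variants, and this is not the exact configuration of any reviewed study.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute unsmoothed TF-IDF weights for a list of tokenized documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document

    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    corpus = [["you", "are", "ugly"], ["you", "are", "kind"]]
    for row in tf_idf(corpus):
        print(row)
```

Terms that appear in every document receive a weight of zero, while terms that distinguish a document receive a positive weight. The weights ignore word order and context, which is the context insensitivity noted earlier for this family of techniques.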

Image Data Representation
In the following subsections, we describe several techniques to represent two-dimensional image data such as cognitive image representation, BSP representation, Bio-inspired model representation, MPS representation, and Deep Neural Networks-based image representation.

Cognitive Image Representation
Cognitive image representation [79] is based on the notion that humans recognize images by making successive approximations with increasing resolution for specific regions of interest. Such an image format is appropriate for creating the learning models for the objects, which should be retrieved from picture databases. This method is based on the inverse spectrum pyramid (ISP) decomposition method for image representation, which is a novel way of encoding digital pictures. The picture is decomposed into successive approximations based on any type of 2D orthogonal transform (DCT, WHT, etc.). The obtained transform coefficients are used to construct the spectrum pyramid's successive tiers. This technique enables the creation of interactive systems in which the user may create numerous types of questions. Image archiving, image coding, image transmission systems, remote medical diagnostics, and patient monitoring are only a few examples of important application fields.

BSP Representation
Different images can be represented using a Binary Space Partitioning (BSP) [80] tree. First, the entire image is binarized using the Binary Quaternion Moment-Preserving (BQMP) thresholding approach. Second, a dividing line is chosen to split the output image into two sections, at least one of which is reasonably homogeneous. Third, a color is assigned to each region to represent that portion of the input image. The element values of the representative color are computed as the means of the red, green, and blue components of all the pixel colors in the region. Finally, these color values are stored along with the dividing line parameters and are used as the picture representation at the first partition level. The method continues until no more areas can be partitioned or a set number of iterations has been reached. As a result, at the end of the jth iteration, one has j hierarchical picture representations. Figure 11 shows a BSP tree representation of an image.

Bio-Inspired Model Representation
The bio-inspired model [81] is an image representation model inspired by the non-classical receptive field (nCRF) and reverse control mechanisms found in biological systems. Using a multi-layer neural network based on the human visual system, the model is utilized for image representation and image analysis. The neural model simulates a ganglion cell's non-classical receptive field and its local feedback control circuit, and it can self-adaptively and consistently depict images beyond the pixel level. Experiments on image reconstruction, distribution, and contour detection show that this technique can accurately represent images at a low cost while also producing a compact and abstract approximation that may be used for further image segmentation and integration. This representation schema is very effective in extracting spatial relationships among the various components of images and in highlighting foreground items against the background, particularly in natural images with complex scenarios.

MPS Representation
In addition to being an efficient image coding scheme, MPS [82] provides a flexible semantics-driven image representation that enables many typical operations in visual computing and communications. The MPS is made up of edges that are retrieved and sorted from fine to coarse scales in order. MPS is a type of picture representation that is intermediate in complexity. Many popular image operations, such as classification, restoration, detection, and content-based information extraction, can be performed directly in the MPS framework without first transforming the coded image back to the spatial domain because the representation consists of high-level semantic primitives such as edges of various scales and types. It has usage in compression, scene categorization, and other areas.

Deep Neural Networks-Based Image Representation
In several computer vision applications, DNNs [83] have demonstrated strong image representation performance. Three common building blocks for DNNs are the Restricted Boltzmann Machine (RBM), Auto-Encoder (AE), and Convolutional Neural Nets (ConvNet). Some task-specific DNN designs, such as Convolutional Deep Belief Networks (CDBN), Reconstruction Independent Component Analysis (RICA), and Deconvolutional Networks (DN), are suggested based on these building blocks. Many computer vision tasks, such as handwritten digit identification and object recognition, benefit from these approaches. The main conclusion drawn from this research is that successive layers of DNNs extract different characteristics at different scales, ranging from low-level features to higher-level features.

Optical Character Recognition (OCR)
Optical character recognition (OCR) [84] enables computers to read printed or handwritten text and to turn it into digital text that can be edited, searched, and analyzed. OCR can be used to analyze text-based content on social media sites, online forums, and messaging services in order to detect cyberbullying.
Studies [85,86] proposed a multimodel cyberbullying detection framework where they applied OCR to detect cyberbullying from image data. In addition, they employed another method named Image Similarity to classify cyberbullying from image data. Kumari et al. [87] utilized OCR to extract text from images to classify cyberbullying in image data. For instance, Instagram uses OCR to find bullying in pictures and captions. The program looks for offensive language in the captions and images, and if it finds any, it notifies the user that their post may be offensive [88]. Similarly, Facebook employs OCR technology to find offensive material such as cyberbullying, hate speech, and graphic images [89]. Gao et al. [90] proposed a novel method for identifying cyberbullying on Chinese social media platforms. The system extracts text from images using a combination of OCR and image processing techniques and then uses deep learning algorithms to categorize the text as either normal or abusive. Borah [91] identifies cyberbullying on Indian social media platforms. The system uses OCR to extract text from images, and machine learning algorithms are then used to determine whether any of the text is threatening or offensive.
OCR technology can instantly analyze text-based content, spotting behavioral patterns and identifying offensive language. Message tonality analysis and the detection of sarcasm and other subtle forms of bullying can also be performed on the extracted text. By extracting text from images using OCR, these systems can detect threatening or offensive language that might otherwise go unnoticed by conventional text-based analysis techniques. These systems can also learn and adapt over time with the help of machine learning algorithms, which enhance their precision and efficiency. Social media platforms and online forums can promote a safer and more positive online environment by utilizing OCR technology for cyberbullying detection.

Deep-Learning-Based Models
For cyberbullying detection, many DL-based models have been applied across different applications. A few popular models are the Deep Neural Network (DNN), Boltzmann machines, the deep belief network, the deep autoencoder, etc. Note that DL-based models form the middle branch of our taxonomy shown in Figure 2. Table 4 presents high-level characteristics of different deep learning models and how these models are suitable for handling cyberbullying-related textual and image-based identification. In addition, it lists the cyberbullying applications of each deep-learning-based model along with its limitations. We briefly describe the popular models below.

Deep Neural Network (DNN)
Deep Neural Networks (DNN) [92] are artificial neural networks with numerous hidden layers between the input and output layers. When an ML system employs multiple layers of nodes to extract high-level features from input data, it is referred to as a Deep Neural Network. It entails transforming the data into progressively more abstract representations. Similar to other neural network architectures, it has synapses, biases, neurons, functions, and weights. DNNs can represent complex non-linear relationships.
Because a DNN is a type of ANN with multiple hidden layers, it can be used instead of a shallow ANN when the model needs to learn more complex non-linear functions. DNNs are feed-forward networks that transfer data from the input layer to the output layer without looping back. As a result, a plain DNN does not perform well in the field of text classification or computer vision. Backpropagation of error is used to update the weights and biases such that the hidden neurons are activated at appropriate values. A DNN is thought to be the key to a solution when the pattern used for discrimination is so complicated that standard statistical and numerical techniques fail.
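The feed-forward behavior described above can be sketched in a few lines: data flow from the input through each hidden layer to the output without any loop. The layer sizes, the ReLU hidden activation, and the single sigmoid output unit are illustrative assumptions; training by backpropagation is not shown.

```python
import math


def dense(inputs, weights, biases):
    """Affine layer: weights[j][i] connects input i to unit j."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, layers):
    """Run one forward pass.

    `layers` is a list of (weights, biases); every layer except the last
    uses ReLU, and the last layer is a single sigmoid unit whose output
    can be read as a score for the positive (e.g., bullying) class.
    """
    activation = x
    for weights, biases in layers[:-1]:
        activation = relu(dense(activation, weights, biases))
    out_weights, out_biases = layers[-1]
    return sigmoid(dense(activation, out_weights, out_biases)[0])


if __name__ == "__main__":
    toy_layers = [
        ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # hidden layer, 2 units
        ([[1.0, 1.0]], [0.0]),                    # output layer, 1 unit
    ]
    print(forward([1.0, 2.0], toy_layers))
```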
Many difficulties can arise with naively trained DNNs, just as they can with ANNs. Overfitting and computation time are two typical problems. To overcome the overfitting problem, a dropout [93] layer can be placed between the hidden layers; another approach is early stopping [94]. Both are regularization techniques. Figure 12 shows the Deep Neural Network architecture.
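Early stopping can be sketched as a simple rule on the validation loss: training halts once the loss has failed to improve for a fixed number of epochs. The function name `early_stopping_epoch` and the `patience` value are illustrative assumptions; practical implementations usually also restore the weights from the best epoch.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops at the first epoch where the validation loss has not
    improved for `patience` consecutive epochs; if that never happens,
    the last epoch is returned as the stopping point.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


if __name__ == "__main__":
    losses = [0.9, 0.7, 0.6, 0.65, 0.66, 0.7, 0.5]
    print(early_stopping_epoch(losses, patience=3))
```

In the example, the later improvement to 0.5 is never reached because training stops first, which illustrates the trade-off in choosing the patience value.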

Boltzmann Machines (BMs)
A Boltzmann machine [39] is a symmetrically linked network of neuron-like units that make stochastic decisions on whether to turn on or off. Boltzmann machines use a simple learning technique to uncover interesting characteristics in the training data that indicate complicated regularities.
In networks with multiple layers of feature detectors, the learning process is slow, but in "restricted Boltzmann machines" with a single layer of feature detectors, it is much faster. By stacking restricted Boltzmann machines and using the feature activations of one as the training data for the next, several hidden layers may be learned quickly.
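The stochastic on/off decision of a single unit can be sketched as follows: the unit sums its bias and the weighted states of the units it is connected to and turns on with the logistic probability of that sum. The sketch assumes a temperature of 1 and binary states; the function names are hypothetical.

```python
import math
import random


def unit_on_probability(states, weights, bias):
    """Probability that a unit turns on, given the states of its neighbours."""
    total_input = bias + sum(w * s for w, s in zip(weights, states))
    return 1.0 / (1.0 + math.exp(-total_input))


def sample_unit(states, weights, bias, rng=random):
    """Make the stochastic decision: 1 (on) or 0 (off)."""
    return 1 if rng.random() < unit_on_probability(states, weights, bias) else 0


if __name__ == "__main__":
    neighbour_states = [1, 0, 1]
    connection_weights = [0.5, 2.0, -0.5]
    print(unit_on_probability(neighbour_states, connection_weights, bias=0.0))
    print(sample_unit(neighbour_states, connection_weights, bias=0.0))
```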
There are different types of Boltzmann machines: Restricted Boltzmann machine [96], Deep Boltzmann machine [97], and Spike-and-slab RBMs [98]. The Boltzmann machine is a relatively broad computing medium in theory. For example, if the machine is trained on images, it may hypothetically model the pattern of images and use that model to complete an incomplete photograph. Figure 13 shows an example of a Boltzmann machine with two hidden units and three visible units.
Boltzmann machines are typically used to tackle diverse computational problems; for example, for a search problem, the weights on the connections can be fixed and used to represent the cost function of the optimization problem [39].

Figure 13. Boltzmann machine [99] (2 hidden units, 3 visible units).

Deep Belief Network (DBN)
Deep belief networks [40] are probabilistic generative models composed of several layers of stochastic, latent variables. Latent variables with binary values are referred to as hidden units or feature detectors. Undirected, symmetric connections link the top two layers, providing an associative memory. Directed connections run down from the higher layers to the lower layers. The states of the units in the lowest layer make up a data vector.
A DBN may learn to probabilistically reconstruct its inputs when trained on a collection of instances without supervision. The layers then serve as feature detectors. After completing this learning step, a DBN can be trained further under supervision to perform a classification task. The procedure of training a DBN model thus consists of two parts. First, each RBM layer is trained in an unsupervised manner so that the input is mapped into distinct feature spaces while as much information as possible is retained. Second, the LR layer is put on top of the DBN as a supervised classifier [100]. Figure 14 shows the architecture of a deep belief network (DBN).

Deep Autoencoder (DAE)
A deep autoencoder (DAE) [41] comprises two symmetrical deep-belief networks: one with four or five shallow layers for encoding and the other with four or five layers for decoding. The deep autoencoder is commonly utilized in image search and data compression. In the case of image compression, deep autoencoders are beneficial for semantic hashing [102]. Deep autoencoders are also useful for topic modeling, that is, statistically modeling abstract subjects that are scattered over a collection of texts.
Many autoencoders are trained using a single-layer encoder and decoder; however, utilizing multiple (deep) encoders and decoders gives several benefits. The computational cost of modeling some functions can be reduced by an order of magnitude when using depth. Depth can also greatly reduce the quantity of training data required to learn some functions [103]. Deep autoencoders produce superior compression compared to shallow or linear autoencoders [104].
Autoencoders are most commonly used for dimensionality reduction and information retrieval, although recent variants have been used for a variety of other tasks. Principal component analysis, dimensionality reduction, retrieval of information, detection of anomalies, processing of images, drug development, popularity forecasting, and machine translation are the major tasks where deep autoencoders are used [103]. Figure 15 shows the architecture of a deep autoencoder.

Generative Adversarial Network (GAN)
Goodfellow et al. [42] proposed the Generative Adversarial Network (GAN), a model that uses a minimax game to train the generative model. GANs are a type of generative modeling that uses DL techniques.
In its training phase, a GAN frames the task as a supervised learning problem with two sub-models. The generator model creates new instances, while the discriminator model attempts to classify them: it tries to figure out whether an instance is genuine (from the domain) or a forgery (generated). The two models are trained in an adversarial zero-sum game until the discriminator model is tricked roughly half of the time, indicating that the generator model is producing believable instances.
The applications of GANs are increasing rapidly in the sectors of fashion, art and advertising, science, video games, malicious applications, and transfer learning. Inverse methods such as bidirectional GAN (BiGAN) [106] and adversarial autoencoders [107] also learn a mapping from the data back to the latent space, whereas the conventional GAN model only learns a mapping from a latent space to the data distribution. Semi-supervised learning, interpretable ML, and neural machine translation are some of the applications of bidirectional models. Figure 16 shows the basic form of a GAN.
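The adversarial game is commonly written as the minimax objective below, in the standard form given by Goodfellow et al. [42]. Here $D$ is the discriminator, $G$ the generator, $p_{\text{data}}$ the data distribution, and $p_z$ the prior over the latent noise $z$; these symbols are introduced only for this illustration.

```latex
\min_{G} \max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)} \left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p_{z}(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]
```

At the equilibrium of this game, the generated distribution matches the data distribution and the optimal discriminator outputs $D(x) = 1/2$ everywhere, which corresponds to the discriminator being fooled about half of the time.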

Recurrent Neural Network (RNN)
A Recurrent Neural Network (RNN) [44] takes sequential data, such as text, as input. For example, given a sentence, an RNN can be used to predict whether the sentence has a positive or a negative context. Spam classification, time-series analysis, sales forecasting, stock forecasting, and many other problems can be addressed with better accuracy by using an RNN. For other models, when the input is a sequence of text, we need to preprocess the raw data and convert them into vectors by using text representation techniques (such as Word2Vec, TF-IDF, Bag of Words, etc.). When a sentence is converted into a single vector in this way, the sequence information is discarded, and once the sequence information is discarded, the accuracy decreases. Since cyberbullying is often analyzed from textual content, an RNN is used to retain this sequence information. An RNN has an internal memory that helps it keep track of the sequence information. In other neural networks, the inputs are vectors that are treated as independent of one another, but in an RNN, every step depends on both the previous output and the current input. In this way, an RNN retains the context of the whole sentence.
In Figure 17, the current hidden state is $h_t = f(h_{t-1}, X_t)$, where $f$ is an activation function such as sigmoid, ReLU, or tanh. The output is then $y_t = W_{ht} h_t$, where $W_{ht}$ is the weight of the output. RNNs have some problems:
1. The training of an RNN is very difficult.
2. An RNN cannot process very long sequences.
3. An RNN does not support long-term memory storage.
For solving these problems, LSTM was introduced.
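The recurrence of a plain RNN can be sketched as below: the hidden state is updated from the previous hidden state and the current input, and the output is read from the last hidden state. The choice of tanh as the activation, the split into two weight matrices, and the parameter names are assumptions made for illustration.

```python
import math


def rnn_step(h_prev, x_t, W_hh, W_xh, b_h):
    """One recurrence step: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    h_t = []
    for j in range(len(b_h)):
        z = b_h[j]
        z += sum(W_hh[j][k] * h_prev[k] for k in range(len(h_prev)))
        z += sum(W_xh[j][i] * x_t[i] for i in range(len(x_t)))
        h_t.append(math.tanh(z))
    return h_t


def rnn_forward(xs, W_hh, W_xh, b_h, W_out):
    """Process a sequence of input vectors and return (last hidden state, output).

    W_out plays the role of the output weight applied to the hidden state.
    """
    h = [0.0] * len(b_h)  # initial hidden state
    for x_t in xs:
        h = rnn_step(h, x_t, W_hh, W_xh, b_h)
    y = [sum(w * h_j for w, h_j in zip(row, h)) for row in W_out]
    return h, y


if __name__ == "__main__":
    # One hidden unit and one input feature, two time steps.
    h_last, y = rnn_forward(
        xs=[[1.0], [0.0]],
        W_hh=[[0.5]], W_xh=[[1.0]], b_h=[0.0], W_out=[[2.0]],
    )
    print(h_last, y)
```

Because the same weights are applied at every step, the influence of early inputs passes through many repeated multiplications, which is one way to see why long sequences are difficult for a plain RNN.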

Long Short-Term Memory
The Long Short-Term Memory (LSTM) [109] network is a modified version of the RNN. LSTM is used for remembering past data over a long period, which mainly benefits backpropagation during training. As we can see in Figure 18, the LSTM network is built around three gates: the forget, input, and output gates. Thus, the LSTM cell contains the following components: 1. Forget gate "f"; 2. Input gate "i"; 3. Output gate "o"; 4. Hidden state "h"; 5. Cell (memory) state "C".
Figure 18 shows the LSTM cell at time step t, with the following notation: × = element-wise multiplication; + = element-wise addition; $C_t$ = current cell memory; $C_{t-1}$ = previous cell memory; $o_t$ = output gate; $f_t$ = forget gate; $\sigma$ = sigmoid function; $w$, $b$ = weight vectors; $h_{t-1}$ = previous cell output; $x_t$ = input vector; $h_t$ = current cell output.
Forget Gate: The previous hidden state carries information from earlier time steps. Based on this information, the forget gate decides which information is important and which is not. It passes the current input $x_t$ and the previous output $h_{t-1}$ into a sigmoid function, which gives a value between 0 and 1. If the information is important, the sigmoid output is closer to 1. This output is then passed to the cell state and multiplied with the previous cell state values. In its standard form, the forget gate is computed as $f_t = \sigma(w_f [h_{t-1}, x_t] + b_f)$.
Input Gate: The current input $x_t$ and the previous output $h_{t-1}$ are passed into a sigmoid activation function, which transforms the values to between 0 and 1, and these values are stored in a vector. In this case, 0 indicates not important and 1 indicates important. The same $x_t$ and $h_{t-1}$ are also passed into a tanh activation function, which transforms the values to between −1 and 1, and a vector of candidate values is created from the tanh outputs. Finally, the outputs of the sigmoid and tanh functions are multiplied and passed to the cell state.
Cell State: The network now has both the input gate and forget gate information, which the cell state requires to decide what to store from the new state. First, the previous cell state is multiplied by the output of the forget gate, and values are dropped where the result of this multiplication is 0. Next, the result of this multiplication is added to the input gate's result to generate the new cell state.
Output Gate: This gate determines the value of the next hidden state, which also carries information from the previous inputs. Here, the current input $x_t$ and the previous output $h_{t-1}$ are again passed into a sigmoid function. Separately, the new cell state value is passed into a tanh function. The outputs of the tanh and sigmoid functions are then multiplied to generate the final result, which is passed on as the next hidden state.
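The four steps above can be combined into one cell update. The sketch below follows the standard LSTM formulation, in which every gate acts on the concatenation of the previous output and the current input; the dictionary layout of the parameters and the key names are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Compute W v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + b_j for row, b_j in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; returns (h_t, c_t).

    `params` maps 'f', 'i', 'g', 'o' to (W, b) pairs, where each W has one
    row per hidden unit and len(h_prev) + len(x_t) columns.
    """
    v = h_prev + x_t  # concatenation [h_{t-1}, x_t]
    f = [sigmoid(z) for z in affine(*params["f"], v)]    # forget gate
    i = [sigmoid(z) for z in affine(*params["i"], v)]    # input gate
    g = [math.tanh(z) for z in affine(*params["g"], v)]  # candidate values
    o = [sigmoid(z) for z in affine(*params["o"], v)]    # output gate

    # New cell state: keep part of the old memory, add part of the candidate.
    c_t = [f_j * c_j + i_j * g_j for f_j, c_j, i_j, g_j in zip(f, c_prev, i, g)]
    # New hidden state: gated, squashed view of the cell state.
    h_t = [o_j * math.tanh(c_j) for o_j, c_j in zip(o, c_t)]
    return h_t, c_t


if __name__ == "__main__":
    zero = ([[0.0, 0.0]], [0.0])  # one hidden unit, one input feature
    params = {"f": zero, "i": zero, "g": zero, "o": zero}
    print(lstm_step([1.0], [0.0], [2.0], params))
```

With all weights set to zero, every gate outputs 0.5 and the candidate is 0, so the cell simply halves its previous memory; this makes the role of the forget gate easy to check by hand.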

Convolutional Neural Network (CNN)
Convolutional Neural Networks (ConvNets or CNNs) [43] are one of the most common types of neural networks used to recognize and classify images. CNNs are commonly utilized in domains such as object detection, facial recognition, and so on. The convolution layer, pooling layer, activation layer, and fully connected layer are the major layers of the CNN architecture.
There are multiple layers in a CNN that process and extract features from the data. The convolution layer contains several filters that perform the convolution operation. A ReLU layer then applies an element-wise operation, and its output is the rectified feature map. The rectified feature map then passes to the pooling layer, which reduces the dimension of the feature map. The pooled feature map is then flattened, which converts the two-dimensional feature map into a one-dimensional vector. The flattened vector then passes to the fully connected layer, which classifies the input image.
The applications of CNNs are in the fields of image recognition, video analysis, natural language processing, anomaly detection, drug discovery, health risk assessment and biomarkers of aging discovery, checkers, computer Go, time series forecasting, cultural heritage, and 3D datasets.
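The convolution, ReLU, pooling, and flattening stages can be sketched with plain lists. The sketch assumes a single filter, no padding, a stride of 1 for the convolution, and non-overlapping 2 × 2 max pooling; as in most deep learning libraries, the "convolution" is computed without flipping the kernel. The fully connected classifier is omitted.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k_h) for j in range(k_w))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Element-wise rectification: negative values become 0."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling with a size x size window."""
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]


def flatten(feature_map):
    """Turn the 2D feature map into a 1D vector."""
    return [v for row in feature_map for v in row]


if __name__ == "__main__":
    image = [[r * 5 + c for c in range(5)] for r in range(5)]
    kernel = [[1, 0], [0, 0]]  # toy filter that copies the top-left pixel
    print(flatten(max_pool(relu(conv2d(image, kernel)))))  # [6, 8, 16, 18]
```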

Hybrid Models (LSTM-CNN, CNN-LSTM)
The LSTM-CNN architecture for cyberbullying detection is a Deep Neural Network model that combines the advantages of both LSTM and CNN to detect cyberbullying. The model performs well when processing sequential data, such as text.
The architecture consists of three main components: an embedding layer, an LSTM layer, and a CNN layer, as shown in Figure 19. The embedding layer converts the input text into a vector representation. After this, dropout can be applied to prevent overfitting. The main structure of this architecture is then built with a Bidirectional LSTM layer followed by a CNN layer. The Bidirectional LSTM is an extension of the traditional LSTM that can improve model performance on sequence classification problems, and the combination allows the model to capture both local and global context information. The LSTM-CNN architecture can be trained to identify messages or posts as cyberbullying or non-cyberbullying. The model takes in a sequence of words and outputs a probability score for each class. The choice between LSTM-CNN and CNN-LSTM in the context of cyberbullying detection may depend on the type of input data. For instance, LSTM-CNN may be more appropriate if the input data are textual, such as social media postings or chat logs, because it can model the temporal dependencies in the data. In contrast, CNN-LSTM might be a better fit if the input data are visual, such as images or videos, because it can model the spatial and temporal relationships in the data.
In [63], both CNN-LSTM and LSTM-CNN were evaluated, and LSTM-CNN performed better than CNN-LSTM. In the CNN-LSTM arrangement, the CNN layer receives the word embeddings as input and pools them to a smaller dimension, so the LSTM layer can only use the ordering of these pooled features, rather than the original word sequence, to learn about the ordering of the input text.

Attention-Based Model
By increasing the accuracy of automated systems, attention-based deep learning models have made a significant contribution to the field of cyberbullying detection [112]. These models can handle various data types, such as text, images, and videos, and can capture contextual information. Attention mechanisms are useful for this goal because they provide interpretability and resilience against noise, since non-bullying content usually obscures cyberbullying behavior. Attention-based deep learning models have been successful in identifying and categorizing cyberbullying behavior on different platforms. We briefly explain some widely used attention-based deep learning models below.

Transformers
Transformers are deep learning models based on attention that have shown effectiveness in detecting cyberbullying [113][114][115]. Although the transformer was originally applied to machine translation, it has since been used for various natural language processing tasks, including the detection of cyberbullying. Transformers process text sequences by using self-attention mechanisms to capture the connections between every pair of elements in the input sequence, enabling them to track long-range dependencies in the data. Therefore, transformer models are suitable for detecting cyberbullying in social media content, which frequently contains long and intricate messages. Researchers have obtained state-of-the-art performance in detecting cyberbullying behavior in social media content by fine-tuning pre-trained transformer models.
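The self-attention operation at the core of the transformer can be sketched as scaled dot-product attention. For brevity, the sketch assumes a single head and uses the token vectors directly as queries, keys, and values; real transformers first apply learned linear projections and use several heads.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Returns (outputs, weights); weights[i][j] says how strongly
    position i attends to position j.
    """
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        scores = [
            sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(d_k)
            for k in K
        ]
        weights = softmax(scores)
        output = [
            sum(w * v[j] for w, v in zip(weights, V))
            for j in range(len(V[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy token vectors
    outputs, weights = self_attention(tokens, tokens, tokens)
    for row in weights:
        print([round(w, 3) for w in row])
```

Every position is compared with every other position in one step, which is why distant words can influence each other directly rather than through a long chain of recurrent updates.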

BERT (Bidirectional Encoder Representations from Transformers)
BERT (Bidirectional Encoder Representations from Transformers) has demonstrated outstanding results in detecting cyberbullying since it is a large language model that can extract contextual information from a given text. Its underlying transformer architecture can capture intricate associations between words and contexts. BERT can accurately detect bullying behavior in social media messages even when the cue is subtle or indirect [22,116,117].

Hierarchical Attention Networks (HAN)
Hierarchical Attention Networks (HAN) have demonstrated promising results at both the document and sentence levels, especially in detecting cyberbullying. HANs are neural networks that concentrate on significant portions of the input text by using an attention mechanism that enables them to collect the most relevant features for classification. HANs have been used to understand the common tone of a social media message and the presence of certain bullying behaviors at the sentence level [61,118,119].

Convolutional Neural Networks with Attention (CNN-Att)
Convolutional Neural Networks with Attention (CNN-Att) have also presented exciting results in the detection of cyberbullying. The CNN extracts the most significant features from the input text. These features are then weighted by an attention mechanism according to how relevant they are to the classification task. By utilizing the patterns and textual structures in social media messages, CNNs with Attention have been used to detect cyberbullying in social media messages [78].

Long Short-Term Memory Networks with Attention (LSTM-Att)
By fusing the capacity of LSTMs to capture long-term dependencies in sequential data with the interpretability of attention mechanisms, Long Short-Term Memory Networks with Attention (LSTM-Att) have demonstrated promising results in the detection of cyberbullying. As recurrent neural networks, LSTM variants can deal with sequential data of different lengths, making them suitable for modeling text data. By incorporating attention mechanisms with LSTMs, the model can focus on the most crucial portions of the input text, improving its ability to understand and categorize bullying on social media. Therefore, LSTM-Att might be a powerful tool for enhancing the precision of automated systems that detect cyberbullying [120]. Attention-Based Bi-LSTM (AB-LSTM) is also an effective neural network model for detecting cyberbullying on social media sites such as Twitter [115,121,122].
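How an attention layer focuses on the most crucial portions of the input can be sketched as a weighted sum over the hidden states produced by a recurrent layer. The sketch assumes a simplified dot-product score against a single context vector; published models typically pass each hidden state through a small learned layer before scoring, and the names used here are hypothetical.

```python
import math


def attention_pool(hidden_states, context):
    """Summarize a sequence of hidden states with attention.

    Each hidden state is scored against `context`, the scores are turned
    into weights with a softmax, and the weighted sum is returned together
    with the weights.
    """
    scores = [
        sum(u * h for u, h in zip(context, h_t)) for h_t in hidden_states
    ]
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(hidden_states[0])
    summary = [
        sum(w * h_t[j] for w, h_t in zip(weights, hidden_states))
        for j in range(dim)
    ]
    return summary, weights


if __name__ == "__main__":
    # Toy hidden states for a three-word message; the second word scores highest.
    states = [[0.1, 0.0], [2.0, 0.0], [0.2, 0.0]]
    summary, weights = attention_pool(states, context=[1.0, 0.0])
    print([round(w, 3) for w in weights], summary)
```

The weights can be inspected after prediction to see which words contributed most, which is the source of the interpretability attributed to these models.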

Gated Recurrent Units with Attention (GRU-Att)
The Gated Recurrent Units with Attention (GRU-Att) model has also demonstrated exciting results in detecting bullying behavior in social media text. By combining Gated Recurrent Units (GRUs) with attention mechanisms, the model can capture long-term dependencies in sequential data faster than the LSTM model while focusing on the important parts of the input text. This makes it possible for the model to more accurately interpret and categorize cyberbullying behavior on social media platforms [71,123,124].
Attention-based deep learning models have demonstrated promising outcomes in the detection and prevention of cyberbullying behavior on social media platforms. These models, including the transformer, GRU with Attention, BERT, HAN, and CNN with Attention, have outperformed conventional machine learning methods and are capable of capturing complex relationships in text data. However, the quality and quantity of the training data, the selection of the hyperparameters, and the unique characteristics and design decisions of the model are all important factors that affect how well these models perform. As a result, even though attention-based deep learning models offer a promising method for identifying cyberbullying, careful assessment and validation of the models are required before applying them to real-world situations to ensure their efficacy, dependability, and ethical soundness.
As shown in Table 4, LSTM, Bi-LSTM, and CNN are frequently used models for the identification of cyberbullying. The LSTM and CNN models have recently been used in a variety of natural language processing (NLP) applications because they produce better results. Convolutional layers, together with max pooling or max-over-time pooling layers, are used in CNN models to extract higher-level features. CNNs may be trained to extract character-level embeddings and n-grams, which are crucial for finding instances of cyberbullying in text. A CNN is an effective technique for detecting cyberbullying because its filters can identify various patterns and elements in the text at various levels of abstraction. However, CNNs have some limitations: capturing long-term dependencies is challenging, a fixed-size input is required, and processing is significantly slower because of the large number of operations. On the other hand, LSTM models can capture long-term dependencies between word sequences, which is a vital requirement in the context of cyberbullying detection [125] and is crucial for spotting abusive language or behavior patterns over time. Additionally, LSTMs can process variable-length input sequences, which is helpful for handling text data of varying lengths, such as comments or posts on social media. GRU is another RNN-based model that is employed to detect cyberbullying. In terms of time and space complexity, GRU is more efficient than LSTM [111], although LSTM can produce more accurate results when working with datasets that contain longer sequences. Because the focus here is cyberbullying detection, where the texts are generally long, GRU is not frequently used for this purpose.
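The difference in space complexity between LSTM and GRU can be made concrete by counting trainable parameters. The sketch assumes the textbook formulation with one bias vector per gate: an LSTM layer has four weight blocks (three gates plus the candidate) and a GRU layer has three (two gates plus the candidate). Some libraries add a second bias vector per block, so their reported counts are slightly higher.

```python
def recurrent_block_params(input_dim, hidden_dim):
    """Parameters of one gate/candidate block: W on [h, x] plus a bias."""
    return hidden_dim * (hidden_dim + input_dim) + hidden_dim


def lstm_params(input_dim, hidden_dim):
    """LSTM: forget, input, and output gates plus the candidate (4 blocks)."""
    return 4 * recurrent_block_params(input_dim, hidden_dim)


def gru_params(input_dim, hidden_dim):
    """GRU: update and reset gates plus the candidate (3 blocks)."""
    return 3 * recurrent_block_params(input_dim, hidden_dim)


if __name__ == "__main__":
    # Illustrative sizes: 100-dimensional embeddings, 64 hidden units.
    print(lstm_params(100, 64))  # 42240
    print(gru_params(100, 64))   # 31680
```

Under these assumptions, a GRU layer needs three quarters of the parameters of an LSTM layer of the same size, which is consistent with its lower time and space cost.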
A Deep Belief Network (DBN) is rarely used on its own for cyberbullying detection. In the majority of cases, the network is used as one of the components of hybrid models [126]. DBN is an unsupervised learning method, as opposed to perceptron and backpropagation neural networks. The noise in the input data can be reduced by using autoencoders, which greatly increases the effectiveness of deep learning models. In addition, autoencoders are frequently employed to address problems in unsupervised learning and to spot anomalies. However, the drawbacks of autoencoders make them ineffective for the goal of cyberbullying detection. These limitations include imperfect decoding, misinterpretation of important variables, and overly lossy compression [127].
Attention models improve performance in cyberbullying detection tasks by letting the model concentrate on the most important sections of the input text. When identifying delicate or nuanced instances of cyberbullying, attention mechanisms can help the model recognize the relevant keywords or phrases in the text. Attention models can also mitigate the issue of vanishing gradients, which can prevent recurrent neural networks such as LSTMs and GRUs from accurately capturing long-term relationships in the text. By enabling the model to selectively attend to certain areas of the input text, attention reduces this problem and enhances the overall performance of the model.
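As a rough illustration of how an attention layer lets a model weight some tokens more than others, the following sketch implements simple dot-product attention pooling in pure Python. The hidden states and the `query` vector are hypothetical numbers; in a trained model the hidden states come from an encoder such as a Bi-LSTM and the query is learned.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states: List[List[float]],
                   query: List[float]) -> Tuple[List[float], List[float]]:
    """Score each token against the query, then average the states by weight."""
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states))
               for d in range(dim)]
    return weights, context


# Hypothetical encoder outputs for a 4-token comment.
states = [[0.1, 0.3], [2.0, 0.1], [0.2, 0.4], [0.0, 0.2]]
# Hypothetical query vector; in practice this is a learned parameter.
query = [1.0, 0.0]

weights, context = attention_pool(states, query)
print([round(w, 3) for w in weights], [round(c, 3) for c in context])
```

The weights form a distribution over tokens, so the largest weight points at the token that contributed most to the pooled representation. Because every token connects directly to the output, the gradient does not have to travel through all intermediate recurrent steps.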
The remaining models are not typically employed on their own in cyberbullying detection tasks. Rather, they are primarily utilized in the development of hybrid models that enhance overall model performance in cyberbullying detection. In a nutshell, CNN and RNN-based models (i.e., LSTM, Bi-LSTM) outperform the other deep learning models (i.e., GRU, DBN, MLPs, BMs, etc.) in the context of cyberbullying detection. This is why these models are widely used for the detection of cyberbullying.
Table 4. DL models with applications in cyberbullying detection along with their strengths and weaknesses.

| DL Models | Used in Cyberbullying Applications | Area of Applications | Limitations |
|---|---|---|---|
| Restricted Boltzmann Machines (RBMs) [96] | Turkish social media contents [145], Arabic content [74] | Dimensionality reduction, classification, regression, feature learning, topic modeling, and collaborative filtering | Training is more difficult because it is difficult to calculate the energy gradient function, the CD-k algorithm used in RBM is not as well known as the backpropagation algorithm, weight adjustment |
| Gated Recurrent Units (GRU) [146] | Social Commentary [21], Facebook and Twitter aggressive speech [115], Bangla text [18], Formspring.me, MySpace and YouTube content [135] | Sequence learning, solved vanishing-exploding gradients problem | Slow convergence and low learning efficiency |
| Attention-based model [147] | Twitter bullied text identification [78], social media text analysis [112], online textual harassment detection [71], contextual textual bullies [148], Instagram bullied text identification [118], Abusive Bangla Comment detection [121], Trait-based bullying detection [114] | The method provides a simple and efficient architecture with a fixed-length vector to pay attention to a sentence's high-level meaning | The model requires more weight parameters, which results in a longer training time |

Performance Comparison of DL Models in Cyberbullying Detection
It is important to investigate the performance of a deep learning model for a classification problem such as cyberbullying detection. Training accuracy, validation accuracy, and learning curves are crucial for assessing the model during the training and testing phases, and early stopping uses the validation signal to decide when to halt training. Training loss and validation loss are also important measures of the performance of a deep learning model. After training, the model is usually evaluated on a held-out test dataset.
Four fundamental concepts are utilized to assess the performance of a model on a classification task: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). In a binary classification problem, TP refers to the cases where the model correctly identifies a positive instance, TN refers to the cases where the model correctly identifies a negative instance, FP refers to the cases where the model incorrectly identifies a negative instance as positive, and FN refers to the cases where the model incorrectly identifies a positive instance as negative.
In the case of cyberbullying classification, a TP is a case in which a model correctly recognizes a piece of content as cyberbullying, while a TN is a case in which a model correctly identifies non-cyberbullying content. On the other hand, an FP is a case where the model incorrectly identifies non-cyberbullying content as cyberbullying, while an FN is a case where the model incorrectly identifies cyberbullying content as non-cyberbullying.
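The four outcomes can be tallied directly from ground-truth labels and model predictions. The sketch below assumes binary labels where 1 marks cyberbullying and 0 marks non-cyberbullying content; the function name `confusion_counts` and the example labels are hypothetical.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int],
                     positive: int = 1) -> Dict[str, int]:
    """Count TP, TN, FP, and FN for a binary classification task."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # bullying content flagged as bullying
            else:
                fp += 1  # harmless content flagged as bullying
        else:
            if actual == positive:
                fn += 1  # bullying content that was missed
            else:
                tn += 1  # harmless content left alone
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Hypothetical labels for eight comments (1 = cyberbullying).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(counts)
```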
Based on the four fundamental concepts (TP, TN, FP, FN), accuracy, precision, recall, F1 score, the Matthews correlation coefficient (MCC), and the area under the receiver operating characteristic curve (AUC-ROC) are commonly used evaluation metrics for the classification task.
Accuracy: This metric measures the proportion of all predictions made by the model, both positive and negative, that are correct. A higher accuracy means that more instances have been correctly classified by the model.

\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \]

Precision: The precision metric calculates the ratio of true positives to true positives plus false positives. A higher precision means that more of the content flagged as cyberbullying is actually cyberbullying, i.e., the model produces fewer false positives.

\[ \text{Precision} = \frac{TP}{TP + FP} \]

Recall: This metric is the ratio of true positives to true positives plus false negatives. A higher recall means that the model finds more of the actual instances of cyberbullying, i.e., it produces fewer false negatives.

\[ \text{Recall} = \frac{TP}{TP + FN} \]

F1 score: This metric, which gives an overall assessment of the performance of the model, is the harmonic mean of precision and recall. A higher F1 score means that the model balances precision and recall well when classifying instances of cyberbullying.

\[ F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]

Matthews correlation coefficient (MCC): This is a balanced metric that penalizes false positives and false negatives while accounting for both true positives and true negatives. Model performance is better when the MCC is higher.

\[ \text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \]

Area Under the Receiver Operating Characteristic Curve (AUC-ROC): The ROC curve is a graphical representation of the performance of the model that plots the true positive rate against the false positive rate across decision thresholds. The AUC-ROC summarizes this trade-off in a single number, and a higher AUC-ROC indicates better model performance.
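The metrics above follow directly from the four counts. The following sketch computes them in pure Python under the standard definitions; the function names and example numbers are hypothetical. The AUC-ROC is computed with its pairwise-ranking interpretation (the fraction of positive/negative pairs in which the positive item receives the higher score), which is an implementation choice made here for brevity.

```python
import math
from typing import Dict, Sequence


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, F1, and MCC from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


def auc_roc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical counts and scores, for illustration only.
metrics = classification_metrics(tp=3, tn=3, fp=1, fn=1)
auc = auc_roc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
print(metrics, auc)
```

On imbalanced cyberbullying datasets, accuracy can look high even when recall is poor, which is one reason studies usually report F1 or MCC alongside it.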
Several studies have evaluated deep-learning-based cyberbullying detection using individual models and have reported model performance with popular evaluation metrics [21,134,149,150]. Raj et al. [149] compared different deep learning and traditional machine learning approaches for cyberbullying classification tasks. They utilized LSTM, Bi-LSTM, GRU, and Bi-GRU for the cyberbullying classification, and they employed the Wikipedia Attack Dataset and the Wikipedia Web Toxicity Dataset for this purpose. The experimental results show that Bi-LSTM and Bi-GRU outperform the other deep-learning-based models in this context on accuracy and F1 score. They obtained the best accuracy and F1 scores of 96.98% and 98.56%, respectively, by using Bi-GRU on the Wikipedia Attack Dataset. On the other hand, they achieved the best accuracy and F1 scores of 96.5% and 98.69%, respectively, using Bi-LSTM on the Wikipedia Web Toxicity Dataset. Bharti et al. [150] applied different deep learning models to a Twitter dataset and found that Bi-LSTM outperformed the other deep learning models with accuracy, precision, and F1 scores of 92.60%, 96.60%, and 94.20%, respectively. Iwendi et al. [21] used Bi-LSTM, GRU, LSTM, and RNN models for cyberbullying detection on the DISCo Kaggle dataset, where Bi-LSTM outperformed the other models with an accuracy score of 82.18%. Agarwal et al. [134] utilized Bi-LSTM with attention layers on the Wikipedia dataset and compared their study with some existing works. Experimental results show that their proposed method outperformed the existing works, with precision, recall, and F1 scores of 89%, 86%, and 88%, respectively.
In the literature, several studies also present the performance evaluation of different hybrid deep-learning-based approaches [66,115,151,152]. Alotaibi et al. [115] proposed a multichannel deep learning framework that combines a transformer block, Bi-GRU, and CNN to detect cyberbullying in Twitter comments; the combined model outperforms the individual models with an accuracy score of 88%. Bu et al. [66] applied a combined CNN and LRCN model to detect cyberbullying in SNS comments, where the model obtained AUC-ROC and accuracy scores of 88.54% and 87.22%, respectively. Murshed et al. [151] proposed a hybrid method named DEA-RNN that combines Elman-type Recurrent Neural Networks (RNN) with an optimized Dolphin Echolocation Algorithm (DEA) to optimize the training time when detecting cyberbullying on Twitter. They also applied Bi-LSTM and RNN models for the performance comparison, and the experimental results show that their proposed method outperformed these models, with accuracy, precision, recall, and F1 scores of 90.45%, 89.52%, 88.98%, and 89.25%, respectively. Similarly, Raj et al. [152] combined CNN and Bi-LSTM to classify cyberbullying in real-time posts on Twitter. For the experimental analysis, they employed two other combinations of deep learning models: CNN + Bi-GRU, and Bi-LSTM + Bi-GRU. Experimental results show that their proposed method outperformed the other two combinations with an accuracy score of 95%. Beniwal et al. [153] proposed a hybrid model that combines CNN and Bi-GRU to detect cyberbullying in the Kaggle Toxic comment classification dataset. The proposed model obtained the best accuracy and F1 scores of 98.39% and 79.91%, respectively.
Table 5 shows the performance comparison of deep-learning-based cyberbullying-detection systems on different datasets and indicates whether the model used in each study is a hybrid model. In addition, the best-performing model in each experiment is reported along with the scores of the performance metrics. According to the most recent studies in the area of cyberbullying detection, hybrid deep learning models show more promising results than individual deep learning models. This observation is supported by the growing understanding that classifying cyberbullying is a challenging task that calls for a combination of methods and approaches in order to produce accurate and trustworthy results. Individual models have limitations that are compensated for when they are combined in hybrid models. Hybrid models therefore show encouraging results in enhancing the overall effectiveness and stability of cyberbullying classification systems.

DL in Cyberbullying Detection
Several studies [21,64,115,134] have been conducted on the automatic identification of cyberbullying by using different independent ML techniques. A few studies [21,134] exploit RNN-based techniques, i.e., LSTM, BiLSTM, RNN, etc., while some other studies [5,137] use CNN-based techniques (i.e., CNN, PCNN, Char-CNNS, etc.) to detect cyberbullying from different sources. However, we also observe that some authors [63] integrate RNN- and CNN-based techniques. In this section, we briefly discuss different applications of DL models in cyberbullying detection. In Table 6, we have organized the papers based on three main themes: improvements of DL models, optimization of model performance, and improving data capabilities. Furthermore, we have included a comprehensive groupwise analysis of the significant contributions made by these papers, as well as their potential impact on future research directions.
Iwendi et al. [21] applied three DL models, Bi-LSTM, LSTM, and RNN, to investigate the performance of DL algorithms in identifying bullying (i.e., insults) in social media. After extensive testing, they found that Bi-LSTM outperforms the other models in terms of accuracy and F1 score. They also asserted that DL is the most effective method for detecting cyberbullying and related cyber challenges. Anindyati et al. [64] constructed a DL-based model employing three common text classification algorithms, LSTM, Bi-LSTM, and CNN, to detect bullying on Twitter in Indonesia.
Marwa et al. [2] applied a DL technique to a large human-labeled dataset to categorize cyberbullying. LSTM, Bi-LSTM, and CNN were the DL models employed in their tests and compared with other algorithms. Agarwal et al. [134] developed an RNN-based technique to identify and categorize cyberbullying posts. To decrease data imbalance and remove ambiguities in classification, they employed a Tomek Link approach for undersampling. Their classification model combined a Bi-LSTM network with max pooling and an attention layer. To test their model, they utilized Wikipedia datasets.
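Tomek-link undersampling, mentioned above, can be sketched in a few lines. A Tomek link is a pair of samples from different classes that are each other's nearest neighbor; removing the majority-class member of each pair cleans the class boundary. The sketch below is a generic illustration on hypothetical 2-D points, not the implementation used in [134]; for text, the points would be feature vectors such as embeddings.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor(index: int, points: Sequence[Point]) -> int:
    """Index of the closest other point (brute force)."""
    best, best_dist = -1, float("inf")
    for other, point in enumerate(points):
        if other == index:
            continue
        dist = squared_distance(points[index], point)
        if dist < best_dist:
            best, best_dist = other, dist
    return best


def tomek_links(points: Sequence[Point], labels: Sequence[int]) -> List[Tuple[int, int]]:
    """Pairs (i, j) that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


def undersample_majority(points: Sequence[Point], labels: Sequence[int],
                         majority_label: int):
    """Drop the majority-class member of every Tomek link."""
    drop = set()
    for i, j in tomek_links(points, labels):
        drop.add(i if labels[i] == majority_label else j)
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


# Hypothetical data: label 0 (non-bullying) is the majority class.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.4), (1.0, 1.0),
          (1.1, 1.0), (3.0, 3.0), (3.1, 3.2)]
labels = [0, 0, 0, 0, 1, 1, 1]

links = tomek_links(points, labels)
new_points, new_labels = undersample_majority(points, labels, majority_label=0)
print(links, new_labels)
```

Only the majority sample that sits right next to a minority sample is removed, so the method reduces ambiguity near the boundary rather than balancing the classes aggressively.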
Alotaibi et al. [115] proposed an automated approach to identify aggressive cyberbullying. The approach uses multichannel DL based on BiGRU, transformer blocks, and CNN models to determine whether a Twitter comment is hostile. They also integrated three well-known hate speech datasets to assess the model performance. Luo et al. [135] presented a BiGRU-CNN sentiment classification model for cyberbullying identification. The model consists of a BiGRU layer, an attention mechanism layer, a CNN layer, a fully connected layer, and a classification layer. They trained and tested their proposed model using the Kaggle text dataset and an emoji dataset scraped from social networks, and the model outperforms traditional algorithms. Lu et al. [5] presented the Char-CNNS (Character-level Convolutional Neural Network with Shortcuts) model for detecting cyberbullying in social media discourse. Since the content available on social media is short, noisy, and unstructured, with wrong spellings and symbols, they chose the character as the smallest unit of learning to overcome spelling mistakes and purposeful obfuscation. They conducted experiments on a Chinese Weibo dataset and an English Tweet dataset. The results of their experiments show that the performance of the model on the cyberbullying detection task is competitive with the state-of-the-art approaches. Zhang et al. [137] proposed a new pronunciation-based convolutional neural network (PCNN) to handle the difficulty of noise and distortion in social media postings and messages when detecting cyberbullying. They also used three strategies in their model to solve the problem of class imbalance: threshold-moving, cost function modifying, and a hybrid solution. Ahmed et al. [18] built a model to identify cyberbullying in Bangla and Romanized Bangla writings by using ML and DL methods. In their experiment, they discovered that, for one dataset, CNN (a DL algorithm) outperforms other ML and DL models, whereas ML models outperform DL models for the other two datasets.
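To see why the character is a useful unit of learning when words are misspelled or deliberately obfuscated, the sketch below compares a word with an obfuscated variant at the word level and at the character n-gram level. This is only an illustration of the motivation, not the Char-CNNS model itself; the example words and the helper names `char_ngrams` and `jaccard` are hypothetical.

```python
from typing import Set


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Set of character n-grams, with one space of padding on each side."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two sets, from 0 (disjoint) to 1 (identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


clean = "stupid"
obfuscated = "stup1d"  # one letter replaced by a digit

# A word-level vocabulary sees two unrelated tokens.
word_match = clean == obfuscated

# Character trigrams still share the prefix of the word.
shared = char_ngrams(clean) & char_ngrams(obfuscated)
char_overlap = jaccard(char_ngrams(clean), char_ngrams(obfuscated))
print(word_match, sorted(shared), round(char_overlap, 3))
```

A word-level model would treat the obfuscated token as unknown, whereas a character-level model still receives part of the original signal.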
Buan et al. [7] introduced a neural network design for cyberbullying detection that is based on an existing design, which combines convolution layers with LSTM layers. They also introduced a novel activation mechanism known as SVM-like activation, which is accomplished by using L2 weight regularization. They evaluated their proposed model on the bullying traces dataset to distinguish among open aggression, covert aggression, and non-aggression in social media writings. Gada et al. [63] proposed an LSTM-CNN model for text-based cyberbullying detection that captures sentence semantics. In addition, they developed a web application for their proposed model. Bu et al. [66] proposed an approach that combines two DL models, one of which is a character-level CNN and the other a word-level LRCN. The first model extracts low-level syntactic information from a character sequence and is noise-resistant. The second model, which works in tandem with the CNN model, gathers high-level semantic information from a sequence of words. They also demonstrated that their ensemble technique outperforms the state-of-the-art algorithms for detecting cyberbullying in comments on social networking sites.
Agrawal et al. [68] developed cyberbullying detection models based on DL. They used three real-world datasets to conduct comprehensive tests: Formspring (https://spring.me/, accessed on 18 April 2023), Twitter (https://github.com/zeeraktalat/hatespeech/, accessed on 18 April 2023), and Wikipedia (https://figshare.com/articles/dataset/Wikipedia_Talk_Corpus/4264973, accessed on 18 April 2023). The study gives some interesting insights into the detection of cyberbullying, such as that swear words alone are not sufficient for detecting cyberbullying. Powerful models for detecting cyberbullying are therefore not expected to depend on such handcrafted features. Dadvar et al. [25] found that DL-based models outperform traditional ML models. They used Wikipedia, Twitter, and Formspring datasets. Al-Ajlan et al. [67] proposed optimized Twitter cyberbullying detection based on DL (OCDD), a novel approach to cyberbullying detection. To preserve the meaning of the words, their approach encodes a tweet as a set of word vectors rather than extracting features from tweets and feeding them into a classifier. For the classification phase, they employed DL, and for parameter tuning, they used a metaheuristic optimization approach.
Golem et al. [136] explored classical ML, DL, and a combination of both approaches. To test their algorithms, they used data from Twitter and Facebook (https://sites.google.com/view/trac1/shared-task, accessed on 18 April 2023). They ensembled classic ML and DL algorithms by using a voting mechanism. Yadav et al. [73] proposed a strategy to identify cyberbullying on social media platforms that improves on earlier findings by combining a pre-trained BERT model with a single linear neural network layer as a classifier. Their algorithm is trained and tested on two manually labeled social media datasets, Formspring (a Q&A forum) and Wikipedia, using a consolidated DL approach. Paul et al. [22] demonstrate a novel use of BERT for detecting cyberbullying. They claim that a simple classification model based on BERT can obtain state-of-the-art results on the three real-world corpora of Formspring, Twitter, and Wikipedia. They found that their model outperforms prior studies that use slot-gated or attention-based Deep Neural Network models. Paul et al. [6] proposed a DL-based early identification framework that predicts whether a post is bully or non-bully, and they analyzed data for each of the modalities (both separately and fusion-based). They report strong performance for the framework.
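A voting mechanism of the kind mentioned above can be sketched as hard majority voting over the labels predicted by several classifiers. This is a generic illustration, not the exact scheme of [136]; the three prediction lists stand in for hypothetical classical ML and DL models.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions_per_model: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-model label predictions by taking the most common label.

    An odd number of models avoids ties in binary classification.
    """
    n_items = len(predictions_per_model[0])
    if any(len(p) != n_items for p in predictions_per_model):
        raise ValueError("all models must predict the same number of items")
    combined = []
    for votes in zip(*predictions_per_model):
        label, _count = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined


# Hypothetical predictions for five comments (1 = cyberbullying).
svm_pred = [1, 0, 1, 0, 1]   # stand-in for a classical ML model
cnn_pred = [1, 1, 0, 0, 1]   # stand-in for a CNN
lstm_pred = [0, 1, 1, 0, 0]  # stand-in for an LSTM

ensemble_pred = majority_vote([svm_pred, cnn_pred, lstm_pred])
print(ensemble_pred)
```

Each model is wrong on different comments in this toy example, so the vote can correct individual errors, which is the usual motivation for combining classical and deep models.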
A DL algorithm needs to learn patterns from data and therefore requires a huge amount of data. The performance of a DL model improves when it has a large amount of training data; otherwise, it does not perform as well. DL models can also learn more complex, non-linear functions. In addition, they reduce the burden of feature engineering, as feature extraction is performed by the DL algorithm itself. DL models perform well on complicated problems such as natural language processing, speech recognition, and image classification.

Deep Learning Frameworks
DL frameworks provide a high-level programming interface with building blocks for designing, training, and validating Deep Neural Networks. Note that DL frameworks are shown as the right-most branch of our taxonomy in Figure 2. In Table 7, we briefly explain 13 different DL frameworks, their strengths, and their limitations. In addition, we list the DL algorithms supported by these frameworks. Moreover, we specify the usage of these frameworks in the classification of cyberbullying.
Wide range of models including CNN, RNN, GAN, Transformer, etc. [155] Chats and Tweets [14], Bangla Text [18], Offline Content [129], Social Media text analysis [112], Comments and Toxicity [156], Multilingual Tweets and Hate speech [157], Wikipedia talk page [158], Post of Social Network platform Gab [159,160]
Twitter [2,75,115,162], Bully, Sentiment, Emotion and Sarcasm from Twitter and Reddit [124], Social media content [68,115,163], Twitter and Wikipedia Chats and Tweets [14], Wikipedia, Twitter, Formspring and YouTube [25], Social networks' text and image [25], online textual harassment [71]
Torch/PyTorch: Social Network platform Gab [159], Twitter, Wikipedia, Formspring [22], Harmful meme of COVID-19 [166], Memes of US politics [167], Image from online [168], Cyberbert: BERT for cyberbullying identification [22], Social media content [73]
Theano: DL models that are used in NLP tasks [175]. No Works Found. May not be the best choice for small-scale projects.
CNNs and RNNs with a focus on distributed training and image processing [181].

Applicability of Different DL Frameworks
We examined different DL frameworks to assess their suitability for cyberbullying detection and prediction problems. Some frameworks are flexible and easy to use, while others allow deep-learning-based prototypes to be implemented quickly. Some frameworks make it fast to deploy different machine learning models, while others offer high processing speed.
These frameworks provide easy access to their respective libraries for defining layers, network types (CNNs, RNNs), and standard model designs for image or natural language processing. We find that several studies used TensorFlow [14,129,182,183] in their experiments because the framework is flexible, easy to use, and suits natural language processing tasks. A rich collection of libraries makes TensorFlow easy to use for researchers in natural language processing and cyberbullying detection. The TensorFlow interface is not difficult to understand, and the framework remains approachable for beginners.
Keras is also a popular framework [20,25,171,184] for cyberbullying detection, since it provides a simple interface for quick prototyping of neural networks that run on top of TensorFlow. Another widely used framework for cyberbullying detection [19,73,168,185] is PyTorch, which is based on the Torch [186] library for the Lua programming language. PyTorch is fast and more Pythonic than the other frameworks.
Theano is another heavily used framework in cyberbullying-related tasks [137,170,171] due to its quick processing speed. We have also observed other research that examined cyberbullying by using different frameworks, such as Caffe [187], Deeplearning4j [188], and MXNet [178]. Furthermore, other well-known frameworks such as Chainer, DyNet, Lasagne, and H2O can be used for handling large amounts of data, although researchers hardly use these frameworks for cyberbullying detection problems in the current literature.

Datasets for Experiments
Researchers have conducted several studies to identify cyberbullying over the years. People may encounter cyberbullying through different forms of content, such as text, images, collages, memes, and others. In Table 8, we present datasets that are relevant to cyberbullying, together with the DL architectures and tasks for which previous studies have used them. Most datasets are collected from social media, i.e., Twitter, YouTube, and Wikipedia. Users interact on social media much as they do in real-life society. Thus, they might experience different behavior from others, including bullying. On the other hand, we find only a limited number of cyberbullying datasets that combine images and text.

Challenges, Open Issues, and Future Trends
Detecting cyberbullying is a problem connected with human psychology and with how an individual reacts emotionally to content, which depends on different factors (i.e., image, emotion, culture, etc.). In the following subsections, we discuss different challenges and open issues in cyberbullying detection with DL.

• Requirement for large datasets: Large volumes of labeled data are required for DL. For example, the creation of self-driving cars involves millions of photos and hundreds of hours of video [198]. It is commonly known that data preparation consumes 80-90% of the time spent on ML development. Furthermore, even the strongest DL algorithms struggle without good data and perform poorly when they are trained on biased and unclean data [199].
• High computational power: DL takes a lot of computational power. The parallel design of high-performance GPUs is ideal for DL. When used in conjunction with clusters or cloud computing, this allows development teams to cut the training time of DL networks from weeks to hours or less [198].
• Unexplainable reasoning behind predictions: DL models behave as black boxes when they produce predictions. Since the hidden weights and activations of a DL model are non-interpretable, its predictions are considered non-explainable [200].
• In the present world, data are very dynamic. Data change due to various factors, such as location and time, which may themselves be constantly changing. However, DL models are built using a fixed set, which is called the training dataset. The performance of the model is then measured on data that come from the same distribution as the training data, and the model performs well. Later, the same model may start performing poorly because the characteristics of the data change; the new data are not entirely different, but they vary from the training data. Retraining old models to keep up with such changes is difficult to manage in DL.
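One simple way to notice that incoming text has drifted away from the training data is to track the share of tokens that never appeared during training. The sketch below is a rough heuristic under the assumption of whitespace tokenization, not a full drift-detection method; the example sentences and function names are hypothetical.

```python
from typing import Iterable, Set


def vocabulary(texts: Iterable[str]) -> Set[str]:
    """Lower-cased whitespace tokens seen in a collection of texts."""
    return {token for text in texts for token in text.lower().split()}


def oov_rate(train_texts: Iterable[str], new_texts: Iterable[str]) -> float:
    """Fraction of tokens in new_texts that are absent from the training vocabulary."""
    vocab = vocabulary(train_texts)
    tokens = [token for text in new_texts for token in text.lower().split()]
    if not tokens:
        return 0.0
    unseen = sum(1 for token in tokens if token not in vocab)
    return unseen / len(tokens)


# Hypothetical training data and later incoming data.
train = ["you are a loser", "nobody likes you"]
incoming = ["you are such a clown", "nobody likes you"]

rate = oov_rate(train, incoming)
print(rate)
```

A rising out-of-vocabulary rate over time suggests that slang or topics have shifted, which can be used as a trigger for collecting fresh labels and retraining.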

Challenges in Cyberbullying detection
• Cultural diversity for cyberbullying: Language is one of the important parts of the culture of a nation. Since cyberbullying has become a common problem among different nations, we may not expect good predictions from a model that is trained on a dataset from one nation and tested on a dataset from another, culturally different nation.
• Language challenge: Capturing context and analyzing the sentiment of different types of sentences is a difficult task for cyberbullying detection. For example, "The image that you have sent so irritated me and I would rather not contact with you any longer!" is not easy to identify as cyberbullying without examining the reasoning behind it, even though a model detects negative sentiment [26].
• Dataset challenge: Retrieving data from social media is not an easy task, as it relates to private information. Moreover, social media sites do not share user data publicly. Due to these issues, it is hard to gather quality data from social sites, and the lack of quality data limits learning. Another challenging task is to annotate or label the data, because a domain expert is required to label the corpus [202].
• Data representation challenge: Setting up an effective cyberbullying-detection system is difficult due to the need for human interaction and the nature of cyberbullying, which is itself challenging to identify. The vast majority of the exploratory works directly identified bullying words in social media. However, extracting content-based features has its own difficulties. In the absence of appropriate information, the performance of the model may degrade [203].
• Natural Language Processing (NLP) challenges: The biggest challenge in natural language processing is understanding the meaning of the text. The relevant task is to build the right vocabulary, link the various components of the vocabulary, establish context, and extract semantic meaning from the data [204]. Misspellings and ambiguous expressions are other challenges that are very difficult for a machine to solve.

• Reusability of pre-trained models for sentiment analysis and cyberbullying: Although cyberbullying detection and sentiment analysis are related tasks, they differ significantly from each other; therefore, a model pre-trained for one task is likely to be difficult to use for the other. Sentiment analysis involves determining the overall emotional tone of a text, i.e., whether the sentence is positive, negative, or neutral. In contrast, cyberbullying detection involves identifying specific patterns of harmful words. Nevertheless, some sentiment analysis approaches can be used to identify cyberbullying. Atoum et al. [205] proposed an approach for detecting cyberbullying using sentiment analysis techniques. Nahar et al. [206] presented a novel method for identifying online bullying on social media sites from sentiment analysis. Dani et al. [207] presented a novel framework for supervised learning that uses sentiment analysis to identify cyberbullying. Overall, while sentiment analysis models may be helpful for cyberbullying detection, they cannot be directly reused without significant modifications and additional training. Cyberbullying detection (i.e., yes/no classes) largely needs to identify negative words that are used to harass a person, while sentiment analysis has three different classes (i.e., negative, positive, and neutral), where negative patterns are only part of the problem and the positive and neutral categories are also dominant class labels. Since the nature of the outputs differs between the two problems, we cannot completely reuse a model pre-trained for one of them on the other.

Future Trends
The challenges and open issues of a technology often reveal opportunities for further research. There are many avenues for turning the issues above into concrete research. Here, we discuss a few possible directions as future trends.

• Multilingual and multimedia content: Social media and other virtual platforms are now widely used by users who differ in age group, culture, language, taste, education, etc. Since social media is a major channel for propagating cyber harassment and users may post multilingual and multimedia content, more attention should be paid to building efficient cyberbullying detection systems for such content.

• Cyberbullying detection-specific word embedding: Researchers have recently introduced different domain-specific word-embedding techniques, because these embeddings produce accurate results for the relevant vocabularies. For example, Med-BERT provides BERT-based embeddings for the health domain. In this connection, researchers may propose a specialized word-embedding system for cyberbullying detection problems.

• Cyberbullying detection in SMS and email: Current efforts to combat cyberbullying concentrate on social media platforms, through which it largely propagates. However, future researchers may pay more attention to Short Message Service (SMS)- and email-based cyberbullying detection methods.

• Cyberbullying impact on mental health: Cyberbullying may leave a long-term impact on the mental state of an individual. Some victims may take life-threatening steps or resort to self-injury to escape the severity of the harassment. Therefore, mental health researchers can consider this issue a timely topic and introduce different methods to fight against cyber harassment.
• Use of cutting-edge deep learning: With the advancement of deep-learning-based methods, we may introduce more subtle and delicate techniques to detect cyberbullying. For example, stacked and multi-channel CNN or Bi-LSTM-based cyberbullying detection frameworks, their advanced versions, or hybridizations of these models may produce more sophisticated solutions to counter the problem.

Conclusions
Cyberbullying is a kind of harassment carried out with digital technologies, and it may take place on smartphones, social media sites, messaging applications, etc. The targeted individuals are likely to be distressed by the repeated, angering, and shaming behavior of rogue users. This can affect the victim mentally and physically and may lead to severe trauma or mental disorder. In this study, we have thoroughly investigated existing studies on cyberbullying detection that are based on DL techniques. We also conducted a holistic review to identify the strengths and future directions of these works. Future researchers will benefit from this timely review, since it brings together the existing datasets, the research challenges, and the open issues in this area.
We plan to thoroughly investigate hybrid deep learning models for cyberbullying detection in the future. This paper has explored research on the identification of cyberbullying in texts and images; however, work on classifying cyberbullying in speech, videos, or deep fakes is hard to find. In addition, we are interested in performing an extensive analysis of the personal characteristics (e.g., personality, values, etc.) of online users. In the literature, we could not find significant research on the association between cyberbullying behavior and perpetrators' mental health issues, which could be an interesting line of research. Additionally, a review of recommender systems could benefit future research in this area, because it may help in recognizing patterns in cyberbullying. This research could also be connected with link prediction, because observing a user's day-to-day online behavior would allow the user to be monitored well ahead of time, before he/she turns into a bully. There are several domain-specific word-embedding models in the literature (e.g., Med-BERT for the health domain). We suggest that future researchers consider a cyberbullying-specific BERT, so that the pre-trained model can more easily predict bullying behavior online.

Figure 3. Organization of the paper.

Figure 4. Online resources inclusion and exclusion process flowchart.

Figure 7. Capturing context by BERT of two sentences.

Table 1. Comparison of our survey with existing surveys (addressed: , not addressed: , not applicable: N/A).

Table 2. Comparison of methodology with existing surveys.

Figure 9. Input representation of the BERT model. The input embeddings are the sum of the token embeddings, the segmentation embeddings, and the position embeddings.
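The summation described in the Figure 9 caption can be illustrated with a minimal sketch in plain Python. It is not the BERT implementation: the table sizes, the random values standing in for learned weights, and the names `make_table` and `bert_input_embeddings` are all assumptions chosen for illustration.

```python
import random


def make_table(rows, dim, rng):
    """Build a rows x dim lookup table of small random values.

    The random values are a stand-in for learned embedding weights.
    """
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


def bert_input_embeddings(token_ids, segment_ids,
                          token_table, segment_table, position_table):
    """Return one vector per position: token + segment + position embedding."""
    if len(token_ids) != len(segment_ids):
        raise ValueError("token_ids and segment_ids must have the same length")
    if len(token_ids) > len(position_table):
        raise ValueError("sequence is longer than the position table")
    output = []
    for position, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        # Element-wise sum of the three looked-up vectors.
        vec = [t + s + p for t, s, p in zip(token_table[tok],
                                            segment_table[seg],
                                            position_table[position])]
        output.append(vec)
    return output


# Toy sizes (hypothetical): 10-token vocabulary, 2 segments, 8 positions, 4 dims.
rng = random.Random(0)
DIM = 4
token_table = make_table(10, DIM, rng)
segment_table = make_table(2, DIM, rng)
position_table = make_table(8, DIM, rng)

# Hypothetical ids for a two-sentence input; token id 2 plays the separator.
token_ids = [1, 5, 6, 2, 7, 2]
segment_ids = [0, 0, 0, 0, 1, 1]

embeddings = bert_input_embeddings(token_ids, segment_ids,
                                   token_table, segment_table, position_table)
print(len(embeddings), len(embeddings[0]))
```

In this sketch the same token id (2) appears at positions 3 and 5 but receives different input vectors, because its segment and position embeddings differ. Practical BERT implementations additionally apply layer normalization and dropout to the summed vectors, which is omitted here.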

Table 5. Performance comparison of deep learning models on different datasets.

Table 6. Major contributions and prospective future works of cyberbullying detection.

Table 7. Strengths, limitations, suitability of DL algorithms, and application in cyberbullying of DL frameworks.
•
Security issue: Protecting DL models from security attacks is one of the biggest challenges nowadays. Security attacks fall into two types according to when they occur. A poisoning attack occurs during the training period, whereas an evasion attack occurs during inference (after training). Poisoning attacks compromise the training process by corrupting the data with malicious examples. Evasion attacks, on the other hand, use adversarial examples to confuse the classification process [201].

•
Models are not adaptive: