An Assessment of Deep Learning Models and Word Embeddings for Toxicity Detection within Online Textual Comments

Abstract: Today, increasing numbers of people are interacting online, and a lot of textual comments are being produced due to the explosion of online communication. However, a paramount inconvenience within online environments is that comments shared on digital platforms can hide hazards, such as fake news, insults, harassment, and, more generally, comments that may hurt someone's feelings. In this scenario, the detection of this kind of toxicity plays an important role in moderating online communication. Deep learning technologies have recently delivered impressive performance within Natural Language Processing applications, encompassing Sentiment Analysis and emotion detection, across numerous datasets. Such models do not need any pre-defined hand-picked features; instead, they learn sophisticated features from the input datasets by themselves. In this domain, word embeddings have been widely used as a way of representing words in Sentiment Analysis tasks, proving to be very effective. Therefore, in this paper, we investigate the use of deep learning and word embeddings to detect six different types of toxicity within online comments. In doing so, the most suitable deep learning layers and state-of-the-art word embeddings for identifying toxicity are evaluated. The results suggest that Long Short-Term Memory layers in combination with mimicked word embeddings are a good choice for this task.


Introduction
In recent years, short text information has continuously been created due to the explosion of online communication, social networks, and e-commerce platforms. Through these systems, people can interact with each other, express opinions, engage in discussions, and receive feedback about any topic. However, a paramount inconvenience within online environments is that text spread by digital platforms can hide hazards, such as fake news, insults, harassment, and, more generally, comments that may hurt someone's feelings. These comments can be considered the digital version of personal attacks (e.g., bullying behaviors) that can cause social problems (e.g., racism), and they are felt to be dangerous and critical by people who are struggling to prevent and avoid them. The risk of such a phenomenon has increased with the advent of social networks and, more generally, of online communication platforms (https://medium.com/analytics-vidhya/twittertoxicity-detector-using-tensorflow-js-1140e5ab57ee, accessed on 11 February 2021). An attempt to deal with this issue is the introduction of crowdsourcing voting schemes that give users the possibility to report inappropriate comments in online environments. Among many others, Facebook, for example, allows its users to report a post for violence or hate speech [1]. This scheme allows Facebook to identify fake accounts, offensive comments, etc. However, these methodologies are often inefficient, as they fail to detect toxic comments in real time [2], which has become a requirement within social network communities. A toxic post might have been published online much earlier than the time it is reported and, during the time it is online, it might cause problems and offenses to several users, who might react with undesired behaviors (e.g., leaving the underlying social platform).
Therefore, detecting toxicity within textual comments through novel technologies has great relevance for the prevention of adverse social effects in a timely and appropriate manner within online environments [3].
In recent years, the use of data for extracting meaningful information to interpret people's opinions and sentiments in relation to various topics has taken hold. Today, textual online data are parsed to predict ratings about online courses [4], sentiments associated with companies and stocks within the financial domain [5] and, recently, with healthcare [6], and toxicity in online platforms [7]. All of these approaches fall within the Sentiment Analysis research topic, which classifies data into positive or negative classes and includes several subtasks, such as emotion detection, aspect-based polarity detection [8], etc. To detect such knowledge, supervised Machine Learning-based systems are designed and provided by the research community to support and improve online services that mine and use this information. Training data are required to employ supervised Machine Learning-based tools; however, the amount of labeled data might be insufficient, making the design of these tools challenging. This issue is exacerbated by the spread of Neural Networks and deep learning models, which can reproduce cognitive functions and mimic skills that are typically performed by the human brain, but need large amounts of data to be trained. Over time, interest in these technologies, as well as in their use for the identification of various kinds of toxicity within textual documents, has grown [1].
Word embeddings are one of the cornerstones of representing textual data to feed Machine Learning tools. They are representations of words mapped to vectors of real numbers. The first word embedding model (Word2Vec) utilizing Neural Networks was published in 2013 [9] by researchers at Google. Since then, word embeddings have been encountered in almost every Natural Language Processing (NLP) model used in practice today. The reason for such mass adoption is their effectiveness. By translating a word into an embedding, it becomes possible to model the semantic importance of the word in a numeric form and, thus, perform mathematical operations on it. In 2018, researchers at Google proposed the Bidirectional Encoder Representations from Transformers (BERT) [10], a deeply bidirectional, unsupervised language representation able to create word embeddings that represent the semantics of words in the context they are used in. On the contrary, context-free models (e.g., Word2Vec) generate a single word embedding representation for each word in the vocabulary, independently of the word's context.
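As a minimal illustration of the kind of mathematical operations that embeddings enable, consider measuring relatedness between words via cosine similarity. The vectors and their dimensionality below are invented for the example; real embeddings have hundreds of dimensions learned from large corpora.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional embeddings (illustrative values only).
embeddings = {
    "capture": [0.9, 0.1, 0.3],
    "catch":   [0.8, 0.2, 0.3],
    "bank":    [0.1, 0.9, 0.5],
}

# Semantically related words end up with similar vectors.
assert cosine_similarity(embeddings["capture"], embeddings["catch"]) > \
       cosine_similarity(embeddings["capture"], embeddings["bank"])
```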
Within this scenario, in this paper, various deep learning models fed by word embeddings are designed and evaluated to recognize toxicity levels within textual comments. In detail, four deep learning models built using the Keras (https://keras.io/, accessed on 11 February 2021) framework are designed, and four different types of word embeddings are analyzed.
To this aim, the current state-of-the-art toxicity dataset released during the Kaggle challenge on toxic comments (https://www.kaggle.com/c/jigsaw-toxic-commentclassification-challenge/overview, accessed on 11 February 2021) is used.
Note that this paper analyzes the performance of deep learning and classical Machine Learning approaches (using tf-idf and word embeddings) when tackling the task of toxicity detection. Basically, we want to assess whether the syntactic and semantic information lying within the text can provide hints on the presence of certain toxicity classes. This is not possible in some domains and tasks: for example, for the problem of identifying empathetic vs. non-empathetic discussion within the answers of a therapist during motivational interviews, it has been observed that syntactic and semantic information do not provide any clue for the classification task, leading to very low accuracies [11].
Thus, for a fair analysis, it is important that the dataset does not contain any class imbalance. Machine Learning classifiers fail to cope with imbalanced training datasets, as they are sensitive to the proportions of the different classes [12]. As a consequence, these algorithms tend to favor the class with the largest proportion of observations, which may lead to misleading accuracies. That is why we preprocessed the mentioned dataset to make it balanced and then applied a 10-fold cross-validation to tackle the proposed task.
Thus, this paper provides the following contributions:
• We analyze four deep learning models based on Dense, Convolutional Neural Network (CNN), and Long Short-Term Memory (LSTM) layers to detect various levels of toxicity within online textual comments.
• We evaluate the use of four word embedding representations based on the Word2Vec [9,13] and Bidirectional Encoder Representations from Transformers (BERT) [10] algorithms for the task of toxicity detection in online textual comments.
• We provide a comparison of the deep learning models against common baselines used within classification tasks of textual resources.
• We release a contextual word embedding resource trained on a dataset that includes toxic comments.
The source code that is used for this study is freely available through a GitHub repository (https://github.com/danilo-dessi/toxicity, accessed on 11 February 2021).
The remainder of this paper is organized, as follows. Section 2 includes a literature review and discusses current methods for toxicity detection in textual resources. Section 3 formalizes the problem. Section 4 describes the word embeddings and deep learning models adopted in this research work. Research results and their discussion are reported in Section 5. Finally, Section 6 concludes the paper and illustrates future directions to further tackle the detection of toxic comments.

Related Work
A few past works have already addressed the challenge of detecting toxicity within textual comments left by users within online environments. Generally, they rely on Sentiment Analysis methods [14][15][16][17][18][19] to detect and extract subjective information and classify emotions and sentiments to determine whether a toxicity facet is present or not. For doing so, NLP, Machine Learning, Text Mining, and Computational Linguistics are the most prominent technologies employed [20,21]. Sentiment Analysis methods, like many others within the Machine Learning domain, can be mainly split into two categories, i.e., supervised and unsupervised. Supervised techniques require the use of labeled data (training set) to train a model that can be applied to unseen data in order to predict a sentiment or an emotion [22][23][24]. These methods are often limited by the lack of labeled data, or by the fact that there are not good or sufficient examples for certain categories (e.g., in the case of dataset imbalance) [25]. On the other hand, unsupervised Sentiment Analysis approaches usually rely on semantic resources, like lexicons, where words are assigned scores reflecting their relevance for target categories, to infer the sentiments and emotions of the input data [26][27][28]. Supervised and unsupervised approaches are both largely explored in the literature for Sentiment Analysis tasks, which include sentiment polarity detection (i.e., identifying whether a certain text is either positive or negative) [29], figurative-language uncovering (understanding whether the input text is figurative or objective) [21,30], aspect-based polarity detection (e.g., assigning sentiment polarity to features of a certain topic, such as the screen of an iPhone) [31,32], sentiment score prediction (e.g., assigning a continuous number in [−1, 1] to a certain topic or text) [4], and so on.
However, these methodologies have only recently been explored for toxicity detection [33], although the need to monitor online communications to identify toxicity and make the communications safe and respectful is an old and still open issue. Hence, the gap between the current methodologies and their potential use within toxicity detection remains an open challenge. Therefore, dealing with toxicity raises new challenges and research opportunities where deep learning-based approaches for Sentiment Analysis can have a relevant role in making advancements for the identification of toxicity levels.
Additionally, Semantic Web technologies are being used within Sentiment Analysis tasks. It has been proved that they bring several benefits, leading to higher accuracy [34]. For example, the use of sentiment-based technologies to detect toxicity is investigated in [35]. However, the use of word embedding representation is not taken into account. A work worth noting is [21], where the authors analyzed the problem of figurative language detection in social media. More in detail, they focused on the use of semantic features that were extracted with Framester for identifying irony and sarcasm. Semantic features have been extracted to enrich the representation of input tweets with event information using frames and word senses in addition to lexical units. Sentilo [36,37] represents one more example of an unsupervised method that exploits Semantic Web technologies. Given a statement expressing an opinion, Sentilo recognizes its holder, detects its related topics and subtopics, links them to relevant situations and events referred to by it, and evaluates the sentiment expressed on each topic/subtopic. Moreover, Sentilo is domain-independent and it relies on a novel lexical resource, which enables a proper propagation of the sentiment scores from topics to subtopics. Its output is represented as an RDF graph and, where applicable, it resolves holders' and topics' identity on Linked Data.
Recently, the authors in [33] discussed the problem of toxicity detection and proved that context can either amplify or mitigate the perceived toxicity of posts. Besides, they found no evidence that context actually improves the performance of toxicity classifiers. In another work [38], the authors presented an interactive tool for auditing toxicity detection models by visualizing explanations for predictions and providing alternative wordings for detected toxic speech. In particular, they displayed the attention of toxicity detection models on user input, providing suggestions on how to replace sensitive text with less toxic words.
The authors of [39] tackled the problem of identifying disguised offensive language, such as adversarial attacks that avoid known toxic patterns and lexicons. To do so, they proposed a framework to fortify existing toxic speech detectors without a large labeled corpus of veiled toxicity. In particular, they augmented the toxic speech detector's training data with newly discovered offensive examples.
Deep learning technologies have been leveraged by the authors in [40] to tackle the problem of toxic comment detection. In more detail, the authors introduced two state-of-the-art neural network architectures and demonstrated how to employ a contextual language representation model.
One more work that deals with a sentiment toxicity detection problem is [7], where the authors adopt both pre-trained word embeddings and close-domain word embeddings previously trained on a large dataset of users' comments [41].
However, their approach is based on a Logistic Regression (LR) classifier and it does not use state-of-the-art deep learning technologies. Well-established methodologies (e.g., k-nearest neighbors (kNN), Naive Bayes (NB), Support Vector Machines (SVM), etc.) are today outperformed on the same tasks by CNN-based models [42].
One more work for toxicity detection is proposed by the authors in [43], and it lies within the context of multiplayer online games. There, social interactions are an essential feature for a growing number of players worldwide. This interaction might bring undesired and unintended behavior, especially if the game is designed to be highly competitive. They defined toxicity as the use of profane language by one player to insult or humiliate another player in the same team. Given the specific domain, the use of bad words is a necessary, but not sufficient, condition for toxicity, as they can be used to curse without the intent to offend anyone. The authors looked at the 100 most frequently used n-grams for n = 1, 2, 3, 4 and manually determined which of them are toxic. With such training data, they used an SVM to predict each team's odds of winning, based on its communication, while the match was still in progress.
Another work that embraces both deep learning and word embeddings for toxicity detection is reported in [1], where FastText (https://fasttext.cc/docs/en/english-vectors.html, accessed on 11 February 2021) pre-trained embeddings are used to feed four different deep learning models based on CNN, Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU) layers. However, the experiments show weak results, probably due to class imbalance. Conversely, in this work, deep learning models are trained on a balanced dataset, considering one toxicity class at a time, and the input texts are better represented by using word embeddings tuned to the target domain. More precisely, in one set of experiments, domain-generated word embeddings are created through mimicking techniques; this makes it possible to handle slang, misspellings, or obfuscated content not represented within pre-trained word embedding representations [24,44]. Besides the Word2Vec embeddings, state-of-the-art word embeddings, namely BERT [10,45,46], are used to tune the vectors to the context where words are used.

Problem Formulation
The problem faced in this paper is a multi-class, multi-label classification problem. We turned it into several binary single-label classification problems. More precisely, given a textual comment c and a toxicity facet t, the approach aims to build a deep learning model that predicts a binary label l, which can only assume values in {0, 1} and indicates whether the toxicity t is present in c (i.e., l takes the value 1) or not (i.e., l takes the value 0). Therefore, with such an approach, an independent binary classifier is trained for each toxicity label. Given an unseen sample, each binary classifier predicts whether the underlying toxicity is present or not in the sample. The combined model then predicts, for this sample, all of the labels for which the respective classifier predicts a positive result. Although this method of dividing the task into multiple binary tasks may superficially resemble the one-vs-all and one-vs-rest methods for multi-class classification, it is essentially different from them, because a single binary classifier deals with a single label without any regard to other labels whatsoever. This means that each binary classification task that we formulated does not benefit from the information of the other labels at training time. However, this mapping is straightforward and it does not change the semantics of the input problem [47]. By building these models for various t, the performances of the proposed solutions are evaluated with the goal of finding which combination of deep learning layers and word embeddings can better capture the text peculiarities for toxicity detection.
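The decomposition into independent binary classifiers can be sketched as follows. The keyword-overlap "classifier" below is a hypothetical stand-in for the actual deep learning models, used only to show how the per-label binary predictions are combined into a multi-label output.

```python
def train_binary_classifier(pairs):
    """Stand-in for training on pairs (comment, label), label in {0, 1}:
    it simply remembers words seen only in positive comments."""
    pos_words, neg_words = set(), set()
    for comment, label in pairs:
        (pos_words if label == 1 else neg_words).update(comment.lower().split())
    cues = pos_words - neg_words
    return lambda comment: int(bool(cues & set(comment.lower().split())))

def train_multilabel(datasets):
    """One independent binary classifier per toxicity facet t.
    datasets: dict mapping facet name -> list of (comment, binary label)."""
    return {t: train_binary_classifier(pairs) for t, pairs in datasets.items()}

def predict_all(classifiers, comment):
    """Combined model: output every label whose binary classifier fires."""
    return [t for t, clf in classifiers.items() if clf(comment) == 1]

# Toy per-facet training sets (invented examples).
datasets = {
    "insult": [("stupid idiot", 1), ("nice work", 0)],
    "threat": [("i will hurt him", 1), ("nice work", 0)],
}
models = train_multilabel(datasets)
```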

The Proposed Approach
In this section, we will describe the deep learning models and word embedding representations for representing the text expressing the various toxicity categories.

Preprocessing
Text preprocessing techniques, such as stop word and punctuation removal, lemmatization, stemming, matching words with a dictionary to correct grammar, removing words containing alpha-numeric characters, and so on, are common practices when Machine Learning algorithms are applied [48,49], and text representations are generated as a result of different feature engineering processes. However, with the introduction of deep learning approaches, these techniques have not shown promising results. The reason is that neural networks learn from any element found within the text, because each token contributes to the sentence semantics. Therefore, although certain terms might be included in existing stop word lists, they are maintained because they can enrich the semantics of the text content and improve the performance of the deep learning model [1]. Hence, as suggested by the authors in [1], all of the above-mentioned preprocessing steps are ignored; only the conversion of texts to lower case is performed. Afterward, the whole set of input texts is ready to feed a deep learning model. More precisely, suppose we have a toxicity target class t and a set of pairs P = {(c_0, l_0), ..., (c_n, l_n)}, where c_i is a textual comment and l_i is a binary label that can only take the value 0, if the comment c_i does not include the toxicity t, or the value 1, if the comment c_i expresses some level of toxicity t. From the set P, the set P' = {(c'_0, l_0), ..., (c'_n, l_n)} is derived, where each comment c'_i is an integer-encoded version of the original c_i. In detail, let W be the list of all the words that belong to all textual comments, and WS the set of all the words in W without duplicates (i.e., WS has only one occurrence of each input word, whereas W can contain multiple occurrences of the same element). Subsequently, two functions θ and φ, which map the elements in W and WS to unique integer values, respectively, are built.
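The two encoding functions can be sketched in a few lines of Python. The concrete integers assigned are arbitrary (any one-to-one mapping works); here φ numbers distinct words in order of first occurrence, while θ gives every occurrence its own integer.

```python
def phi_encode(tokens):
    """phi: one integer per distinct word in WS (used with Word2Vec,
    where a word has a single embedding regardless of context)."""
    mapping, encoded = {}, []
    for tok in tokens:
        if tok not in mapping:
            mapping[tok] = len(mapping) + 1   # arbitrary unique integer
        encoded.append(mapping[tok])
    return encoded, mapping

def theta_encode(tokens):
    """theta: a fresh integer for every occurrence in W (used with BERT,
    whose embedding for a word depends on its context)."""
    return list(range(1, len(tokens) + 1))

sentence = "you both shut up or you both die".split()
phi_ids, _ = phi_encode(sentence)
theta_ids = theta_encode(sentence)
```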
For example, consider the sentence you both shut up or you both die and suppose we have the toy functions θ_toy and φ_toy. The function θ_toy maps "you" to "7", "both" to "43", "shut" to "22", "up" to "76", "or" to "10", the second "you" to "3", the second "both" to "41", and "die" to "50". The function φ_toy maps "you" to "7", "both" to "43", "shut" to "22", "up" to "76", "or" to "10", and "die" to "50". Subsequently, the integer-encoded sentence is [7, 43, 22, 76, 10, 3, 41, 50] by applying θ_toy, and [7, 43, 22, 76, 10, 7, 43, 50] by using φ_toy. Note that, by using θ_toy, the two occurrences of the words "you" and "both" are mapped to different integers. Within our approach, the function θ is used for BERT word embeddings, whereas the function φ is used to encode the input text when Word2Vec word embeddings are employed.
Figure 1 shows the designed deep learning model schemes. In particular, we illustrate four deep learning models based on Dense, CNN, and LSTM layers that are available within the Keras framework (https://keras.io/, accessed on 11 February 2021). All the models have the same number of layers. It is worth noting that the input and output layers are the same among the models, to better compare their performances when considering only the type of neural network that they adopt. More precisely, the input layer is an Embedding layer, which has the goal of mapping the words of the input text to the underlying word embeddings. The last layer is a Dense layer that maps the intermediate results of the models to a single label that can only take the values 0 and 1. For doing so, it uses the sigmoid activation function to compute a probability that can easily be used to obtain the correct label value. In the next paragraphs, we give more details about the deep learning layers.

Deep Learning Models
The literature has already shown [50] that deep learning methods trained with word embeddings outperform those trained with tf-idf features. Therefore, we did not include the latter in our analysis, as we believe that they would not add value to the current evaluation.

Dense Model
The first model is depicted in Figure 1a. It is composed of two inner Dense layers with 128 and 64 neurons. They are densely-connected layers that are able to reduce an input of hundreds or thousands of nodes to a few nodes whose weights can be used to predict the final class of the input.

CNN Model
The CNN model that is depicted in Figure 1b is based on inner CNN layers. These layers perform filtering operations to detect meaningful features of textual input for the target toxicity facet. Filters can be envisioned as kernels that slide on the vector representation and perform the same operations on each element until all of the vectors have been covered. Two kernels of size 10 for the first layer and size 5 for the second layer are used.
For these layers, the same number of neurons previously introduced for the Dense layers is used to better compare the model performances.

LSTM Model
The model depicted in Figure 1c exploits LSTM layers to perform a binary classification of the input text. LSTMs are an extended version of Recurrent Neural Networks (RNNs) designed to work on sequences. They use memory blocks to hold the state of the computation, which makes it possible to learn temporal dependencies in the data, binding the chunks of data currently being processed to the chunks of data already processed. This allows for inferring semantic patterns that describe the history of the input data, solving the problem of common RNNs, whose results mostly depend on the last data fed into the model, smoothing out the relevance of previously processed data.

Bidirectional LSTM
The last model, as shown in Figure 1d, is an evolution of the LSTM model. It uses bidirectional LSTM layers to find patterns that can be discovered by exploring the history of the input data in both forward and backward directions. The idea of this kind of network consists of presenting the training data forwards and backward to the two bidirectional LSTM hidden layers whose results are then combined by a common output layer.
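A condensed Keras sketch of the four architectures described above may look as follows. The layer sizes (128 and 64 units) and the CNN kernel sizes (10 and 5) follow the text, as do the sigmoid output, rmsprop optimizer, and binary cross-entropy loss; the activations and the pooling steps that collapse the sequence dimension are illustrative assumptions, and padding of the integer-encoded input is omitted.

```python
from tensorflow.keras import layers, models

def build_model(kind, vocab_size, embedding_dim=300):
    """Sketch of the four evaluated architectures. The Embedding layer maps
    integer-encoded comments to word embeddings; the final Dense layer with
    a sigmoid outputs the probability of the target toxicity facet."""
    model = models.Sequential()
    model.add(layers.Embedding(vocab_size, embedding_dim))
    if kind == "dense":
        model.add(layers.GlobalAveragePooling1D())  # assumption: collapse sequence
        model.add(layers.Dense(128, activation="relu"))
        model.add(layers.Dense(64, activation="relu"))
    elif kind == "cnn":
        model.add(layers.Conv1D(128, kernel_size=10, activation="relu"))
        model.add(layers.Conv1D(64, kernel_size=5, activation="relu"))
        model.add(layers.GlobalMaxPooling1D())
    elif kind == "lstm":
        model.add(layers.LSTM(128, return_sequences=True))
        model.add(layers.LSTM(64))
    elif kind == "bilstm":
        model.add(layers.Bidirectional(layers.LSTM(128, return_sequences=True)))
        model.add(layers.Bidirectional(layers.LSTM(64)))
    model.add(layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="rmsprop", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```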

Word Embeddings Representations
In this section, the word embedding representations that are used to model the syntactic and semantic properties of the words in vectors of real numbers are introduced. Within this work, the employed word embedding representations are Word2Vec [9,13] and BERT [10]. We chose the most common sizes for the embeddings, i.e., 300 for Word2Vec embeddings, and 1024 for BERT word embeddings.

Word2Vec
The Word2Vec [9,13] word embedding generator aims to detect the meaning and semantic relations among words by investigating the co-occurrence of words in documents within a given corpus. The idea behind this algorithm is to model the context of words by exploiting Machine Learning and statistics and to come up with a vector representation for each word within the corpus. The resulting word vector representations allow the recognition of relatedness between words. For example, the verbs capture and catch, which are syntactically different but share a common meaning and present analogous co-occurring words, are associated with similar vectors. A Word2Vec model can be trained by using either the Continuous Bag-Of-Words (CBOW) or the Skip-gram algorithm. Within our work, the Skip-gram algorithm is adopted because, in a preliminary evaluation, it obtained higher performances. In detail, the following Word2Vec word embeddings are used:
• Pre-trained. Pre-trained word embeddings released by Google and available online (https://code.google.com/archive/p/word2vec/, accessed on 11 February 2021). They are trained on the Google News dataset and contain more than one billion words. However, their use can be limited by words that may be misspelled (e.g., words with orthographic errors) or domain-dependent words within the input data. These words are commonly referred to as Out Of Vocabulary (OOV) words.
• Domain-trained. Domain-trained word embeddings are trained on the original unbalanced dataset (we merged the training and the test set) provided by the Kaggle challenge. Note that we computed the domain-trained embeddings on the new training sets only (at each iteration of the 10-fold cross-validation procedure) of our evaluation strategy. Training the embeddings on the domain data solves the problem of OOV words, because a vector can be associated with each word. However, words that are not frequent within our data might have a vector that does not fully and correctly represent their semantics. The Skip-gram Word2Vec algorithm available within the gensim (https://radimrehurek.com/gensim/, accessed on 11 February 2021) library is used. The model is trained for 20 epochs.
• Mimicked. Mimicked word embeddings are embeddings of OOV words that are not present within the original model used to represent the text data, but are inferred by exploiting syntactic similarities with words that are in the originally considered vocabulary. In more detail, we used the algorithm proposed by [44], which is based on an RNN and works at the character level. Words within an original vector model representation are first encoded as sequences of characters, and characters are associated with new vector representations. Subsequently, by using a BiLSTM network, an OOV word w is associated with a new word embedding e. To create word embeddings for the OOV words, we used the default input dataset, the hyperparameters mentioned in [44], and the pre-trained Word2Vec Google embeddings.
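For reference, Skip-gram learns to predict each word's neighbors within a context window; the (target, context) training pairs it consumes can be generated as in this sketch (the window size here is illustrative; our actual embeddings were trained with the gensim implementation mentioned above).

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs as used by Skip-gram:
    each word is paired with the words within `window` positions of it."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs("toxic comments hurt people".split(), window=1)
```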

BERT
The BERT word embedding model was introduced in late 2018 by the authors in [10]. It is a novel model of pre-trained language representations that allows for the tuning of word vector representations to the meaning that the word has in a given context, overcoming word ambiguity issues. A famous example is usually given with the word bank. Consider the two sentences "The man was accused of robbing a bank" and "The man went fishing by the bank of the river". Context-free word embedding models describe the word bank with the same word embedding, i.e., they express all the possible meanings with the same vector and, therefore, cannot disambiguate the word senses based on the surrounding context. On the other hand, BERT produces two different word embeddings, coming up with more accurate representations for the two different meanings. To do so, BERT computes context-tuned word embeddings, resulting in more accurate representations that might lead to better model performances. In this work, the bert_24_1024_16 BERT model trained on book_corpus_wiki_en_cased is employed and fine-tuned by using the bert_embedding (https://pypi.org/project/bert-embedding/, accessed on 11 February 2021) library.

Word Embeddings Preparation
To load word embeddings into a deep learning model, they have to be organized into a matrix M. For Word2Vec word embeddings, the set WS of words in the input data is used to build M as a matrix of size (|WS|, 300), where the row with index φ(w), for w ∈ WS, contains the word embedding of the word w. If a word w is not present in the selected Word2Vec resource (e.g., when only pre-trained word embeddings are used), then the row with index φ(w) has all of its entries set to 0. Similarly, when the BERT embeddings are employed, the matrix M has size (|W|, 1024), where the row with index θ(w), for w ∈ W, contains the word embedding of the word w. The generated matrix M is loaded into the Embedding layer of the employed deep learning model to map the encoded textual comments to the correct word embeddings.
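The construction of M can be sketched as follows, using toy 4-dimensional vectors instead of the 300-dimensional Word2Vec ones; row 0 is left unused because the toy φ starts numbering at 1, and OOV words receive an all-zero row.

```python
def build_embedding_matrix(phi, word_vectors, dim):
    """Build M row by row: row phi[w] holds the embedding of w,
    or stays all-zero when w is out of vocabulary."""
    matrix = [[0.0] * dim for _ in range(max(phi.values()) + 1)]
    for word, idx in phi.items():
        if word in word_vectors:
            matrix[idx] = word_vectors[word]
    return matrix

phi = {"you": 1, "die": 2, "xyzzy": 3}            # integer encoding of WS
vectors = {"you": [0.1, 0.2, 0.3, 0.4],            # toy 4-d embeddings;
           "die": [0.5, 0.6, 0.7, 0.8]}           # "xyzzy" is OOV
M = build_embedding_matrix(phi, vectors, dim=4)
```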

Experimental Study
In this section, we describe the dataset used to perform our experiments, the obtained results, and the related discussion. All of the experiments are run using a 10-fold cross-validation setup. Each model is trained with batches of size 128 and is configured to train for at most 20 epochs. However, an early stopping method with a patience of 5 epochs and a delta of 0.05, which monitors the accuracy of the model, is embedded within the training stage. The loss function used to train the models is the binary cross-entropy, and the optimizer is rmsprop with the default learning rate of 0.001 provided by the used library. The domain-trained word embeddings have been computed on the training sets only at each iteration of the 10-fold cross-validation procedure. All of the other parameters have been empirically set on the basis of the models' performance and previous experience in past works [4,24]. The experiments have been carried out on a Titan X GPU mounted on a server with 16 GB of RAM.
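The early stopping criterion (presumably realized via Keras's EarlyStopping callback with monitor="accuracy", patience=5, min_delta=0.05) can be paraphrased in plain Python as a sketch that, given the accuracy observed at each epoch, returns the number of epochs actually run:

```python
def train_with_early_stopping(accuracy_per_epoch, patience=5,
                              min_delta=0.05, max_epochs=20):
    """Stop once the monitored accuracy has not improved by at least
    min_delta for `patience` consecutive epochs, up to max_epochs."""
    best = float("-inf")
    wait = 0
    for epoch, acc in enumerate(accuracy_per_epoch[:max_epochs], start=1):
        if acc > best + min_delta:   # meaningful improvement
            best = acc
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch         # early stop
    return min(len(accuracy_per_epoch), max_epochs)
```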

The Dataset
To perform our analysis, we employed the dataset released by a Kaggle competition (https://www.kaggle.com/, accessed on 11 February 2021). The dataset is collected from Wikipedia comments, which have been manually labeled into six different toxicity classes. It consists of training and test files. However, the original split is not kept, in order to apply the proposed approach and balance the data. The dataset is composed of more than 200 k comments and presents annotations for six different toxicity classes plus one more class when no toxicity is present. Table 1 reports the number of comments and the related percentage with respect to the original dataset (second and third columns) belonging to each of the seven resulting classes. The first row includes the comments that do not present toxicity; then, from the second row on, the number of comments for each toxicity class (toxic, severe toxic, obscene, threat, insult, identity hate) is reported. Besides, from Table 1 it is worth noting that the dataset is strongly imbalanced, as nearly 90% of the overall comments do not present toxicity. Therefore, the training of a model is biased, because the model does not have a sufficient number of examples of the minority class to correctly identify a pattern, as mentioned earlier in the paper. A trivial model that always predicts the majority class can obtain better performances, although it is not able to recognize elements that should belong to the minority class. Hence, having a balanced dataset is a common procedure in several classification tasks [51] and allows for better understanding of the performances of a model [12].
It follows that, for each toxicity class, we built a dataset where the number of positive examples (i.e., comments that present the target toxicity class) and the number of negative examples (i.e., comments that do not present that target toxicity class) are the same. Table 1 reports the size of the created datasets for each class under the Balanced dataset size column. Note that, for a given toxicity class, the negative examples are chosen among all the other classes, including the No toxic comments.
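The balancing step above can be sketched as a small helper. This is a hypothetical implementation of ours (the function name, the label representation as sets of class names, and the fixed seed are our assumptions, not from the paper): it keeps all positives for a target class and samples an equal number of negatives from all the remaining comments, including the non-toxic ones.

```python
import random

def build_balanced_dataset(comments, labels, target_class, seed=42):
    """comments: list of str; labels: list of sets of toxicity class names.
    Returns a balanced (X, y) for a binary classifier on target_class."""
    positives = [c for c, ls in zip(comments, labels) if target_class in ls]
    negatives = [c for c, ls in zip(comments, labels) if target_class not in ls]
    # Sample as many negatives as there are positives, from all other classes
    # (including the non-toxic comments).
    rng = random.Random(seed)
    sampled = rng.sample(negatives, k=len(positives))
    X = positives + sampled
    y = [1] * len(positives) + [0] * len(sampled)
    return X, y
```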

Baselines
For evaluation purposes, the deep learning models have been compared to a number of baselines. These are classical Machine Learning classifiers that are usually employed with tf-idf representations of textual resources [48]. More precisely, the deep learning models are compared against the following classifiers:
• Decision Tree (DT). The Decision Tree algorithm builds a model by learning decision rules that, when applied to the input features, can correctly predict the target class. The model has a root node that represents the whole set of input data. This node is subsequently split into its children by applying a given rule. The process is then recursively applied to the children as long as there are nodes that can be split.
• Random Forest (RF). This method trains several DTs on different samples of the input data and uses a majority voting strategy to predict the output classes. The strength of this algorithm is that each DT is trained individually; therefore, overfitting and errors due to biases are limited. We adopted a classifier that made use of 100 DT estimators.
• Multi-Layer Perceptron (MLP). This is a neural network composed of a single hidden layer of nodes. We used a layer with 100 nodes in our experiments.
For these classical Machine Learning methods employed as baselines, the adoption of plain word embeddings is not promising, as has already been shown in the literature [52]. In particular, when employing word embeddings with classical Machine Learning methods, they must be aggregated by operations such as averaging or summing before being fed to a given classifier. This causes a loss of the syntactic and semantic information expressed by the embeddings of each word.
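For reference, the tf-idf baselines can be sketched with scikit-learn. This is a minimal sketch of ours (the `train_baseline` helper and toy settings are assumptions), though the estimator configurations follow the description above: 100 trees for the RF and a single 100-node hidden layer for the MLP.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

def train_baseline(texts, y, clf):
    # tf-idf turns each comment into a sparse, weighted bag-of-words vector.
    vec = TfidfVectorizer()
    X = vec.fit_transform(texts)
    clf.fit(X, y)
    return vec, clf

baselines = {
    "DT": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(n_estimators=100),      # 100 DT estimators
    "MLP": MLPClassifier(hidden_layer_sizes=(100,),      # one 100-node layer
                         max_iter=200),
}
```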
Additionally, the Receiver Operating Characteristic Area Under the Curve (ROC-AUC) is also reported in Table 2 in order to position the performance of our models with respect to the best models proposed for the challenge's task.

Results and Discussion
In this section, we discuss the results of the experiments that we have carried out. They are reported in terms of ROC-AUC in Table 2 and in terms of precision, recall, and f-measure in Table 3. The results depicted in Table 3 show how the deep learning models perform against the baselines (classical Machine Learning approaches). For each deep learning model, the performance in combination with each embedding representation is illustrated as well.

Comparison with the Kaggle Challenge
Table 2 reports the ROC-AUC values of our deep learning approaches for each toxicity class, together with the average over all the classes. Note that it is not the purpose of this paper to compete with the other participants of the Kaggle challenge from which the data have been extracted and whose evaluation has been reported using the ROC-AUC. The best three approaches of the challenge were Toxic Crusaders, neongen & Computer says no, and Adversarial Autoencoder, which reported ROC-AUC values of 0.989, 0.988, and 0.988, respectively. The challenge task was to test any proposed approach on a highly unbalanced dataset. In this paper, instead, we wanted to study how deep learning methods and classical Machine Learning approaches (using tf-idf and word embeddings) perform on the toxicity problem without any bias (i.e., the imbalance of the data). Moreover, it has been proved that optimizing a method for the ROC-AUC does not guarantee optimization of the precision-recall curve [53]. This is why we included Table 3 with precision, recall, and f-measure metrics computed on the preprocessed balanced dataset. There are several heuristics and tuning steps that, in the presence of unbalanced datasets, can help achieve high ROC-AUC values. We could not apply them, since we used a balanced version of the original dataset.
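As an aside, ROC-AUC is computed from the model's predicted scores rather than from hard class labels, which is one reason it can diverge from precision/recall-based metrics. A toy scikit-learn sketch (the values are purely illustrative):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # e.g., sigmoid outputs of a classifier
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75: 3 of the 4 (negative, positive) pairs are ranked correctly
```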

Baseline Comparison
The results indicate that Dense- and CNN-based models are not much better than the baseline methods; in some cases, they are even outperformed. For example, considering the toxicity classes obscene and insult, the f-measure computed on the baseline predictions is higher than the one obtained by Dense- and CNN-based models. On the other hand, LSTM-based models are able to outperform the baseline methods with a minimum improvement in terms of f-measure of 0.01, i.e., 1% (see the obscene class), and a maximum of 0.058, i.e., 5.8% (see the toxic class). These results are similar, and sometimes even more noticeable, when Bidirectional LSTM layers are employed. Moreover, considering that, on the balanced dataset, every classifier obtains an f-measure that is always higher than 0.8, the improvements can be considered remarkable. The only drawback is the computational time needed to train the deep learning models. Nevertheless, the training time is not reported, since: (i) it is out of the scope of this study; (ii) with modern GPUs, it is feasible to train complex deep learning models; (iii) the training step must be executed only once; and, (iv) the computational time needed for the prediction step does not depend on the underlying model used for the training step.

Dense-Based Model
For the task of toxicity detection, the Dense-based model never obtains the best performances. In most cases, the best results with this model are obtained with the mimicked word embeddings, which achieve the highest f-measure score for four out of six classes. The pre-trained word embeddings obtain high performances too, especially for classes such as Toxic, Threat (where the f-measure is very close to that of the mimicked embeddings), and Insult. The domain-trained word embeddings never achieve high scores, except when precision is considered for the Insult class. The BERT word embeddings yield the worst performances.
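A minimal tf.keras sketch of such a Dense-based classifier on top of an embedding layer follows. The vocabulary size, sequence length, and layer widths are illustrative assumptions of ours, not the paper's exact configuration, and the embedding matrix is random rather than initialised from the mimicked, pre-trained, domain-trained, or BERT vectors used in the experiments.

```python
import numpy as np
import tensorflow as tf

VOCAB, DIM, SEQ_LEN = 5000, 300, 100  # illustrative sizes
model = tf.keras.Sequential([
    tf.keras.Input(shape=(SEQ_LEN,)),
    # In the experiments, this matrix would hold one of the four embedding
    # representations; here it is randomly initialised.
    tf.keras.layers.Embedding(VOCAB, DIM),
    tf.keras.layers.Flatten(),                       # dense layers need a flat input
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary: toxic vs. not
])
model.compile(loss="binary_crossentropy", optimizer="rmsprop",
              metrics=["accuracy"])
```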

CNN-Based Model
With the CNN-based model, the results do not improve with respect to the Dense-based model; in some cases, the performances are even lower. With this model, the best results are obtained by employing the mimicked word embeddings for the toxicity classes Toxic, Obscene, Threat, and Identity Hate. For the other toxicity classes, the best results are obtained using the pre-trained word embeddings. Domain-trained and BERT embeddings are not able to properly represent the domain knowledge for the CNN model; thus, their results are poor.

LSTM-Based Model
The LSTM-based model outperforms both the Dense- and CNN-based models, proving its suitability for detecting patterns of toxicity. As previously mentioned, mimicked word embeddings are the most effective representation for the deep learning model to learn and uncover toxicity from the text comments. Pre-trained and domain-trained word embeddings obtain good performances, and their results are not far from those of the model using the mimicked word embeddings.
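A minimal tf.keras sketch of the LSTM-based variant follows (sizes are illustrative assumptions of ours; in the experiments, the embedding matrix would be initialised from the mimicked, pre-trained, domain-trained, or BERT vectors rather than randomly). The recurrent layer reads the embedded sequence token by token, which is what lets it capture word-order patterns the Dense and CNN stacks miss.

```python
import numpy as np
import tensorflow as tf

VOCAB, DIM, SEQ_LEN = 5000, 300, 100  # illustrative sizes
model = tf.keras.Sequential([
    tf.keras.Input(shape=(SEQ_LEN,)),
    tf.keras.layers.Embedding(VOCAB, DIM),   # would hold the chosen embeddings
    tf.keras.layers.LSTM(100),               # reads the sequence of embeddings
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
out = model(np.zeros((2, SEQ_LEN), dtype="int32"))  # one score per comment
```

Wrapping the recurrent layer as `tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(100))` yields the BiLSTM variant evaluated in the next subsection.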
On the other hand, once again, BERT is not a good representation for the LSTM model. Except for BERT, the three other word embeddings adopted with the LSTM model outperform the baseline methods for almost every toxicity class.

BiLSTM-Based Model
Despite the higher complexity of the employed layers, the results of the BiLSTM (Bidirectional LSTM) model are similar to those obtained by the LSTM model: in some cases, the BiLSTM is able to outperform the LSTM; in others, it is not. Moreover, it differs from the other models in that its best performances for many classes are obtained using the domain-trained word embeddings. The pre-trained and mimicked word embeddings continue to show a good ability to represent the domain knowledge, and BERT embeddings are confirmed to be the least suitable choice for the task of toxicity detection. Similarly to the LSTM model, except when using BERT, the model outperforms the baselines in almost every toxicity class.

Overall Evaluation of the Deep Learning Models
The use of deep learning for the task of toxicity detection has shown good performances across all the toxicity classes. Additionally, it turns out that, despite the small size of the datasets employed for certain classes, the models are able to detect patterns that allow for correctly performing the classification. In more detail, the results suggest that the Dense and CNN models perform well, since their f-measure is always higher than 0.8, but, for the toxicity detection task, they are outperformed by the LSTM and BiLSTM models, which obtain an f-measure higher than 0.9 in most cases. The results are comparable between the LSTM and BiLSTM models; however, because BiLSTM-based models need more computational time to be trained than LSTM models, the latter are slightly preferred. It is worth mentioning that the current models are trained without the context surrounding the comments in the Wikipedia pages (from which the dataset was originally collected) and, therefore, they might lack information necessary to predict the correct class. One more obstacle might be the presence of figurative language within the comments, which might change the meaning of the sentences, thus misleading the models. For example, a frequent sentence like "I am going to kill you", pronounced after a mistake or an undesired change in the Wikipedia pages, does not necessarily convey a threat or hate; it may simply be a joke.

Overall Evaluation of Word Embeddings
From the results, it is noticeable that the Word2Vec algorithm is a good choice for representing textual resources to be parsed by deep learning models. The results suggest that mimicked word embeddings are the best choice, because they enclose the knowledge of pre-trained word embeddings built on a large dataset and do not suffer from the out-of-vocabulary (OOV) words problem [24]. Domain-trained word embeddings obtain good results but, in most cases, are outperformed. This may depend on the fact that the resources employed to train these embeddings are not very large and, besides, there is not a sufficient number of examples of toxicity, due to the unbalanced number of toxic comments in the input dataset (i.e., more than 200 k comments do not present toxicity; see Table 1).
Surprisingly, BERT embeddings perform badly for the task of toxicity detection, although they are currently the state-of-the-art word embedding representation. A possible motivation behind this finding is that assigning a different embedding to the same word is somehow misleading during the training of the deep learning models. More precisely, the tuning step performed to generate the BERT embeddings on our data is not able to capture the context of the words, due to the length of some input textual comments and to the typos and incorrect grammar often present within them, thus transferring possibly erroneous information to our deep learning models. One more reason might be the lack of the surrounding context of the comments, which might have limited the fine-tuning of the model, therefore causing the semantics of words to be captured badly. This fact is worth investigating, and a closer analysis of this problem is required.

Conclusions and Future Work
In this paper, we presented an assessment of various deep learning models fed with various word embedding representations to detect toxicity within textual comments. From the obtained results, we can state that toxicity can be identified by machine and deep learning approaches fed with syntactic and semantic information extracted from the text. We showed that the LSTM-based model is the first choice among the experimented models to detect toxicity. We also showed that the various word embeddings may represent the domain knowledge in a variety of ways, and that a single model for all cases might be insufficient. In particular, the results are encouraging when using mimicking techniques to deal with OOV words where there are not many examples to build significant domain-dependent word embeddings. As future work, we plan to perform a deeper assessment of deep learning models by using and combining different layers, to better detect patterns, also in real scenarios where classes may be unbalanced. Moreover, we would like to investigate other contextualized word embedding representations, such as ELMo [54], for the toxicity detection task. An analysis of which configurations, parameter settings, and heuristics may be adopted to tackle the same problem in the presence of highly unbalanced datasets is definitely a research direction we would like to investigate as well. Finally, we would like to investigate the impact of using different embeddings for the same word, since it might be the cause of the failure of BERT embeddings in our experiments. We also think that an ensemble strategy combining the proposed approaches could result in better overall performances, and we are also investigating this direction.