1. Introduction
In recent years, deep learning methods have achieved great success in solving various problems in fields including bioinformatics [1], cybersecurity [2], manufacturing [3,4], and natural language processing (NLP). NLP deals with creating computational algorithms for the automatic analysis and representation of human language. In NLP, neural networks achieve excellent results compared to traditional machine learning models, such as SVM (support vector machine) or logistic regression. Unlike traditional machine learning algorithms, deep learning algorithms can learn multiple levels of representation. Deep learning models (especially recursive neural networks) can also capture sequence information within the text (e.g., phrases), which makes them a more suitable option for NLP than the traditional methods. In recent years, convolutional neural networks (CNNs) have shown breakthroughs in some NLP tasks, such as text classification [5,6,7].
Nowadays, online platforms are a widespread phenomenon that enables users to communicate by exchanging messages. Moving human communication to online platforms is a double-edged sword. Benefits include the opportunity to share opinions and experiences, get immediate feedback, and discuss various topics. On the other hand, on these platforms we can observe vulgarities, hate speech, insults, or misinformation, which are referred to as antisocial behavior on the Internet [8]. Spreading misinformation on the Internet can take various forms, such as hoaxes, spam, rumors, or false reviews. We focused on two types of antisocial behavior: toxicity in comments and fake news [9]. A toxic comment is defined as a rude, disrespectful, or inappropriate comment that is likely to make other users leave the discussion. Toxic comments can appear in various areas, such as social networks or discussions related to news articles [10]. The most common type of antisocial behavior on the Internet is fake news, that is, news pieces that are intentionally and demonstrably untrue. Usually, these articles are designed to mislead, deceive, and influence people’s opinions. Fake news contains false information, the veracity of which can be verified [11].
Manually detecting and tracking online content is a very demanding and costly process. Machine-learning systems that prescreen content and identify suspicious cases have proven successful in detecting antisocial behavior. These algorithms may prove to be a viable solution to problems on social networking platforms.
Besides the traditional machine learning models, deep learning is very capable in the detection of various forms of antisocial behavior on the web. Deep networks (including different topologies of CNN) have been successfully used to automatically detect cyber-bullying in Twitter posts [
12,
13]. Deep networks are successful in other related tasks, such as hate speech detection [
14]. Besides the commonly used deep learning architectures, ensembles of deep networks can be used to improve the detection ratio [
15]. Deep learning methods are very popular in toxic comment classification and fake news detection. The authors of the study [
16] focused on the detection of fake news using neural network methods on two datasets that contained English news articles. To solve this problem, the authors used CNNs, RNNs (recurrent neural networks), unidirectional LSTM (long short-term memory), and bidirectional LSTM networks. In [
17], the authors focused on binary toxicity classification in online comments. The authors used the k-nearest neighbors, naive Bayes, and CNN models to detect toxicity in the comments. The CNN network proved to be the most successful one. In contrast to the previous study, the authors of [
18] focused on classifying toxicity in comments and minimizing identity bias. The authors showed that although a model works well on a dataset as a whole, it can still demonstrate bias at subgroup levels. In this experiment, they trained three models: LSTM, BERT, and a TF-IDF-based model. As in the previous studies, the authors of [
19] focused on detecting toxicity in online comments. The authors divided this research into two parts. In the first part, they used a binary classification to detect toxic comments correctly. In the second part, they used a multiclass classification model to determine the degree of toxicity. Deep networks can address different types of toxicity in the text using multilabel classification [
20,
21]. Capsule networks are also used to track the temporal aspects of toxicity in the comments [
22].
The use of ML models to detect antisocial behavior from the texts is well studied, and models based on neural networks often prove to be the most suitable for handling these tasks. However, there are still many open research issues. One of the problems lies in the lack of well-labeled data. Even if many public datasets are available (in multiple antisocial behavior detection areas), many of them are still human-labeled, which may incorporate bias into the data. On the other hand, automatic labeling (e.g., using lexicons) may be more efficient in processing more data but still is not very reliable. To address the bias which can be introduced by human labeling, techniques such as crowd sourcing can be utilized. In addition, many datasets in this domain are heavily imbalanced. Such class imbalance may influence the detection models’ performance, especially in the minor class, which usually represents the type of antisocial behavior (e.g., fake news articles, fake reviews, or toxic comments). Therefore, exploring the approaches that can sample the data in the minor classes to overcome the lack of data can be interesting. An important issue is generating new, artificial samples with the same characteristics as the original data.
In the work presented in this paper, we focused on using data augmentation techniques to mitigate the class imbalance by generating new, artificial samples from minor classes. We used simple, vocabulary-based text transformation methods that construct new samples by replacing certain words with their synonyms. Such methods have already been experimentally evaluated in several domains, e.g., clinical literature [
23], sentiment analysis [
24], or more recently in local (Portuguese) fake news detection [
24]. The main motivation of our research was to focus on the antisocial behavior detection domain. We selected two typical tasks within this area, both of which often involve processing imbalanced data. We then experimentally evaluated whether the application of EDA (easy data augmentation) techniques in these tasks could influence the performance of the detection models. We evaluated both the separate EDA techniques and a combination of all EDA methods applied at once, comparing the performance of the classification models on the original and EDA-extended datasets.
The paper is organized as follows:
Section 2 describes the data augmentation methods used in the text processing domain. The following section presents the datasets used in the study and their preprocessing.
Section 4 presents the deep learning model used in both evaluated tasks and presents the results of the experiments.
Section 5 presents the conclusions of the experiment’s results.
3. Data Understanding and Preprocessing
To evaluate the selected EDA methods, we used two datasets from the antisocial behavior detection domain—toxic comments and fake news datasets. In the first case, we used the Jigsaw toxic comments dataset (available online:
www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data, accessed on 6 July 2022), which is used to train the models for the detection of toxicity in comments related to news reports. The dataset consists of short text comments that contain very informal language and expressions, including emoticons, explicit language, or slang expressions. The data were collected by the Civil Comments platform and made available in the Jigsaw Unintended Bias in Toxicity Classification competition.
Figure 1 depicts a dataset sample to illustrate the content of the texts. Individual documents have lengths ranging from 1 to 1000 characters. The majority of the comments are short, ranging from 50 to 150 characters. The target attribute represents the toxicity score. The score values range from 0.0 to 1.0 and represent the fraction of raters who believed the label fit the toxicity type. We transformed the numeric target feature into a binary class, dividing the comments into toxic and nontoxic groups: comments with scores higher than 0.5 were considered toxic, and all other comments were considered nontoxic/neutral. The resulting binary target feature was unbalanced, as the toxic comments made up only 8 percent of the entire dataset.
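The binarization of the target attribute described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the example scores are hypothetical, and only the threshold of 0.5 (applied as a strict inequality) is taken from the text.

```python
from typing import List

TOXICITY_THRESHOLD = 0.5  # threshold stated in the text


def binarize_scores(scores: List[float],
                    threshold: float = TOXICITY_THRESHOLD) -> List[int]:
    """Map toxicity scores in [0.0, 1.0] to labels (1 = toxic, 0 = nontoxic)."""
    # A comment counts as toxic only if its score is strictly above the threshold.
    return [1 if score > threshold else 0 for score in scores]


def toxic_share(labels: List[int]) -> float:
    """Return the fraction of samples labeled as toxic."""
    return sum(labels) / len(labels) if labels else 0.0


# Hypothetical scores, used only for illustration.
example_scores = [0.0, 0.1, 0.5, 0.62, 0.05, 0.9, 0.3, 0.0, 0.2, 0.45]
example_labels = binarize_scores(example_scores)
```

Computing the toxic share on the resulting labels is a quick way to confirm the degree of class imbalance before any augmentation is applied.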
In the second task, we addressed the problem of fake news detection in news articles. The dataset consists of a total of 7500 news articles, whose texts are considerably longer than the comments in the previous dataset; the majority of the documents contain from 2000 to 5000 characters. In addition, as the data consist of news pieces from different online media, the texts are written more formally and use more polished language. The target class is binary, specifying whether a given article is considered a regular news piece or contains misinformation. In this case as well, the target class is unbalanced, with regular articles forming the major class and fake news the minor class.
In the data preprocessing phase, we performed only basic standard text preprocessing: converting the texts to lowercase, removing the punctuation marks, and dividing the sentences into tokens. We did not apply further text processing techniques in order to keep the data in a form as close to the original as possible. Data in datasets describing antisocial behavior contain various slang words, abbreviations, emoticons, or other forms of text that can express the features and characteristics of antisocial behavior. By removing these words, we could disrupt the main features and characteristics of the text, which could change the semantic meaning [46]. Especially with toxic comments, it is very important to keep the text as close to the original as possible to extract the features of toxicity properly. For the same reason, keeping the stop words, nonmeaningful words, and formulas is important. In the antisocial behavior detection domain, it is more suitable to avoid more advanced preprocessing methods (such as stemming, lemmatization, or stop word removal), as their application can result in a loss of important information typical for the style used in short texts (comments) [6]. As a text representation model, we used GloVe (Global Vectors for Word Representation) embeddings [47]. We aligned the sequences to the same length, using a maximum length of 200 for the toxic comments dataset and 2500 for the fake news dataset.
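The preprocessing steps described above (lowercasing, punctuation removal, tokenization, and alignment to a fixed length) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names and the padding token are hypothetical, punctuation is taken to be the ASCII set in Python's `string.punctuation`, tokens are split on whitespace, and padding is appended at the end of the sequence; the source does not specify these details.

```python
import string
from typing import List

PAD_TOKEN = "<pad>"  # hypothetical padding token, not taken from the paper

# Translation table that deletes ASCII punctuation (an assumption about
# what "punctuation marks" covers).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text: str) -> List[str]:
    """Lowercase the text, strip punctuation, and split it into tokens."""
    return text.lower().translate(_PUNCT_TABLE).split()


def pad_or_truncate(tokens: List[str], max_len: int,
                    pad_token: str = PAD_TOKEN) -> List[str]:
    """Align a token sequence to exactly max_len items."""
    clipped = tokens[:max_len]
    # Post-padding is an assumption; the padding side is not stated in the text.
    return clipped + [pad_token] * (max_len - len(clipped))


# Maximum length of 200 as stated for the toxic comments dataset.
example = pad_or_truncate(preprocess("You're SO wrong, lol!!"), 200)
```

Note that no stemming, lemmatization, or stop word removal is applied, in line with the reasoning above.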
While applying the selected EDA techniques, we extended the dataset by newly created artificial samples from the minor class. In this phase, it was necessary to correctly choose the parameters of the EDA methods—the number of words to be replaced, inserted, exchanged, or deleted. We trained multiple CNN models to find out which parameter value would be optimal and recorded the best results. On the toxic comments dataset, we decided to use the parameters
in synonyms replacement,
in random insertion and random swap, and
in the random deletion approach. Similarly, on the fake news dataset, we applied the EDA techniques using the following settings: synonyms replacement (
), random insertion (
), random swap (
), and random deletion (
). First, we gradually added individual techniques and, finally, we applied a combination of all EDA methods. The application of particular EDA techniques doubled the minor class records in the training data; the application of all EDA techniques resulted in five times more samples of the minor class.
Table 1 and
Table 2 then summarize the class attributes in both of the datasets before and after the application of EDA methods.
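The four EDA operations and their effect on the size of the minor class can be sketched as follows. This is an illustrative sketch rather than the implementation used in the experiments: the toy synonym dictionary, the function names, and the default values `n=1` and `p=0.1` are hypothetical, as the parameter values used in the paper are not reproduced here.

```python
import random
from typing import Dict, List

# Hypothetical toy synonym dictionary; a real setup would use a lexical resource.
SYNONYMS: Dict[str, List[str]] = {
    "fake": ["false", "bogus"],
    "news": ["report"],
    "claims": ["assertions"],
}


def synonym_replacement(tokens: List[str], n: int, rng: random.Random) -> List[str]:
    """Replace up to n words that have an entry in the synonym dictionary."""
    out = list(tokens)
    candidates = [i for i, tok in enumerate(out) if tok in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out


def random_insertion(tokens: List[str], n: int, rng: random.Random) -> List[str]:
    """Insert n synonyms of randomly chosen words at random positions."""
    out = list(tokens)
    with_synonyms = [tok for tok in tokens if tok in SYNONYMS]
    if not with_synonyms:
        return out
    for _ in range(n):
        word = rng.choice(with_synonyms)
        out.insert(rng.randint(0, len(out)), rng.choice(SYNONYMS[word]))
    return out


def random_swap(tokens: List[str], n: int, rng: random.Random) -> List[str]:
    """Swap the positions of two randomly chosen words, n times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens: List[str], p: float, rng: random.Random) -> List[str]:
    """Delete each word with probability p, keeping at least one word."""
    if not tokens:
        return []
    kept = [tok for tok in tokens if rng.random() > p]
    return kept if kept else [rng.choice(tokens)]


def augment_minor_class(samples: List[List[str]], n: int = 1, p: float = 0.1,
                        seed: int = 0) -> List[List[str]]:
    """Return the original samples plus one augmented copy per EDA technique."""
    rng = random.Random(seed)
    augmented = list(samples)
    for tokens in samples:
        augmented.append(synonym_replacement(tokens, n, rng))
        augmented.append(random_insertion(tokens, n, rng))
        augmented.append(random_swap(tokens, n, rng))
        augmented.append(random_deletion(tokens, p, rng))
    return augmented
```

Applying a single technique adds one copy per sample and thus doubles the minor class, while applying all four yields five times as many minor-class samples, which matches the proportions described above.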
4. Detection of Antisocial Behavior Using Deep Learning Methods
We chose a CNN [48] model for the experiments, as the architecture has proved to achieve good results in NLP tasks [13,49]. While maintaining performance, the CNN architecture proved to be much less computationally intensive than LSTM networks. We expected the effect of EDA augmentations to be very similar regardless of the model used. Both problems were binary classification tasks, which aim to classify data into one of two classes. In our case, we classified the data in the first dataset into toxic/nontoxic comments and, in the case of the fake news dataset, into fake/regular news. The entire preprocessing, training, and evaluation pipeline was implemented in Python using the standard analytical stack (e.g., the Pandas, TensorFlow, and scikit-learn packages).
The main idea of the experiments was to determine the effect of EDA augmentation techniques on the classification results. We used a simple convolutional neural network model, shown in Figure 2. It consisted of two convolution layers, two pooling layers, one flatten layer, and one regularization dropout layer. We used the checkpoint method to prevent overfitting [50]. The hyperparameters of the CNN model are summarized in Table 3. During the experiments, we gradually added individual EDA augmentation techniques to the data and monitored their influence on the resulting metrics.
We evaluated the models using standard classification metrics:
These metrics were computed using the coefficients derived from the confusion matrix (see
Table 4), which expresses the number of correct and incorrect predictions made by the classification model compared to the ground truth values in the testing data. In the formulas, TP, TN, FP, and FN stand for the numbers of true positive, true negative, false positive, and false negative predictions associated with the class attribute. These metrics were also used to measure the model performance on the minor class. To measure the overall model performance, we also used the AUC (area under the curve) score, which is the area under the ROC (receiver operating characteristic) curve and provides an aggregate measure of model performance across all possible classification thresholds.
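For reference, the metrics reported in the following sections (precision, recall, and the F1 score) are usually defined in terms of the confusion matrix entries as shown below. These are the standard textbook definitions and are assumed here to be the ones used in the experiments.

```latex
\begin{align}
\text{Precision} &= \frac{TP}{TP + FP}, \\
\text{Recall} &= \frac{TP}{TP + FN}, \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\end{align}
```

Because none of these formulas involves the true negatives, they describe the performance on the positive (minor) class specifically, which is why they are informative on heavily imbalanced data.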
4.1. Evaluation of Toxic Comments Detection
The basic model of the convolutional neural network without the extension of the training set reached a precision of
, a recall of
, and an F1 score of
in toxicity detection (minor class prediction, see
Table 5). After evaluating the basic model, we trained the models using EDA augmentation techniques.
Table 6 summarizes the overall performance (macro-averaged metrics) of the CNN model and compares different EDA methods applied to the original training data.
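The relationship between the minor-class metrics and the macro-averaged metrics can be illustrated with a short sketch. The function names and the example counts are hypothetical, and macro-averaged F1 is assumed to be the unweighted mean of the per-class F1 scores (the convention used, for example, by scikit-learn); the source does not state which convention was applied.

```python
from typing import Dict


def class_metrics(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Precision, recall, and F1 for one class from its confusion matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def macro_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Unweighted mean of the per-class metrics in a binary task."""
    positive = class_metrics(tp, fp, fn)
    # For the negative class the roles are mirrored: TN are its correct hits,
    # FN are its false positives, and FP are its false negatives.
    negative = class_metrics(tn, fn, fp)
    return {name: (positive[name] + negative[name]) / 2 for name in positive}


# Hypothetical counts for an imbalanced test set (100 positives, 900 negatives).
minor = class_metrics(tp=50, fp=50, fn=50)
macro = macro_metrics(tp=50, fp=50, fn=50, tn=850)
```

With these counts the minor class reaches an F1 score of 0.5, while the macro-averaged F1 is considerably higher because the major class is classified well. This is why both views of the results are reported.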
In this case, EDA augmentation techniques did not achieve the desired improvements in the detection of toxicity in the comments. Although we trained models with different augmentation techniques and parameter settings, we did not achieve a significant improvement. The toxic comments dataset contains comments with many slang words, dialect words, abbreviations, swear words, and words made up of different characters. Augmentation techniques that work with synonyms therefore cannot be used effectively, because such words are not in a standard synonym dictionary. If toxic words are replaced by words that do not constitute antisocial behavior, false negative cases will arise. With random deletion, the biggest problem is the length of the text: as these are short texts, toxic words are often deleted. After these words are deleted, the comment becomes neutral, which again produces false negative cases.
4.2. Evaluation on the Fake News Dataset
The CNN model was trained on the fake news dataset without the extension of the training data, with an unbalanced target attribute class. The training set consisted of more than
regular news pieces and only
fake news articles. After training the base model of the same architecture as in the previous dataset, we evaluated the model’s performance on the minor (fake news) class. The fake news dataset contained 460 positive cases in the training set.
Table 7 shows the confusion matrix of this model.
As in the previous dataset, using EDA augmentation techniques extended the data with roughly four times more fake news records.
Table 8 summarizes the results of the CNN model on the original data and data extended using EDA techniques. The CNN model achieved a precision of
, a recall of
, and an F1 score of
in the fake news class.
Table 9 summarizes the macro-averaged performance of the CNN model on the original and EDA-extended datasets.
The model, which used a combination of all EDA augmentation methods, significantly improved all metrics compared to other models. Using EDA, we increased the F1 score by 19 percent and recall by 27 percent when detecting fake news samples (minor class). The confusion matrix of this model is shown in
Table 10.
4.3. Comparison with Related Literature
The use of EDA techniques was already explored in the available literature [
39], where the authors evaluated these methods on five different text classification tasks. EDA boosted the model’s performance marginally, but significant improvements could be expected on the smaller datasets. Similar behavior was observed during our experiments. Especially on the fake news dataset, which consisted of approximately 6000 training samples, the improvements were most significant. To check the issue of the possible overfitting that EDA may cause, we evaluated the models’ overall performance and their performance in the minor class. This kind of evaluation supported the benefits that EDA applications can bring.
In addition, we could compare our results with those obtained by applying similar techniques to the same dataset. For example, in [51], the authors applied EDA and back-translation techniques to toxic comment classifiers based on traditional machine learning algorithms (logistic regression and support vector machine). Their baseline model achieved an F1 score of 0.677; after the EDA application, it improved to 0.736. Back-translation itself did not achieve better results. The relatively lower F1 values could be attributed to the usage of standard ML models. Moreover, the authors did not optimize the EDA parameters or combine the techniques. In [19], the authors used a technique similar to EDA for CNN-based toxic comments classification. They used synonym replacement, random mask, and unique word augmentations, which improved the baseline CNN model from an F1 score of 0.846 to 0.885 (using a combination of all techniques). Similar to our experiments, a combination of multiple techniques brought higher benefits to the model performance. The difference in the F1 score of the baseline model can be attributed to different test sets (the testing set in our experiments consisted of 20% of the samples, compared to a 10% test set in their experiments).
Comparison on the fake news task can be difficult, as there are multiple datasets, and their usage among the studies is rather inconsistent. The majority of current research [52,53,54] uses the COVID-19 fake news dataset. However, the mentioned studies use advanced deep learning models for classification and different augmentation techniques (translation, BiGRU-CRF, and CapsuleNet). In both cases, the effects of the applied techniques are quite similar, as they boost the F1 performance of the classifiers by 0.01. It is important to note that applying more advanced techniques can be very demanding on computational resources. EDA techniques are relatively simple to implement, yet their effects on classifier performance can be comparable.
4.4. Practical Implications
In this work, we have presented a CNN classification model, trained on an EDA-extended dataset, for the detection of two forms of antisocial behavior. Data analytical methodologies such as CRISP-DM [
55] describe the overall data analysis process in multiple steps. Such steps include data understanding and preparation, the training and evaluation of the analytical models, and the actual deployment of the model into production. In this paper, we focused mostly on the experimental evaluation of the model. However, as we used standard data analytical technologies in the implementation, it is relatively straightforward to serialize the developed models to transfer them into the production environment. In the studied domain, similar models could run as web services, consuming the input data from the sources and providing real-time predictions. Depending on the particular type of task, such models could be implemented as browser extensions highlighting the given text (e.g., toxic comments or unreliable news pieces) during web browsing. From a practical point of view, data can be accessed in real-time using public APIs (e.g., news articles or comments from social networks). Models can be serialized using standard Python tools (e.g., Pickle) and deployed as web services using a web framework (e.g., Flask). Such an approach enables the creation of an architecture where serialized models are used to score the incoming data on the back-end and feed the classification results to the front-end. In this case, the output of the model can be fed to the browser extension, able to highlight possible toxic comments or unreliable news pieces.
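The deployment idea outlined above (serialize a model, load it on the back-end, and score incoming texts) can be sketched using only the Python standard library. This is a hedged illustration: the keyword-weight dictionary is a hypothetical stand-in for a trained classifier, and all function names and the JSON payload format are assumptions. In a web framework such as Flask, the body of `handle_request` would sit inside a route handler.

```python
import json
import pickle
from typing import Any, Dict

# Hypothetical stand-in for a trained classifier: keyword weights and a threshold.
# A real deployment would serialize the trained model object instead.
model_state: Dict[str, Any] = {
    "weights": {"idiot": 0.9, "stupid": 0.7},
    "threshold": 0.5,
}


def serialize_model(state: Dict[str, Any]) -> bytes:
    """Serialize the model state so that it can be shipped to the back-end."""
    return pickle.dumps(state)


def load_model(blob: bytes) -> Dict[str, Any]:
    """Restore the model state on the back-end."""
    # Only unpickle data from a trusted source; pickle can execute arbitrary code.
    return pickle.loads(blob)


def score_text(state: Dict[str, Any], text: str) -> float:
    """Return a toy score: the highest weight among the tokens of the text."""
    tokens = text.lower().split()
    return max((state["weights"].get(tok, 0.0) for tok in tokens), default=0.0)


def handle_request(state: Dict[str, Any], request_body: str) -> str:
    """Score a JSON request of the form {"text": ...} and return a JSON response."""
    payload = json.loads(request_body)
    score = score_text(state, payload["text"])
    return json.dumps({"score": score, "flagged": score > state["threshold"]})


restored = load_model(serialize_model(model_state))
response = handle_request(restored, json.dumps({"text": "You are an idiot"}))
```

A front-end component such as a browser extension could then use the `flagged` field of the response to decide which texts to highlight.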
5. Discussion and Conclusions
The work presented in this paper focused on data augmentation techniques applied to text classification in the antisocial behavior detection domain. The main objective was to explore the possibility of using simple EDA augmentation techniques to overcome the class imbalance problem when solving antisocial behavior detection tasks using deep learning models. We evaluated EDA methods on two selected tasks: fake news detection and toxic comments classification. In both cases, we used the CNN classifier and compared its performance when trained on the original training set with its performance when trained on training sets enhanced using a combination of EDA techniques. The effect of EDA augmentation techniques on the model performance is strongly dependent on the dataset. Although multiple EDA techniques are available, they tend to work well mainly on datasets containing texts written in more formal language. This was evident in the performance boost of the CNN model on the fake news dataset, which was significant, improving the F1 score by 0.1.
From the perspective of the style and language of the texts, EDA techniques applied to the fake news dataset positively affected classification performance. This dataset comprised news pieces that are usually longer than discussion comments and written using more formal language. This task was much better suited to EDA synonym replacement and similar techniques. Using EDA, we could correctly generate the augmented data samples, contributing to model performance improvement. On the other hand, EDA techniques applied to the dataset of toxic comments did not improve the CNN model performance (only a 0.01 improvement in F1). The dataset mostly comprised short texts (discussion posts) and contained much informal content (e.g., slang expressions). Therefore, EDA methods relying on synonym replacement were unable to find suitable synonyms for many of the words typical for toxic behavior in the comments. The application of these methods did not generate augmented toxic samples suitable enough to improve the model’s performance.
In general, enhancing datasets (e.g., due to data scarcity or to balance the classes) in the NLP domain is a very difficult problem that attracts the attention of many research groups. Artificially generating new texts is a difficult task, and it is very challenging to synthetically generate the features present in texts written by humans (such as irony or sarcasm). Simple augmentation techniques such as EDA cannot reflect these complex issues, nor even simpler ones, such as considering the context of replaced words. However, their simple implementation and application while maintaining reasonable performance can present an advantage in certain applications. In the future, we expect that exploring character-level augmentation methods could be useful, as they can generate texts that contain spelling mistakes (which are very common in this type of data). Evaluating the suitability of other, more advanced augmentation methods, such as GANs, in this domain could also be interesting.