Social Media Hate Speech Detection Using Explainable Artificial Intelligence (XAI)

Abstract: Explainable artificial intelligence (XAI) characteristics have flexible and multifaceted potential in hate speech detection by deep learning models. The aim of this research was to interpret and explain the decisions made by complex artificial intelligence (AI) models in order to understand their decision-making process. As a part of this research study, two datasets were taken to demonstrate hate speech detection using XAI. Data preprocessing was performed to remove inconsistencies from the data, clean the text of the tweets, and tokenize and lemmatize the text. Categorical variables were also simplified in order to generate a clean dataset for training purposes. Exploratory data analysis was performed on the datasets to uncover various patterns and insights. Various pre-existing models, such as decision trees, k-nearest neighbors, multinomial naïve Bayes, random forest, logistic regression, and long short-term memory (LSTM), were applied to the Google Jigsaw dataset, among which LSTM achieved an accuracy of 97.6%. Explainable methods such as LIME (local interpretable model-agnostic explanations) were applied to the HateXplain dataset. Variants of the BERT (bidirectional encoder representations from transformers) model, such as BERT + ANN (artificial neural network) with an accuracy of 93.55% and BERT + MLP (multilayer perceptron) with an accuracy of 93.67%, were created to achieve a good performance in terms of explainability using the ERASER (evaluating rationales and simple English reasoning) benchmark.


Introduction
Artificial intelligence has entered various fields in recent times. Be it science, education, finance, or business, artificial intelligence has found applications everywhere. However, AI is currently limited largely to its subset "machine learning" and has not yet realized its full potential. Machine learning is the ability of computers to learn the relationship between input and output without being explicitly programmed. Thus, in contrast to traditional programming, which requires writing algorithms by hand, machine learning requires finding an algorithm that learns patterns from a given dataset and builds a predictive model capturing the relationship between input and output. The resulting model is then able to give predictions on new and unseen data. However, these models do not provide an explanation as to how different features contribute to the output. Thus, artificial intelligence traditionally functions like a black box. This characteristic may not provide the justifications needed in critical scenarios such as the diagnosis of life-threatening diseases and defense. If an explanation is given along with the output, then combined with human reasoning, it may prove significantly useful. This forms the basis of explainable artificial intelligence (XAI). XAI answers many questions along with providing the output. It is an emerging area of research and has found applications in varied fields.
Artificial intelligence is implemented as a "black box" that just gives the output after a certain input but how it is achieved is not revealed. While it may not be necessary to get the

Motivation
Artificial intelligence is implemented as a "black box" that just gives the output after a certain input, but how the output is achieved is not revealed. Machine learning has seen applications in various fields such as medicine, research, business, education, industry, chatbots, recommendation systems, and even self-driving cars. However, some machine learning models are not intuitive or transparent, which makes them difficult for people to understand. In such cases, these models may lose their effectiveness. In the past few years, deep learning models have also presented state-of-the-art results in many situations. However, deep learning models are still not able to justify whether they are making the right decision or not. XAI methods provide explanations that can be interpreted by humans without in-depth knowledge of deep learning models. XAI characteristics have flexible and multifaceted potential in hate speech detection by deep learning models. XAI, thus, provides a strong interconnection between an individual moderator and the hate speech detection framework, which is pivotal for research in interactive machine learning. As a model becomes more complex with an increased number of parameters, iterations, and optimization steps, it becomes even more difficult to validate its results. The main goal and intended contribution of this paper is to interpret and explain decisions made by complex artificial intelligence (AI) models in order to understand their decision-making process in hate speech detection. For this purpose, pre-existing models were applied to the Google Jigsaw dataset to obtain the best prediction accuracy, and explainable methods such as LIME (local interpretable model-agnostic explanations) were applied to the HateXplain dataset. Variants of the BERT (bidirectional encoder representations from transformers) model, such as BERT + ANN (artificial neural network) and BERT + MLP (multilayer perceptron), were created to achieve a good performance in terms of explainability using the ERASER (evaluating rationales and simple English reasoning) benchmark (DeYoung et al., 2019).

Literature Review
There has been recent research on hate speech detection using traditional natural language processing (NLP) techniques and machine learning methods [1-3]. The extraction of text-, user-, and network-based features and characteristics identifying bullies has been successful [4]. Furthermore, abusive language detection, including the identification of hate speech keywords, sexism, bullying, trolling, and racism, was studied in [1,5-7] using deep learning techniques.
In recent times, there has been an increased interest in the explainability of artificial intelligence techniques, including machine learning and deep learning methods, to understand the reasons for labeling text as hate speech, as well as in other social media and medical applications. A novel explanation method based on LIME [3,8] for the explanation of predictions made by a classifier was proposed [9], and the best practices for the usage of these interpretable machine learning models and their applications were also discussed [10-14]. Deep learning and active learning approaches were used for explainability in [8,15-17].
Explainable AI (XAI) has become very popular in recent times as a means to unravel how AI techniques make decisions. There have been novel definitions of explainable machine learning and deep learning [18], with a categorization of XAI techniques and methods based on factors such as their scope, methodology, algorithmic intuition, and explanation capability [19]. Available XAI models, such as LIME, layer-wise relevance propagation, and DeepLIFT, and how they are deployed were discussed in [20-23]. XAI has been applied in various applications such as the predictive maintenance (PdM) scenario in manufacturing [24] and social science research [25].
Table 1 gives a comprehensive explanation with contributions, findings, and limitations of these works.

| Ref. | Contribution | Key Findings | Limitation(s) |
|---|---|---|---|
| [1] | Automated hate speech detection and the problem of offensive language | Logistic regression, naïve Bayes, decision trees, random forests, and SVM are tested using 5-fold cross-validation | The definition of hate speech in this research is limited to language that threatens or incites violence, which excludes a large proportion of hate speech. Lexical methods used are inaccurate at identifying hate speech, and only a small percentage of tweets flagged by the Hatebase lexicon are considered hate speech. |
| [2] | Detection of offensive content and identification of potential offensive users | Lexical syntactical feature (LSF) framework | Comparison of existing text-mining methods in detecting offensive contents with the LSF framework is not detailed and lacks scientific validation. |
| [3] | A feature attribution method for explainability | Necessity and sufficiency explained in detail in the context of hate speech | The analysis is limited by the existing dataset used, which lacks a variety of demographic groups. |
| [4] | Detecting bullying and aggressive behavior on Twitter | Random forest classifier using WEKA tool, 10-fold cross-validation | Results obtained with the random forest classifier are only presented with respect to training time and performance due to limited space. |
| [5] | A unified deep learning architecture for abuse detection | Deep learning architecture for detection of abuse online | Network-related metadata are not considered in the dataset due to time limitations, as it takes a significant amount of computation to crawl Twitter data due to Twitter API rate limits. |
| [6] | A unified approach to explaining complex ML models | SHAP (Shapley additive explanations) framework for the explanation of complex, ensemble, and deep learning models | The SHAP model is not consistent with human intuition in some cases, which can lead to false positives or false negatives; a different approach is not considered in such cases. |
| [7] | Explanation of RNN predictions in sentiment analysis | Propagation rule for growing connections in recurrent neural network (RNN) architectures | The gradient-based sensitivity analysis used with the approach is not able to get an accurate relevance score when a sentiment is decomposed into words. |
| [8] | Intuitive explainability along with using various deep learning techniques | LIME explanation with individual examples | Some misclassification is observed in the case of nontoxic comments. The method to perform the pick-up step for images is not addressed in this research. |
| [10] | Interpretable machine learning models | Technical foundations of explainable artificial intelligence, presentation of practical XAI algorithms such as occlusion, integrated gradients, and LRP, importance, applications, challenges, and directions for future work | The explanations revealed by the model in this research are difficult for a human observer to interpret due to limited accessibility of the data representation. A deeper understanding of relevance maps is not obtained by the model. |
| [11] | Evaluation of interpretability and explainability in machine learning | Application-grounded, human-grounded, and functionally grounded approaches for evaluation of interpretability, discussion of open questions related to these evaluation approaches | The research is focused only on the taxonomy to define and evaluate interpretability and not on methods to extract explanations. |
| [12] | Framework for the explanation of the results of an artificial intelligence system | Proposed framework named "teaching explanations for decisions (TED)" to provide explanations of an AI system | The proposed TED framework assumes a training dataset that includes explanations and applies a Cartesian product using any machine learning algorithm to train the classifier instead of using a multitask setting. |
| [13] | Explainability of deep neural network models | Key directions for moving towards transparency of machine learning models, novel technological development for explainability | This research does not focus on the exact choice of deep neural network for any particular domain and instead is only focused on generalized conceptual developments. |
| [14] | Overview of interpretability of machine learning models | Need for diverse metrics for targeted explanations, suggestions for explainability of deep learning models | The study only focuses on an abstract overview of explainability without diving deep into explanation metrics. |
| [15] | Enhancing interpretability of tree-based machine learning models | Method for computation of the game theoretic Shapley values, local explanation method, tools for explainability using a combination of local explanation methods | The study focuses on XAI and its importance but fails to discuss the limitations of conventional AI and its combination with XAI. |
| [22] | Fuzzy systems for explainable artificial intelligence | Need, timeline, applications, and future work of fuzzy systems for XAI | The research fails to address how to arrive at a solution to the problems that are not measurable in the evolutionary fuzzy systems (EFS) patterns. |
| [23] | A literature survey on explainable artificial intelligence (XAI) terminology | Background, terminology, objectives of explainable artificial intelligence (XAI), natural language generation approach | The survey does not explain how to evaluate natural language generation (NLG). |
| [24] | Predictive maintenance case study based on explainable artificial intelligence (XAI) | A machine learning model based on a highly efficient gradient boosting decision tree is proposed for the prediction of machine errors or any tool failure. | Results of this research are presented using a generic dataset and not real data; however, the presented concept shows high maturity with promising results. |
| [25] | Insights from social sciences related to explainable artificial intelligence (XAI) | "Why" questions are diversified in explainable AI, explanations are biased and social | Adopting the work of this research into explainable AI is not a straightforward step, and the models discussed need to be refined and extended to provide a good explanatory agent. |

Materials and Methods
We used two datasets for hate speech detection using explainable artificial intelligence, and these datasets are discussed in this section. Both datasets contain text written in English. The Google Jigsaw dataset comprises user discussions from the talk pages of English Wikipedia. It was used with some simple models (e.g., decision trees) and some complex models (e.g., LSTM) to see how they compare with each other on a hate speech dataset, and various existing semi-interpretable models were trained on it. Unlike the HateXplain dataset, the Jigsaw dataset does not have human rationale annotations; hence, it is not possible to evaluate the ERASER benchmark on it. The HateXplain dataset contains posts from Twitter and Gab and is annotated, which makes it suitable for evaluating the ERASER benchmark for explainability.

Google Jigsaw Dataset
The first dataset that we used was released by Google Jigsaw as part of a Kaggle challenge. The dataset contains the following columns: comment, toxic, severe_toxic, obscene, threat, insult, and identity_hate. The dataset comprises discussions from Wikipedia. The labels in the dataset are multilabel, i.e., a particular text can belong to two or more classes. Table 2 shows the Google Jigsaw dataset details.

HateXplain Dataset
The second dataset used is the HateXplain dataset, which contains posts from Twitter and Gab. Combining these two sources, we obtained a dataset that contains over 20,000 posts labeled as hateful, offensive, or normal.
From Twitter, we randomly took 1% of the tweets from the period from January 2019 to June 2020. From Gab, we took the dataset provided in [26]. Reposts of the tweets were not considered, and duplicates were removed. This ensured that the tweets contained only textual data. However, emojis were kept as they contribute significantly to emotion detection. Moreover, all usernames were removed, and a token <user> was inserted in their place. Table 3 shows the HateXplain dataset details.
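The username replacement described above can be sketched with a regular expression. This is a minimal illustration rather than the pipeline used to build the dataset; the pattern for what counts as a username (an `@` followed by letters, digits, or underscores) and the function name `mask_usernames` are assumptions.

```python
import re

# Assumed pattern: '@' followed by one or more word characters.
USERNAME_PATTERN = re.compile(r"@\w+")


def mask_usernames(post: str) -> str:
    """Replace every @-mention with the placeholder token <user>."""
    return USERNAME_PATTERN.sub("<user>", post)


# Emojis are left untouched, in line with the description above.
print(mask_usernames("@alice you and @bob_99 should read this 🙂"))
```

A pattern this simple would also match the part after `@` in an e-mail address, so a production pipeline would likely need a stricter rule.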

Extracting the Dataset
The datasets were in the CSV (comma-separated values) format. A CSV file stores tabular data in plain text, with values separated by commas. Each line of a CSV file corresponds to one row of the dataset, the first row of the file being the header row, i.e., the row that contains the column or attribute names. The CSV files were loaded into a data frame using the Pandas library of Python. Pandas is used for data analysis and manipulation and is extensively used for data science and machine learning use cases.
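As a rough sketch of this loading step, the snippet below reads a small inline CSV into a data frame. The two rows are invented for illustration, and the column names follow the Jigsaw columns as listed in this paper; with a real file, a path would be passed to `pd.read_csv` instead of the in-memory buffer.

```python
import io

import pandas as pd

# Tiny inline stand-in for a CSV file; the rows are invented for illustration.
csv_text = """comment,toxic,severe_toxic,obscene,threat,insult,identity_hate
"Thanks for the helpful edit.",0,0,0,0,0,0
"You are an idiot, get lost.",1,0,0,0,1,0
"""

# The first line is treated as the header row by default.
df = pd.read_csv(io.StringIO(csv_text))

label_columns = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
print(df.shape)                       # (number of rows, number of columns)
print(df[label_columns].sum(axis=1))  # how many labels each comment carries
```

The per-row label count makes the multilabel nature of the data visible: a single comment can carry more than one label.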

Data Preprocessing and Cleaning
Preprocessing of data is a crucial step that impacts a model's performance. The data obtained from Twitter or other online sources are noisy and can have null or missing values, images, audio, video, etc. Preprocessing ensures that the data are cleaned, free from noise, and meaningful. However, we did not perform preprocessing for BERT-based models as these are pretrained language representation models. Moreover, BERT uses all the information in a sentence, including punctuation and stop words. For models not based on BERT, we used various Python libraries and functions for data preprocessing and cleaning.
A summary of the steps performed for preprocessing and cleaning of the dataset is given below.

1. Rows with missing labels were dropped as they do not contribute to the learning process.
2. Using the natural language toolkit (NLTK) library, tokenization was performed, i.e., tokens of the sentences were created.
3. Stop words (if, then, the, and, etc.) were removed to keep only the text that would contribute to the learning process.
Data cleaning is an essential step before training the model. It removes incorrect or inconsistent information and thereby improves data quality. Figure 1 shows the common steps performed in data cleaning. It includes the removal of unwanted observations followed by the correction of any structural errors that the observations in the dataset might have. The notion of "structural error" covers irregularities such as typos in the names of features, the same attribute appearing under different names, mislabeled classes, additional spaces, and newline characters. The next steps aim to manage unwanted outliers such as additional spaces, followed by the handling of any missing data in the dataset. The detailed steps performed for data cleaning are listed below [27].
1. Firstly, a regular expressions module was imported to help with data cleaning tasks. Regular expressions are sequences of characters that define patterns for matching against other strings. Python has a "re" module for finding patterns and strings using regular expressions. Regular expressions can be used to remove or replace certain characters as part of data cleaning and preprocessing.
2. Any newline characters or additional spaces were removed.
3. Any URLs were also removed as they do not contribute to the learning process.
4. Similarly, any other non-alphanumeric characters, including punctuation, were removed for the same reason. This covers the following characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. Only uppercase and lowercase letters along with digits 0-9 were kept.
5. Stopwords such as "the", "and", "then", and "if" were also removed as they do not add any information to the learning process. Python's NLTK library has stopwords in about 16 different languages. We imported the English stopwords to remove them from our dataset.
6. The outputs of these tasks were stored in a separate column, resulting in a column of tokenized words.
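The cleaning steps listed above can be sketched as a single function. This is an illustrative reconstruction, not the authors' code: the URL pattern, the function name `clean_text`, and the short `STOP_WORDS` set (a stand-in for NLTK's English stop word list, used here so the snippet runs without downloads) are all assumptions.

```python
import re
import string

# Small stand-in for NLTK's English stop word list (assumption).
STOP_WORDS = {"the", "and", "then", "if", "a", "is", "to"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
PUNCT_PATTERN = re.compile("[" + re.escape(string.punctuation) + "]")


def clean_text(text: str) -> list[str]:
    """Apply the cleaning steps in order and return the remaining tokens."""
    text = URL_PATTERN.sub(" ", text)            # remove URLs
    text = PUNCT_PATTERN.sub(" ", text)          # remove punctuation characters
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # keep letters, digits, whitespace
    text = re.sub(r"\s+", " ", text).strip()     # collapse newlines and extra spaces
    # Stop words are matched case-insensitively; the original casing is kept.
    return [tok for tok in text.split() if tok.lower() not in STOP_WORDS]


print(clean_text("If you visit http://example.com,\nthen the  page is GREAT!!!"))
```

URLs are removed before punctuation on purpose: stripping punctuation first would break a URL into fragments that the URL pattern no longer matches.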

Tokenization, Sentence Padding, and Lemmatization
Tokenization is the process in which sentences are divided into smaller parts called tokens. These tokens serve as the basis for stemming and lemmatization and can aid in finding various patterns in the text. The natural language toolkit (NLTK) library of Python provides functions to perform word tokenization. Tokenization can also be conducted below the word level to yield either subwords or characters. For example, the word "clearer" can be tokenized either into "clear" and "er" or into "c-l-e-a-r-e-r". In this study, we performed character tokenization that converts words into tokens represented as an array of integers, improving the efficiency of the learning process. We created a tokenizer object from a pretrained model that was imported and then fitted to the HateXplain dataset. This was achieved using the Keras and TensorFlow libraries.
Padding was performed so that all the inputs were of equal length, because batched neural network training requires all inputs to be of the same length. Originally, the raw text had words and sentences of different lengths. In exploratory data analysis, we observed that most sentences had a length of at most 200. Thus, we trimmed sentences with lengths greater than 200 and padded the remaining sentences.
Using natural language processing (NLP), word normalization was performed through lemmatization. In lemmatization, all words are reduced to their base/root forms. For instance, (1) go, going, gone, and goes are reduced to go, (2) read, reading, and reads are reduced to read, and (3) hated, hating, and hates are reduced to hate.
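The integer encoding and padding steps can be illustrated in plain Python. This is a simplified sketch and not the Keras tokenizer used in the study: the function names, the use of 0 as the padding value, and padding at the end of the sequence are assumptions. The functions accept any list of tokens (words, subwords, or characters); whitespace-separated words are used in the usage example only for readability.

```python
from collections import Counter


def build_index(token_lists, max_tokens=30000):
    """Map the most frequent tokens to integers starting at 1 (0 is kept for padding)."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    most_common = counts.most_common(max_tokens)
    return {tok: i for i, (tok, _) in enumerate(most_common, start=1)}


def encode_and_pad(tokens, index, max_len=200):
    """Convert tokens to integers, trim to max_len, and pad with zeros."""
    ids = [index[tok] for tok in tokens if tok in index]
    ids = ids[:max_len]                      # trim sequences that are too long
    return ids + [0] * (max_len - len(ids))  # pad sequences that are too short


corpus = [s.split() for s in ["hate speech is bad", "speech is free"]]
index = build_index(corpus)
print(encode_and_pad("speech is free".split(), index, max_len=5))
```

Tokens that are not in the index are dropped here; reserving a dedicated out-of-vocabulary integer would be an equally reasonable choice.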

Simplification of Categorical Values
The original dataset had seven columns: "unnamed", "count", "hate_speech", "offensive_language", "neither", "class", and "tweet". To simplify the dataset for an efficient training and learning process, only three columns were kept: text, category, and label. The "tweet" column was converted to a "text" column. The label was derived from the class column in the original dataset, and the category column had values of 0, 1, and 2 encoded from the columns (hate_speech, offensive_language, and neither) in the existing dataset. In this column, 0 represents hate_speech, 1 represents offensive_language, and 2 represents neither. Thus, the final dataset for the training and learning process had three columns: text, category, and label.
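A possible shape of this simplification in Pandas is sketched below. The text does not fully specify how the two derived columns divide the work, so the roles used here (a numeric `category` code and a readable `label` name, both taken from the class column) are an assumption, as are the invented example rows and the underscore spelling of the column names.

```python
import pandas as pd

# Invented rows in the layout described above.
raw = pd.DataFrame({
    "count": [3, 3],
    "hate_speech": [0, 2],
    "offensive_language": [3, 1],
    "neither": [0, 0],
    "class": [1, 0],
    "tweet": ["example tweet one", "example tweet two"],
})

CLASS_NAMES = {0: "hate_speech", 1: "offensive_language", 2: "neither"}

simplified = pd.DataFrame({
    "text": raw["tweet"],                    # the tweet column under its new name
    "category": raw["class"],                # numeric code: 0, 1, or 2
    "label": raw["class"].map(CLASS_NAMES),  # readable class name (assumed role)
})
print(simplified)
```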

Exploratory Data Analysis (EDA)
EDA is the process of investigating data and drawing out patterns and insights [28]. EDA helps one understand the data better. It helps in understanding the various attributes in the dataset and how they contribute to the target variable, as well as in identifying anomalies. EDA also reveals any inconsistent or incomplete data and thus serves as the basis of the data cleaning and preprocessing step. EDA helps with matching assumptions and intuitions with reality. Thus, EDA is a crucial step for proceeding intelligently with the subsequent steps in the machine learning process. Figure 2 captures the essence of exploratory data analysis.

Feature Extraction Methods
After the data are cleaned and preprocessed, they should be converted into a form that the model can understand. For this, all variables must be converted into numerical form. This process is called feature extraction or vectorization. It also contributes to dimensionality reduction and, hence, helps to keep only the features that improve the accuracy of the model. Feature extraction can be performed using several methods. The importance of the words occurring in the dataset can be gauged, and redundant data can be removed. New features can also be formed from existing ones. Through such methods, the features that matter can be retained and new features can be generated to form a better version of the original dataset. We used Count Vectorizer in this research, which converts text into a vector of token counts [29]. The TF-IDF (term frequency-inverse document frequency) statistic examines the relevance of a word to a document in a collection of documents. This is accomplished by multiplying two metrics: the number of times a word appears in a document and the word's inverse document frequency over the collection of documents. It has a variety of applications, including automatic text analysis and scoring words in machine learning techniques for natural language processing (NLP).
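To make the two vectorization ideas concrete, the following plain-Python sketch builds count vectors and then applies TF-IDF weighting. It is a minimal illustration and not the library implementation used in the study: the function names are hypothetical, and the inverse document frequency is the unsmoothed textbook form log(N / df), whereas library implementations typically apply smoothing and normalization.

```python
import math
from collections import Counter


def count_vectorize(docs):
    """Return the sorted vocabulary and one term-count vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for tokens in tokenized for tok in tokens})
    counts = []
    for tokens in tokenized:
        freq = Counter(tokens)
        counts.append([freq[tok] for tok in vocab])
    return vocab, counts


def tf_idf(counts):
    """Multiply each term count by idf = log(N / df) for that term."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # df[j]: number of documents that contain term j at least once.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    return [[row[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
            for row in counts]


docs = ["you are great", "you are awful awful"]
vocab, counts = count_vectorize(docs)
weights = tf_idf(counts)
print(vocab)
print(counts)
print(weights)
```

Words that occur in every document ("you", "are") receive a weight of zero, while a word concentrated in one document ("awful") is weighted up, which is the behavior the description above refers to as relevance.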

Classification Methods and Explainable Techniques
Different classifiers were used to predict hate speech on the Google Jigsaw dataset, namely, artificial neural network (ANN) [29], multilayer perceptron (MLP) [30], decision trees, KNN, random forest, multinomial naïve Bayes, logistic regression, and long short-term memory (LSTM). Explainability was described on the HateXplain dataset using BERT and LIME. We briefly discuss LSTM, BERT, and LIME in this section.

Deep Learning Model-Long Short-Term Memory (LSTM)
LSTM is an artificial recurrent neural network (RNN) architecture used in the field of deep learning. Unlike standard feedforward neural networks, LSTM has feedback connections. It can process not only single data points, but also entire sequences of data.
The input layer of the LSTM model was designed with a size of 30,000 × 128 (i.e., 3,840,000 parameters) in order to cover the comments of the whole dataset, as shown in Table 4. After lemmatizing, tokenizing, and removing stop words and punctuation marks, the top 30,000 words were taken for processing. Letters and digits can be represented uniquely using the 7-bit ASCII code, with 2^7 = 128. This layer takes the tokenized words as input and holds 3,840,000 parameters. Dropout layers randomly deactivate a fraction of the units during training in order to reduce overfitting; a commonly used dropout rate for LSTM is 0.2. The recurrent layer has 131,584 parameters, far fewer than the 3,840,000 parameters of the input layer. The dense layer accounts for 774 parameters, which equals 128 × 6 weights plus 6 biases, i.e., one output for each of the six classes.
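The parameter counts quoted above can be checked with a few lines of arithmetic. The sketch assumes a standard LSTM layer with 128 units fed by 128-dimensional word vectors, followed by a dense layer with six outputs; the number of units is not stated explicitly in the text and is inferred here because it reproduces the reported figures.

```python
vocab_size = 30_000  # top words kept after preprocessing
embed_dim = 128      # length of each word vector
lstm_units = 128     # assumed: the value that reproduces 131,584
num_classes = 6      # toxic, severe_toxic, obscene, threat, insult, identity_hate

# Input layer: one 128-dimensional vector per vocabulary word.
input_params = vocab_size * embed_dim

# A standard LSTM layer has 4 gates, each with input weights,
# recurrent weights, and a bias vector.
lstm_params = 4 * (embed_dim * lstm_units + lstm_units * lstm_units + lstm_units)

# Dense output layer: one weight per (unit, class) pair plus one bias per class.
dense_params = lstm_units * num_classes + num_classes

print(input_params, lstm_params, dense_params)  # 3840000 131584 774
```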
The data were divided into a 70-30 split, where 70% of the data were used for training and 30% for testing. The model defined above was then compiled with binary cross-entropy as the loss function and Adam as the optimizer, and it was fit on the training data with a batch size of 128.
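Binary cross-entropy suits this setting because the six Jigsaw labels are not mutually exclusive, so each label is treated as its own yes/no decision. The following sketch computes the loss for one comment in plain Python; the example probabilities are invented, and the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over independent yes/no labels."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


# One comment labeled toxic and insult (positions 0 and 4), nothing else.
y_true = [1, 0, 0, 0, 1, 0]
good = [0.9, 0.1, 0.1, 0.1, 0.8, 0.1]  # confident and mostly right
poor = [0.2, 0.6, 0.5, 0.4, 0.3, 0.7]  # mostly wrong
print(binary_cross_entropy(y_true, good))
print(binary_cross_entropy(y_true, poor))
```

The loss is lower for the first prediction than for the second, which is what the optimizer exploits during training.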
The accuracy obtained by the LSTM model was 97.6%, the precision was 0.85, the recall was 0.83, the macro F1-score was 0.84, and the specificity was 0.82.
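These metrics all derive from the four cells of a confusion matrix. The sketch below uses hypothetical counts, not the confusion matrix of this study, to show how accuracy can be noticeably higher than precision and recall when the negative class dominates. It computes the F1-score for a single class, whereas the value reported above is a macro average over classes.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common metrics from confusion-matrix counts for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,            # also called sensitivity
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for an imbalanced test set (100 positives, 900 negatives).
metrics = classification_metrics(tp=80, fp=20, tn=880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```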

BERT (Bidirectional Encoder Representations from Transformers)
The BERT model is a relatively new language model that was presented in a paper by Google in 2018 [31]. This model has presented state-of-the-art results in natural language processing. The key feature of BERT is the bidirectionality of the model. The BERT model makes use of the encoder component of the transformer to furnish the representation of words. BERT is used for the creation of language representation models that can serve various purposes.
BERT has a base layer of "knowledge" that is derived from its pretraining. From this base layer of "knowledge", BERT can be trained further to adapt to a given task. BERT's transformer processes any given word with respect to its relation to all other words in the sentence. This enables BERT to understand the context of the word after looking at all surrounding words, unlike other models that read the context of a word in one direction only. There is another BERT variant, called AngryBERT [32], that was trained specifically for the hate speech detection task. AngryBERT jointly learns hate speech detection with emotion classification. It can outperform standard BERT in some hate speech tasks. However, the objective of this research was to detect hate speech along with explainable AI to evaluate how explainable the current high-performing black-box algorithms can be. Therefore, standard BERT was applied rather than AngryBERT so as to not only learn the hate speech pattern using the standard BERT variant but also consider the cases where correctly identifying hate speech is difficult for machines (e.g., sarcasm), enabling the recipient of the explanations to make better decisions.
BERT is pretrained on two tasks:
1. Masked language model (MLM): In this task, BERT learns a featured representation for each of the words present in the vocabulary. About 85% of the words are used for training, and the remainder are used for evaluation. The selection of the training and evaluation sets is random and is repeated over iterations. Through this process, the model learns featured representations in a bidirectional way, i.e., it learns both the left and the right contexts of the words. In this task, some of the tokens from each sequence are replaced with the token [MASK]. The model is trained to predict these tokens using the other tokens from the sequence.
2. Next sentence prediction (NSP): In this task, BERT learns the relationship between two different sentences. This task contributes to aspects such as question answering. The model is trained to predict the next sentence. It is similar to the textual entailment task in that there are two sentences; it is a binary classification task to predict whether the second sentence succeeds the first sentence.
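The two pretraining tasks can be illustrated by how their training examples are built. The following is a simplified sketch, not BERT's actual data pipeline: the function names are hypothetical, the masking rate is a free parameter (the default of 0.15 follows the original BERT paper and is an assumption here), and the original procedure's occasional substitution of random or unchanged tokens for [MASK] is omitted.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK].

    Returns the masked sequence and a dict mapping each masked position
    to the original token the model would have to predict.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

def nsp_pairs(sentences, seed=0):
    """Build (sentence_a, sentence_b, is_next) triples for next sentence prediction."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        pairs.append((sentences[i], sentences[i + 1], 1))   # true successor
        candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
        if candidates:
            pairs.append((sentences[i], rng.choice(candidates), 0))  # random non-successor
    return pairs

tokens = "the model learns context from both directions".split()
masked, targets = mask_tokens(tokens, mask_prob=0.5, seed=1)
print(masked, targets)

sentences = ["First sentence.", "Second sentence.", "Third sentence."]
pairs = nsp_pairs(sentences)
print(pairs)
```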

Local Interpretable Model-Agnostic Explanations (LIME)
LIME is an acronym for local interpretable model-agnostic explanations. Each part of the name reflects a property we want an explanation to have. Local fidelity refers to the need for the explanation to accurately reflect the classifier's behavior "around" the instance being predicted. An explanation is pointless unless it is interpretable, i.e., unless it can be understood by a person. LIME is model-agnostic in that it is capable of giving explanations for the predictions of any supervised learning model. LIME can be used with all types of data, be it text, images, or videos, and in all domains. LIME provides local interpretable explanations by computing important features and attributes for a given data point. It works by assigning weights to the data rows and obtaining the important features using feature selection techniques. LIME is especially successful in explainable artificial intelligence (XAI). LIME is a concrete implementation of local surrogate models. Surrogate models are trained to approximate the predictions of the underlying black-box model. Methods such as SHAP (Shapley additive explanations), counterfactual explanations, and other language interpretability tools can also be used to explain black-box models. However, the reason for using LIME is that it uses Lasso or short trees, which results in selective and concise explanations that are more human-friendly. In social media arbitration, the recipient of explanations is often a layman or someone with very little time. Figure 3 shows an example of an explanation by LIME [34].
First and foremost, we provide a discussion on interpretability. Some classifiers employ representations that are completely unfamiliar to consumers (e.g., word embeddings). LIME describes those classifiers in terms of interpretable representations (words), even if that is not the representation actually used by the classifier. Furthermore, LIME considers human constraints, such as the length of explanations.
Model agnosticism refers to LIME's ability to provide justification for any form of supervised learning model prediction. This method can be used with any type of data, including images, text, and video. LIME generates locally optimal explanations by computing essential features in the immediate neighborhood of the instance to be explained. To remain model-agnostic, LIME cannot look inside the model. Instead, we perturb the interpretable input around its neighborhood and check how the model's predictions respond, in order to figure out which sections contribute to the prediction. The perturbed data points are then weighted according to their proximity to the original example, and an interpretable model is learned on the basis of those points and the related predictions. By default, it generates 5000 samples of the feature vector, all of which follow normal distributions. After producing these normally distributed samples, it obtains the target values for them from the model whose decisions are being explained. Once the locally created dataset and its predictions are available, it allocates a weight to each row according to how close the row is to the original sample. Then, it extracts relevant features using a feature selection technique such as Lasso or PCA (principal component analysis). In the field of XAI, LIME has found great success and support, and it has been used for text, image, and tabular data.
In summary, LIME tweaks the inputs and observes the changes that happen in the predictions. LIME generates a new dataset consisting of varied inputs and the corresponding predictions of the black-box model. On this dataset, LIME trains an explainable model in which each instance is weighted by its proximity to the instance being explained. The trained model achieves a good local approximation, giving rise to the name local interpretable explanations. The explainable model trained for an instance minimizes a loss that measures how close the explanation is to the prediction of the original model, while keeping the model complexity low. LIME optimizes the loss part, and the user specifies the complexity of the model. A noteworthy feature of LIME is that it is applicable and extensible to all key machine learning fields. In the domain of text processing, embeddings and vectorizations of a given word or sentence can be considered the basic unit for sampling. In the case of an image, segmented chunks of the image are used as input samples.
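The procedure described above (perturb, query the black box, weight by proximity, fit a simple model) can be sketched in a few lines. This is an illustration of the idea under stated assumptions, not the LIME library's implementation: words are dropped at random instead of drawing normally distributed samples, the distance is the fraction of words removed, the surrogate is a weighted ridge regression rather than Lasso, and `toy_black_box` is a hypothetical classifier.

```python
import math
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def explain_locally(text, predict_proba, num_samples=500,
                    kernel_width=0.75, ridge=1e-3, seed=0):
    """Return (word, weight) pairs from a proximity-weighted linear surrogate.

    Assumes `text` contains at least one word.
    """
    words = text.split()
    d = len(words)
    rng = random.Random(seed)
    # Row 0 is the unperturbed instance; the other rows drop words at random.
    masks = [[1] * d] + [[rng.randint(0, 1) for _ in range(d)]
                         for _ in range(num_samples - 1)]
    rows, targets, weights = [], [], []
    for z in masks:
        perturbed = " ".join(w for w, keep in zip(words, z) if keep)
        distance = 1.0 - sum(z) / d                  # fraction of words removed
        rows.append([1.0] + [float(v) for v in z])   # leading 1.0 is the intercept
        targets.append(predict_proba(perturbed))     # black-box query
        weights.append(math.exp(-(distance ** 2) / kernel_width ** 2))
    # Normal equations of weighted ridge regression:
    # (X^T W X + ridge * I) beta = X^T W y
    k = d + 1
    a = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights))
          + (ridge if i == j else 0.0) for j in range(k)] for i in range(k)]
    b = [sum(w * r[i] * y for r, w, y in zip(rows, weights, targets))
         for i in range(k)]
    beta = solve_linear_system(a, b)
    return sorted(zip(words, beta[1:]), key=lambda p: abs(p[1]), reverse=True)

def toy_black_box(text):
    """Hypothetical classifier: the score rises when flagged words are present."""
    tokens = text.split()
    score = 0.1 + 0.6 * ("awful" in tokens) + 0.2 * ("people" in tokens)
    return min(score, 1.0)

explanation = explain_locally("those awful people ruin everything", toy_black_box)
for word, weight in explanation:
    print(f"{word}: {weight:+.3f}")
```

Because the toy classifier is itself linear in word presence, the surrogate recovers its coefficients almost exactly; for a real black-box model, the weights are only a local approximation around the explained instance.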

Model Training and Evaluation for Google Jigsaw Dataset
The results of all the models on the Google Jigsaw dataset, evaluated in terms of their accuracy, precision, and macro F1-score, are shown in Figure 4. Table 5 gives the scores of the evaluation metrics. It can be observed that LSTM was the best-performing model with an accuracy of 97.6%, closely followed by logistic regression with an accuracy of 97% and multinomial naïve Bayes with an accuracy of 96%. Random forest showed the highest precision of 90%, followed by the KNN classifier with a precision of 88%.

BERT + MLP
This section provides a discussion on the training of the dataset using the BERT model along with other techniques to provide explainability. BERT is a machine learning framework for NLP tasks that is designed to help computational systems understand the complex structure of language in a given text by using the surrounding text to establish meaning. A BERT model and a preprocessor model were selected from TensorFlow Hub (TensorFlow Hub, 2021). There are various methods to deal with unbalanced data, such as sampling techniques (upsampling and downsampling) where data are resampled, weighted loss where losses are weighted differently for classes affected by the imbalance, and data augmentation, which is used to artificially create variations in the existing dataset. In this research, unbalanced data were dealt with using weight optimization, and the bias was set. For weight optimization, appropriate weights were calculated for each class, depending upon their proportion. Each class was then multiplied by its weight factor so that the bias between classes could be removed.
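As an illustration of weights that depend on class proportion, the sketch below uses the common inverse-frequency heuristic weight = n / (k × count), where n is the number of examples and k the number of classes. The exact formula used in this study is not stated, so this heuristic, the function name, and the toy label counts are assumptions.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency class weights: rarer classes receive larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Toy label distribution; not the class counts of the dataset used here.
labels = ["normal"] * 6 + ["offensive"] * 3 + ["hate"] * 1
weights = balanced_class_weights(labels)
for cls, w in sorted(weights.items(), key=lambda p: p[1]):
    print(f"{cls}: {w:.3f}")
```

With these weights, the total weight contributed by each class to the loss is equal, which is one way to read "removing the bias between classes".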
Next, BERT was trained with the MLP model. Table 6 depicts the model summary for the BERT + MLP model, where the first column indicates the type of the layer, the second and third columns indicate the output shape and the number of parameters generated by processing of the layer, respectively, and the last column represents the previous layer it is connected to. There were a total of 29,027,844 parameters. Among them, 29,027,843 were trainable parameters and only one was a nontrainable parameter.
As shown in Figure 5, the architecture of the BERT + MLP model was fine-tuned in order to achieve the most efficient performance. The model contained one input and preprocessing layer, along with the BERT encoder, which was a Keras layer. A dense layer was used after the Keras layer to reduce the parameters and increase the number of features being propagated to the next layer. A dropout layer was added to avoid overfitting of the model, followed by one dense layer used to represent the results as a classification problem. The model defined above was then compiled with sparse categorical cross-entropy as the loss function and the Adam optimizer.

BERT + ANN
Next, BERT was used with ANN to train the model and evaluate the performance. Table 7 depicts the model summary for the BERT + ANN model, where the first column indicates the type of the layer, the second and third columns indicate the output shape and the number of parameters generated by processing of the layer, respectively, and the last column represents the previous layer it is connected to. There were a total of 28,835,428 parameters, of which 28,835,427 were trainable and one was nontrainable.
As shown in Figure 6, the architecture of the BERT + ANN model was fine-tuned in order to achieve the most efficient performance. The model contained one input and preprocessing layer, along with the BERT encoder, which was a Keras layer. BERT was combined with convolution layers, followed by a 1D global max-pooling layer, which computed the maximum over all positions for each of the input channels. A dense layer was introduced after the 1D global max-pooling layer to reduce the parameters and increase the number of features being propagated to the next layer. In the end, a dropout layer was added to avoid overfitting, followed by a dense layer. The model defined above was then compiled with sparse categorical cross-entropy as the loss function and the Adam optimizer.
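The 1D global max-pooling step can be illustrated without a deep learning framework. In the sketch below, which assumes a single example laid out as a list of positions with one value per channel, the layer keeps the largest value of each channel across all positions; the feature values are made up.

```python
def global_max_pool_1d(sequence):
    """Return the maximum of each channel over all positions.

    `sequence` is a list of positions, each a list with one value per channel.
    """
    return [max(channel) for channel in zip(*sequence)]

# Three positions (rows) with three channels (columns); hypothetical values.
features = [
    [0.1, 0.9, -0.3],
    [0.7, 0.2, 0.0],
    [0.4, 0.5, -0.1],
]
pooled = global_max_pool_1d(features)
print(pooled)  # [0.7, 0.9, 0.0]
```

The output has one value per channel regardless of the sequence length, which is why a dense layer can follow directly.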
The BERT + ANN and BERT + MLP models were trained for 50 epochs. As the number of epochs increased, the accuracy improved. The parameters used to find the number of training steps and the number of warmup steps were as follows: number of epochs = 50, number of training steps = steps per epoch × number of epochs, and number of warmup steps = 0.1 × number of training steps.
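The step arithmetic above can be written out directly. In the sketch below, the dataset size, batch size, and peak learning rate are hypothetical placeholders, and the linear warmup followed by linear decay is an assumed schedule shape; the text only fixes the number of epochs and the 10% warmup fraction.

```python
import math

def schedule_steps(num_examples, batch_size, epochs=50, warmup_fraction=0.1):
    """Derive training and warmup step counts from the dataset size."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    num_train_steps = steps_per_epoch * epochs
    num_warmup_steps = int(round(warmup_fraction * num_train_steps))
    return steps_per_epoch, num_train_steps, num_warmup_steps

def learning_rate(step, num_train_steps, num_warmup_steps, peak_lr=3e-5):
    """Linear warmup to `peak_lr`, then linear decay to zero (assumed shape)."""
    if step < num_warmup_steps:
        return peak_lr * step / num_warmup_steps
    return peak_lr * (num_train_steps - step) / (num_train_steps - num_warmup_steps)

# Hypothetical dataset size and batch size.
steps_per_epoch, num_train_steps, num_warmup_steps = schedule_steps(16000, 32)
print(steps_per_epoch, num_train_steps, num_warmup_steps)
print(learning_rate(num_warmup_steps, num_train_steps, num_warmup_steps))
```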
The accuracies obtained by the BERT + MLP and BERT + ANN models were 93.67% and 93.55%, respectively, indicating that the gap in conventional evaluation metrics was minimal; however, in terms of the explainability metrics, BERT + ANN performed slightly better than the BERT + MLP model, as discussed later in this section.

LIME with Machine Learning Models
This section discusses the implementation of LIME with linear machine learning models in order to provide explainability and interpretability.
The same labeled dataset used for BERT with ANN and MLP was used for LIME. LIME was applied to linear, noncomplex machine learning models such as random forest, naïve Bayes, decision tree, and logistic regression to extract the explanations. Table 5 summarizes the accuracy achieved by each of the models on the HateXplain dataset. It can be seen that logistic regression performed the best with an accuracy of 88.57%.
In this section, LIME classification is demonstrated using an example. The comment text was as follows: "@ComedyPosts: Harlem shake is just an excus to go full retard for 30 s". After preprocessing, the comment text was reduced to "comedypost harlem shake excus go full retard second". This comment text was taken from the corpus in the preprocessed pandas data frame and passed to the LIME text explainer for each of the machine learning models. The same comment text was used for all the models so that the models could be compared.

Explainability with Random Forest
Figure 7 shows the explainability with LIME and random forest for the example tweet. It can be observed that the LIME explainer gave weights to each useful word in the comment to indicate its importance in the overall decision making. From Figure 7, we can see that words such as "excus" and "retard" had the highest weights in contributing to the overall prediction probability, at 0.10 and 0.07, respectively. The prediction probability of the tweet being considered hate speech was reduced by the word "full". Text that contributed in either direction is highlighted on the right side of the figure. The overall prediction probability for hate speech was 90% using the random forest classifier.

Explainability with Gaussian Naïve Bayes
Figure 8 shows the explainability with LIME and Gaussian naïve Bayes for the example tweet. It can be observed that the LIME explainer gave weights to each useful word in the comment to indicate its importance in the overall decision making. From Figure 8, we can see that words such as "full" and "excus" had the highest weights in contributing to the overall prediction probability, at 0.08 and 0.07, respectively. Interestingly, the prediction probability of the tweet being considered hate speech was reduced by the word "retard" in the case of the Gaussian naïve Bayes classifier. The word "retard" had a weight of 0.14, which raised the overall prediction probability of the text not being hate speech to 20%. Text that contributed in either direction to the prediction is highlighted on the right side of the figure. The overall prediction probability for hate speech was 80% using the Gaussian naïve Bayes classifier.

Explainability with Decision Tree
Figure 9 shows the explainability with LIME and decision tree for the example tweet. It can be observed that the LIME explainer gave weights to each useful word in the comment to indicate its importance in the overall decision making. From Figure 9, we can see that words such as "full", "excus", and "retard" had the highest weights in contributing to the overall prediction probability, at 0.07, 0.06, and 0.06, respectively. The decision tree classifier did not assign any word a weight toward predicting the comment as non-hate speech. We can see the trend in the text with highlighted words. The overall prediction probability for hate speech was 100% using the decision tree classifier.

Explainability with Logistic Regression
Figure 10 shows the explainability with LIME and logistic regression for the example tweet. It can be observed that the LIME explainer gave weights to each useful word in the comment to indicate its importance in the overall decision making. From Figure 10, we can see that words such as "excus" and "second" had the highest weights in contributing to the overall prediction probability, at 0.03 and 0.04, respectively. On the other hand, words such as "retard" and "full" contributed to the text not being hate speech, with weights of 0.04 and 0.03, respectively. Text that contributed in either direction of prediction is highlighted on the right side of the figure. The overall prediction probability for hate speech was 95% using the logistic regression classifier.

Summary of Results for the HateXplain Dataset
The results of all the models on the HateXplain dataset, evaluated in terms of their accuracy, precision, and macro F1-score, are visualized as a bar chart in Figure 11. Table 8 gives the evaluation scores for all the models. It can be observed that the BERT variants performed significantly better than the linear explainable models, with BERT + MLP having the highest accuracy of 93.67%, closely followed by BERT + ANN with an accuracy of 93.55%. Measures such as precision, recall, and macro F1-score also indicated that the BERT variants outperformed the linear models. Logistic regression with LIME performed best among the linear models, with an accuracy of 88.57% and a macro F1-score of 93.75%.

Explainability Metrics
We used the ERASER benchmark [35] in order to measure the explainability of the trained models. ERASER (evaluating rationales and simple English reasoning) is a benchmark for evaluating rationalized NLP models, which was proposed by DeYoung et al. (2020). Explainability is evaluated by measuring the agreement with human rationales. Measuring exact matches between predicted and reference rationales is likely too harsh; thus, explainability is assessed by measuring plausibility and faithfulness. A prediction is counted as a match if any of the word predictions overlap with the rationales annotated by humans. Token-level calculations are compared with human annotations to derive the explainability. Various measures from the ERASER benchmark were used to calculate these comparisons.
Plausibility is the measure of how cogent the interpretation is to a human. To measure plausibility, the IOU (intersection over union) F1-score, the token F1-score, and the area under the precision-recall curve (AUPRC) were calculated. The IOU F1-score was calculated at the token level. Partial matches were considered where the prediction overlapped more than 0.5 with any of the ground truth rationales. Token-level F1-scores were measured from the token-level precision and recall. AUPRC was used to measure soft token scoring. Higher values of all these metrics indicated greater plausibility.
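The token-level overlap measures behind plausibility can be sketched as follows. Rationales are represented as sets of token positions, and the 0.5 threshold follows the text above; the function names and the example positions are illustrative assumptions rather than the ERASER reference implementation.

```python
def iou(pred, gold):
    """Intersection over union of two collections of token positions."""
    pred, gold = set(pred), set(gold)
    union = pred | gold
    return len(pred & gold) / len(union) if union else 0.0

def is_partial_match(pred, gold_rationales, threshold=0.5):
    """A prediction matches if its IOU with any human rationale exceeds the threshold."""
    return any(iou(pred, gold) > threshold for gold in gold_rationales)

def token_f1(pred, gold):
    """F1-score computed from token-level precision and recall."""
    pred, gold = set(pred), set(gold)
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical token positions chosen by a model and by a human annotator.
predicted = [3, 4, 5]
human = [3, 4]
print(iou(predicted, human), is_partial_match(predicted, [human]), token_f1(predicted, human))
```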
Faithfulness measures how accurately the explanation reflects the true reasoning process of the model. To measure the faithfulness of the models, comprehensiveness and sufficiency were calculated. The comprehensiveness score is a measure of the change in the output probability of the originally predicted class after the significant tokens are eliminated. A higher comprehensiveness score indicates a more faithful interpretation. Sufficiency measures whether the important tokens alone are enough to sustain the prediction. It captures the degree to which the snippets within the extracted rationales are adequate for a model to make a prediction. A lower sufficiency indicates a more faithful model.
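Both faithfulness scores compare the model's probability for the originally predicted class on the full input against a modified input. A minimal sketch, assuming a hypothetical `predict_proba` that takes a token list and returns that probability:

```python
def comprehensiveness(predict_proba, tokens, rationale_idx):
    """Drop in probability when the rationale tokens are removed (higher is better)."""
    full = predict_proba(tokens)
    without = predict_proba([t for i, t in enumerate(tokens) if i not in rationale_idx])
    return full - without

def sufficiency(predict_proba, tokens, rationale_idx):
    """Drop in probability when only the rationale tokens are kept (lower is better)."""
    full = predict_proba(tokens)
    only = predict_proba([t for i, t in enumerate(tokens) if i in rationale_idx])
    return full - only

def toy_model(tokens):
    """Hypothetical classifier whose output depends on a single flagged word."""
    return 0.9 if "awful" in tokens else 0.1

tokens = ["those", "awful", "people"]
rationale = {1}  # position of the token marked as the rationale
comp = comprehensiveness(toy_model, tokens, rationale)
suff = sufficiency(toy_model, tokens, rationale)
print(comp, suff)
```

In this toy case, the rationale captures everything the model relies on, so removing it causes a large drop (high comprehensiveness) while keeping only it causes none (sufficiency of zero).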
Table 9 provides a summarized view of the explainability metrics calculated on all the models implemented. It can be observed that BERT + MLP was the best-performing model in terms of plausibility. The BERT + MLP model showed the best values of IOU F1, token F1, and AUPRC as compared to the other models. In terms of faithfulness, the BERT + ANN model showed the best results with the highest comprehensiveness score of 0.4199. The achieved results are an improvement compared to the base paper by Mathew et al. (2020). The BERT variants produced the most convincing interpretations for humans. BERT + ANN achieved a slightly higher comprehensiveness than BERT + MLP, due to the simpler structure of ANN compared with MLP. The same trend in sufficiency was observed in the base paper by Mathew et al. (2020).

Bias-Based Metrics
Hate speech detection models could make biased predictions for particular groups who are already the target of such abuse (Sap et al. 2019; Davidson, Bhattacharya, and Weber 2019). To measure these unintended model biases, the AUC-based metrics by Borkan et al. (2019) were used. We computed the subgroup AUC (area under the ROC curve), the BPSN (background positive, subgroup negative) AUC, and the BNSP (background negative, subgroup positive) AUC. The subgroup AUC for this use case is a measure of the ability of the model to separate toxic and normal comments. A higher value of subgroup AUC suggests that the model is better at differentiating between toxic and normal posts. The BPSN AUC is a measure of the false-positive rate of the model, while the BNSP AUC is a measure of the false-negative rate of the model. A higher value of BPSN AUC indicates a lower likelihood of the model giving false positives, while a higher value of BNSP AUC indicates a lower likelihood of the model giving false negatives. For this dataset, these metrics were calculated with respect to a community.
Table 10 provides a summarized view of the bias-based metrics calculated on all the models implemented. We can see that the bias-based metrics of the BERT variants were significantly better than those of the linear models. BERT + MLP had the highest values of subgroup AUC, BPSN AUC, and BNSP AUC with 0.8229, 0.7752, and 0.8077, respectively, followed by BERT + ANN with values of 0.7977, 0.7188, and 0.7391, respectively.
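The three bias metrics differ only in which examples supply the positive and negative scores for the AUC. The sketch below computes the AUC as the fraction of positive-negative pairs that are ranked correctly; the tuple layout, the function names, and the scores are hypothetical and serve only to show the grouping.

```python
def auc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(scores_pos) * len(scores_neg))

def bias_aucs(examples):
    """`examples` holds (score, is_toxic, in_subgroup) triples for one community."""
    def pick(toxic, subgroup):
        return [s for s, t, g in examples if t == toxic and g == subgroup]
    return {
        # Toxic vs. normal posts, both mentioning the subgroup.
        "subgroup_auc": auc(pick(True, True), pick(False, True)),
        # Background positive (toxic, outside subgroup) vs. subgroup negative.
        "bpsn_auc": auc(pick(True, False), pick(False, True)),
        # Subgroup positive vs. background negative (normal, outside subgroup).
        "bnsp_auc": auc(pick(True, True), pick(False, False)),
    }

# Hypothetical model scores: (score, is_toxic, in_subgroup).
examples = [
    (0.9, True, True), (0.6, True, True),
    (0.7, False, True), (0.2, False, True),
    (0.8, True, False), (0.5, True, False),
    (0.3, False, False), (0.1, False, False),
]
result = bias_aucs(examples)
print(result)
```

In this toy data, a normal post that mentions the subgroup receives a high score of 0.7, which lowers both the subgroup AUC and the BPSN AUC; this is the false-positive pattern the BPSN metric is meant to expose.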

Conclusions
In this research study, two datasets were taken to demonstrate hate speech detection using explainable artificial intelligence (XAI). Exploratory data analysis was performed on the datasets to uncover various patterns and insights, and various explainable models were trained on both datasets to extract useful interpretable results. The conclusions of the study are discussed in this section.

Conclusions of the Study on the Google Jigsaw Dataset
The Google Jigsaw dataset comprises user discussions from talk pages of English Wikipedia, and it was released by Google Jigsaw. We trained various existing interpretable models (decision tree, KNN, random forest, multinomial naïve Bayes, logistic regression, and LSTM) on this dataset. We found that LSTM outperformed the other models in terms of accuracy (97.6%) and recall (83%). The random forest model had the best performance in terms of precision (90%) and specificity (87%). KNN, logistic regression, and multinomial naïve Bayes had low evaluation scores as compared to the other models, but they performed very well in terms of accuracy with 90%, 97%, and 96%, respectively. Decision trees and random forest also performed well, with accuracies of 89% and 91%, respectively. It was observed that the LSTM model gave better overall performance in terms of accuracy, precision, recall, and macro F1-score as compared to the study of Risch et al. (2020).

Conclusions of the Study on the HateXplain Dataset
The HateXplain dataset comprises posts from Twitter and Gab and is annotated by human annotators. Several state-of-the-art models were tested on this dataset to evaluate several aspects of hate speech detection. These models incorporated explainability in various ways. LIME was used with interpretable models such as decision trees, random forest, logistic regression, and naïve Bayes to extract weights of words that contributed significantly to the model's decision making. Furthermore, variants of BERT were created to achieve the best performance. The best performance was observed for the BERT variants, BERT + ANN and BERT + MLP, as compared to the other models. BERT + ANN had a slightly better overall performance than BERT + MLP. For appropriate comparisons, the evaluation metrics were divided into three subsets, namely, performance metrics (accuracy, precision, recall, negative predictive value, specificity, and macro F1-score), bias-based metrics, and explainability metrics (plausibility and faithfulness), as in Mathew et al. (2020). LIME was used to demonstrate textual explanations of the black-box models on some of the data. We used explanation metrics based on the ERASER benchmark by DeYoung et al. for the human-annotated dataset HateXplain. These metrics suggested how faithful the results of these models were in identifying hateful comments as compared to other existing models. LIME is a surrogate model which is used to highlight contributing words or tokens that can play a major part in a comment being hateful or not hateful.
The accuracy scores of BERT + MLP and BERT + ANN were 93.67% and 93.55%, respectively, outperforming the simple BERT implementation by Mathew et al. (2020), which had an accuracy score of 69.8%. The prime reason behind this difference was the combination of BERT with neural network models such as MLP and ANN. Furthermore, our models were trained for 50 epochs, which took around 11.5 and 8.3 h on Google Colab Pro. The precision scores of BERT + ANN and BERT + MLP were 95.2% and 95%, while the recall scores were 93.1% and 93%, respectively. The macro F1-scores were calculated to be 94.14% and 93.99%, respectively.
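As a quick consistency check on the figures above, the reported F1 values agree with the harmonic mean of the reported precision and recall. The sketch below only reproduces that arithmetic; a macro F1-score is normally the unweighted mean of per-class F1 values, and the text does not state how the authors computed it, so the helper name and the single-pair formula are assumptions for illustration.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in the same units, here percent)."""
    return 2 * precision * recall / (precision + recall)


# Reported precision/recall pairs from the text, in percent.
bert_ann_f1 = f1_from_precision_recall(95.2, 93.1)
bert_mlp_f1 = f1_from_precision_recall(95.0, 93.0)

print(f"BERT + ANN F1: {bert_ann_f1:.2f}%")
print(f"BERT + MLP F1: {bert_mlp_f1:.2f}%")
```

Rounded to two decimals, these give 94.14% and 93.99%, matching the values quoted for BERT + ANN and BERT + MLP.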
In terms of bias-based metrics, the BERT variant models performed better in reducing the unintended model bias for all the bias metrics. We observed that the presence of community terms within the rationales was effective in reducing the unintended bias. The BERT + MLP model handled this bias much better than other models in terms of subgroup, BPSN (background positive, subgroup negative), and BNSP (background negative, subgroup positive) AUC with values of 0.8229, 0.7752, and 0.8077, respectively, representing an improvement over the simple BERT implementation (0.807, 0.745, and 0.763, respectively) by Mathew et al. (2020). Future research on hate speech should consider the impact of the model performance on individual communities to have a clear understanding.
Considering the explainability metrics using the ERASER benchmark by DeYoung et al. (2019), two main factors were evaluated: plausibility (defined by IOU F1, token F1, and AUPRC) and faithfulness (defined by comprehensiveness and sufficiency). For plausibility (IOU F1, token F1, and AUPRC), the best-performing models had values of 0.188, 0.507, and 0.8384 for BERT + ANN and 0.29, 0.529, and 0.8589 for BERT + MLP, compared to the base BERT model (0.222, 0.506, and 0.841, respectively) in the paper by Mathew et al. (2020). BERT + MLP performed better than the simple BERT implementation. Similarly, the faithfulness (comprehensiveness and sufficiency) values were found to be 0.419 and 0.0055 for BERT + ANN and 0.3574 and 0.003 for BERT + MLP. BERT + ANN performed better compared to the BERT implementation in the paper by Mathew et al. (2020) (0.436 and 0.008, respectively).
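To make the faithfulness and plausibility terms concrete, the following is a minimal per-example sketch based on the general ERASER definitions: comprehensiveness is the drop in predicted probability when the rationale tokens are removed, sufficiency is the drop when only the rationale tokens are kept, and IOU compares predicted and human rationale positions. The toy classifier `toy_predict_proba` and its `LEXICON` are purely hypothetical stand-ins for a trained model, and the benchmark itself aggregates these quantities over the whole dataset.

```python
from typing import Callable, List, Sequence, Set

# Hypothetical word list used only by the toy classifier below.
LEXICON = {"awful", "trash"}


def toy_predict_proba(tokens: Sequence[str]) -> float:
    """Toy stand-in for a model's probability of the 'hateful' class."""
    hits = sum(1 for t in tokens if t in LEXICON)
    return min(1.0, 0.1 + 0.4 * hits)


def comprehensiveness(
    predict: Callable[[Sequence[str]], float],
    tokens: List[str],
    rationale: Set[int],
) -> float:
    """Probability drop after removing rationale tokens (higher = more faithful)."""
    without = [t for i, t in enumerate(tokens) if i not in rationale]
    return predict(tokens) - predict(without)


def sufficiency(
    predict: Callable[[Sequence[str]], float],
    tokens: List[str],
    rationale: Set[int],
) -> float:
    """Probability drop when keeping only rationale tokens (lower = more faithful)."""
    only = [t for i, t in enumerate(tokens) if i in rationale]
    return predict(tokens) - predict(only)


def token_iou(predicted: Set[int], human: Set[int]) -> float:
    """Intersection over union of predicted and human rationale token positions."""
    union = predicted | human
    return len(predicted & human) / len(union) if union else 0.0
```

Under these definitions, a good rationale has high comprehensiveness and low sufficiency, which is the direction in which the values reported above should be read.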
Hence, it can be concluded that the variants of BERT used in the research work had superior performance to the base model; BERT + ANN performed best in terms of explainability, and BERT + MLP performed best overall compared to traditional models such as logistic regression, KNN, naïve Bayes, decision trees, and random forests.

Figure 1. Data cleaning.
1. Firstly, a regular expressions module was imported to help with data cleaning tasks. Regular expressions are sequences of characters that define search patterns for matching against other strings. Python has a "re" module that can help to find patterns and strings using regular expressions. Regular expressions can be used to remove or replace certain characters as part of data cleaning and preprocessing.
2. Any newline characters or additional spaces were removed.
3. Any URLs were also removed as they do not contribute to the learning process.
4. Similarly, any non-alphanumeric characters, including punctuation, were removed for the same reason, including the following characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. Only uppercase and lowercase letters along with digits 0-9 were kept.
5. Stopwords such as "the", "and", "then", and "if" were also removed as they do not add any additional information to the learning process. Python's NLTK library has stopwords in about 16 different languages. We imported English stopwords to remove them from our dataset.
6. The outputs of these tasks were stored in a separate column, resulting in a column of tokenized words.
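The cleaning steps listed above can be sketched with Python's `re` module as follows. This is an illustrative approximation rather than the authors' script: the small `STOPWORDS` set is a hypothetical stand-in for NLTK's English stopword list, the URL pattern is a simple assumption, and URLs are stripped before punctuation so that they are removed whole rather than broken into fragments.

```python
import re
from typing import List

# Hypothetical stand-in for NLTK's English stopword list (assumption).
STOPWORDS = {"the", "and", "then", "if", "a", "an", "is", "to", "of"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")   # simple URL matcher (assumption)
NON_ALNUM_PATTERN = re.compile(r"[^A-Za-z0-9\s]")    # keep only letters, digits, whitespace
WHITESPACE_PATTERN = re.compile(r"\s+")              # newlines, tabs, repeated spaces


def clean_text(text: str) -> List[str]:
    """Return the cleaned, tokenized form of one post."""
    text = URL_PATTERN.sub(" ", text)                 # remove URLs
    text = NON_ALNUM_PATTERN.sub(" ", text)           # remove punctuation and symbols
    text = WHITESPACE_PATTERN.sub(" ", text).strip()  # collapse newlines and extra spaces
    # Drop stopwords; the comparison is case-insensitive, original casing is kept.
    return [tok for tok in text.split(" ") if tok and tok.lower() not in STOPWORDS]
```

Applied to a column of raw posts, the returned token lists would form the separate tokenized column described in step 6.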

Figure 4. Result summary of all classification models on the Google Jigsaw dataset.

Figure 6. BERT + ANN model architecture.
The BERT + ANN and BERT + MLP models were trained for 50 epochs. As the number of epochs increased, the accuracy improved. The parameters used to find the number of training steps and number of warmup steps were as follows: number of epochs = 50, number of training steps = steps per epoch × number of epochs, and number of warmup steps = 0.1 × number of training steps. The accuracy obtained by the BERT + MLP model and BERT + ANN model was 93.67% and 93.55%, respectively, indicating that the gap in conventional evaluation metrics was minimal; however, in terms of the explainability metrics, BERT + ANN performed slightly better than the BERT + MLP model, as discussed later in this section.
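The step counts described above follow directly from the stated formulas. The sketch below shows the arithmetic only; the dataset size and batch size are hypothetical values chosen for illustration, and deriving steps per epoch as the number of batches per pass over the data is an assumption, since the text does not give these details.

```python
import math
from typing import Tuple


def schedule_steps(
    num_examples: int,
    batch_size: int,
    num_epochs: int = 50,
    warmup_fraction: float = 0.1,
) -> Tuple[int, int]:
    """Return (number of training steps, number of warmup steps)."""
    # Assumption: one step per batch, with a final partial batch counted as a step.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    num_train_steps = steps_per_epoch * num_epochs
    num_warmup_steps = int(warmup_fraction * num_train_steps)
    return num_train_steps, num_warmup_steps


# Hypothetical example: 16,000 training posts with a batch size of 32.
train_steps, warmup_steps = schedule_steps(num_examples=16_000, batch_size=32)
print(train_steps, warmup_steps)
```

With these illustrative numbers, 500 steps per epoch over 50 epochs gives 25,000 training steps, of which the first 2,500 are warmup steps.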
4199. The achieved results are an improvement compared to the base paper by Mathew et al. (2020). The BERT variants gave the most convincing interpretations for humans. BERT + ANN achieved a slightly higher comprehensiveness than BERT + MLP, owing to the simpler structure of the ANN compared with the MLP. The same trend in the sufficiency values was observed in the base paper by Mathew et al. (2020).

Figure 11. Result summary of all models on the HateXplain dataset.

Table 1. Summary of literature.

Table 4. LSTM model on the Google Jigsaw dataset.

Table 5. LSTM model on the Google Jigsaw dataset.

Table 8. Results of models on the HateXplain dataset.
