Interpretable Multi-Head Self-Attention Architecture for Sarcasm Detection in Social Media

With more than half the world's population online, social media plays a very important role in the lives of individuals and businesses alike. Social media enables businesses to advertise their products, build brand value, and reach out to their customers. To leverage these social media platforms, it is important for businesses to process customer feedback in the form of posts and tweets. Sentiment analysis is the process of identifying the emotion, either positive, negative, or neutral, associated with these social media texts. The presence of sarcasm in texts is the main hindrance to the performance of sentiment analysis. Sarcasm is a linguistic expression often used to communicate the opposite of what is said, usually something that is very unpleasant, with an intention to insult or ridicule. The inherent ambiguity in sarcastic expressions makes sarcasm detection very difficult. In this work, we focus on detecting sarcasm in textual conversations from various social networking platforms and online media. To this end, we develop an interpretable deep learning model using multi-head self-attention and gated recurrent units. The multi-head self-attention module aids in identifying crucial sarcastic cue-words from the input, and the recurrent units learn long-range dependencies between these cue-words to better classify the input text. We show the effectiveness of our approach by achieving state-of-the-art results on multiple datasets from social networking platforms and online media. Models trained using our proposed approach are easily interpretable and enable identifying sarcastic cues in the input text that contribute to the final classification score. We visualize the learned attention weights on a few sample input texts to showcase the effectiveness and interpretability of our model.


Introduction
Sarcasm is a rhetorical way of expressing dislike or negative emotions using exaggerated language constructs. It is an assortment of mockery and false politeness to intensify hostility without explicitly doing so. In face-to-face conversation, sarcasm can be identified effortlessly using facial expressions, gestures, and the tone of the speaker. However, recognizing sarcasm in textual communication is not a trivial task, as none of these cues are readily available. With the explosion of internet usage, sarcasm detection in online communications from social networking platforms, discussion forums, and e-commerce websites has become crucial for opinion mining, sentiment analysis, and identifying cyberbullies and online trolls. The topic of sarcasm has received great interest in fields ranging from neuropsychology [1] to linguistics [2], but developing computational models for automatic detection of sarcasm is still in its nascent phase. Earlier works on sarcasm detection in text use lexical (content) and pragmatic (context) cues [3] such as interjections, punctuation, and sentimental shifts, which are major indicators of sarcasm [4]. In these works, the features are hand-crafted and cannot generalize in the presence of the informal language and figurative slang that are widely used in online conversations.
With the advent of deep learning, recent works [5][6][7][8][9] leverage neural networks to learn both lexical and contextual features, eliminating the need for hand-crafted features. In these works, word embeddings are incorporated to train deep convolutional, recurrent, or attention-based neural networks to achieve state-of-the-art results on multiple large-scale datasets. While deep learning-based approaches achieve impressive performance, they lack interpretability. In this work, we focus on the interpretability of the model along with its high performance. The main contributions of our work are as follows:
• Propose a novel, interpretable model for sarcasm detection using self-attention.
• Achieve state-of-the-art results on diverse datasets and exhibit the effectiveness of our model with extensive experimentation and ablation studies.
• Exhibit the interpretability of our model by analyzing the learned attention maps.
This paper is organized as follows: In Sections 2 and 3, we briefly mention the related works and describe our proposed multi-head self-attention architecture. Section 4 includes details on model implementation, experiments, datasets, and evaluation metrics. Performance and attention analysis of our model are described in Sections 5 and 6, followed by the conclusion of this work.

Related Work
Sarcasm has been studied for many decades in the social sciences, yet developing methods to automatically identify sarcasm in texts is a fairly new field of study. The state-of-the-art automated sarcasm detection models can be broadly segregated into content- and context-based models.
In content-based approaches, lexical and linguistic cues and syntactic patterns are used to train classifiers for sarcasm detection. Carvalho et al. [10] and González-Ibánez et al. [11] use linguistic features such as interjections, emoticons, and quotation marks. Tsur et al. [12] and Davidov et al. [13] use syntactic patterns and lexical cues associated with sarcasm. The use of a positive utterance in a negative context is used as a reliable feature to detect sarcasm by Riloff et al. [14]. Linguistic features such as implicit and explicit context incongruity are used by Joshi et al. [4]. In these works, only the input text is used to detect sarcasm, without any context information.
Context-based approaches have increased in popularity in the recent past with the emergence of various online social networking platforms. As texts from these websites are prone to grammatical errors and extensive usage of slang, using context information helps better identify sarcasm. Wallace et al. [15] and Poria et al. [16] detected sarcasm using sentiment and emotional information from the input text as contextual information. While Amir et al. [17] and Hazarika et al. [18] use personality features of the user as context, Rajadesingan et al. [19] and Zhang et al. [20] use historical posts of the user to incorporate sarcastic tendencies. We show that context information, when available, helps improve the performance of the model but is not essential for sarcasm detection.
Existing works by Wallace et al. [15], Ptáček et al. [21], Wang et al. [22], and Joshi et al. [23] use handcrafted features such as Bag of Words (BoW), Parts of Speech (POS), and sentiment/emotions to train their classifiers. Other works by Liu et al. [9], Poria et al. [16], Amir et al. [17], Zhang et al. [20], Ghosh and Veale [24], and Vaswani et al. [25] use deep learning to learn meaningful features and classify them. Methods that use handcrafted features are easily interpretable but lack in performance. On the other hand, deep learning-based methods achieve high performance but lack interpretability.
In our work, we propose a deep learning-based architecture for sarcasm detection, which leverages self-attention to enable the interpretability of the model while achieving state-of-the-art performance on various datasets.

Proposed Approach
Our proposed approach consists of five components: Data Pre-Processing, Multi-Head Self-Attention, Gated Recurrent Units (GRU), Classification, and Model Interpretability. The architecture of our sarcasm detection model is shown in Figure 1. Data pre-processing involves converting input text to word embeddings, which is required for training a deep learning model. To this end, we first apply a standard tokenizer (from [26]) to convert a sentence to a sequence of tokens, and then employ pre-trained language models to convert the tokens to word embeddings. These embeddings form the input to our multi-head self-attention module, which identifies words in the input text that provide crucial cues for sarcasm. In the next step, the GRU layer aids in learning long-distance relationships among these highlighted words and outputs a single feature vector that encodes the entire sequence. Finally, a fully-connected layer with sigmoid activation is used to obtain the final classification score.

Data Pre-Processing
Word embeddings range from the clustering of words based on the local context to embeddings based on a global context that considers the association between a word and every other word in a sentence. The most popular ones that rely on local context are Continuous Bag of Words (CBOW), Skip Grams [27], and Word2Vec [28]. Other predictive models that capture global context are Global Vectors for word representation (GloVe) [29], FastText [30], Embeddings from Language Models (ELMO) [31], and Bidirectional Encoder Representations from Transformers (BERT) [32]. In our work, we employ word embeddings that capture global context, as we believe this is essential for detecting sarcasm. We show the results of the proposed approach using multiple word embeddings, including BERT, ELMO, FastText, and GloVe.

Multi-Head Self-Attention
Given a sentence $S$, we apply a standard tokenizer and use pre-trained models to obtain $D$-dimensional embeddings for the individual words in the sentence. These embeddings, $S = \{e_1, e_2, \ldots, e_N\}$, $S \in \mathbb{R}^{N \times D}$, form the input to our model. To detect sarcasm in sentence $S$, it is crucial to identify specific words that provide essential cues such as sarcastic connotations and negative emotions. The importance of these cue-words depends on multiple factors based on different contexts. In our proposed model, we leverage multi-head self-attention to identify these cue-words from the input text.
Attention is a mechanism to discover patterns in the input that are crucial for solving the given task. In deep learning, self-attention [25] is an attention mechanism for sequences, which helps learn the task-specific relationships between different elements of a given sequence to produce a better sequence representation. In the self-attention module, three linear projections of the given input sequence are generated: Key ($K$), Value ($V$), and Query ($Q$), where $K, Q, V \in \mathbb{R}^{N \times D}$. The attention map is computed based on the similarity between $K$ and $Q$, and the output of this module, $A \in \mathbb{R}^{N \times D}$, is the product of $V$ and the learned softmax attention over the scaled dot-product $QK^T$, as shown in Equation (1):
\[ A = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{D}}\right) V \tag{1} \]
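The scaled dot-product self-attention described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the random projection matrices `W_q`, `W_k`, `W_v`, the helper names, and the toy sizes are assumptions chosen only to show the shapes involved.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention(S, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    S: (N, D) word embeddings; W_q, W_k, W_v: (D, D) projection matrices
    (hypothetical stand-ins for the learned linear projections).
    Returns the output A (N, D) and the attention map (N, N).
    """
    Q, K, V = S @ W_q, S @ W_k, S @ W_v
    D = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(D), axis=-1)  # each row sums to 1
    return attn @ V, attn


rng = np.random.default_rng(0)
N, D = 6, 16  # toy sizes; the paper uses D = 768 with BERT embeddings
S = rng.normal(size=(N, D))
W_q, W_k, W_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))
A, attn = self_attention(S, W_q, W_k, W_v)
print(A.shape, attn.shape)
```

Row `i` of `attn` holds the weights that word `i` places on every word in the sentence, which is the quantity later visualized for interpretability.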
In multi-head self-attention, multiple copies of the self-attention module are used in parallel. Each head captures different relationships between the words in the input text and identifies those keywords that aid in classification. In our model, we use a series of multi-head self-attention layers (#L) with multiple heads (#H) in each layer.
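As a rough sketch of how several heads and layers fit together, the NumPy example below splits the projected features across #H heads and applies #L layers in series. The per-head dimension D/#H and the output projection `W_o` follow the convention of [25] and are assumptions here, as is the omission of residual connections and normalization; the paper does not spell out these details.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def multi_head_self_attention(S, W_q, W_k, W_v, W_o, num_heads):
    """S: (N, D); W_*: (D, D). Returns output (N, D) and maps (H, N, N)."""
    N, D = S.shape
    assert D % num_heads == 0, "D must be divisible by the number of heads"
    d_head = D // num_heads

    def split(X):
        # (N, D) -> (H, N, d_head): each head sees its own slice of features
        return X.reshape(N, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(S @ W_q), split(S @ W_k), split(S @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (H, N, N)
    attn = softmax(scores, axis=-1)
    heads = attn @ V                                      # (H, N, d_head)
    concat = heads.transpose(1, 0, 2).reshape(N, D)       # (N, D)
    return concat @ W_o, attn


def stack_layers(S, layers, num_heads):
    """Apply the attention layers in series; collect per-layer attention maps."""
    x, maps = S, []
    for W_q, W_k, W_v, W_o in layers:
        x, attn = multi_head_self_attention(x, W_q, W_k, W_v, W_o, num_heads)
        maps.append(attn)
    return x, np.stack(maps)                              # maps: (L, H, N, N)


rng = np.random.default_rng(0)
N, D, H, L = 5, 16, 8, 3  # toy D; #H = 8 and #L = 3 follow the paper
S = rng.normal(size=(N, D))
layers = [tuple(rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(4))
          for _ in range(L)]
out, maps = stack_layers(S, layers, H)
print(out.shape, maps.shape)
```

Keeping the maps in an `(L, H, N, N)` array makes it straightforward to inspect which words each head in each layer attends to.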

Gated Recurrent Units
Self-attention finds the words in the text that are important in detecting sarcasm. These words can be close to each other or farther apart in the input text. To learn long-distance relationships between these words, we use GRUs. These units are an improvement over standard recurrent neural networks and are designed to dynamically remember and forget the information flow using Reset ($r_t$) and Update ($z_t$) gates to solve the vanishing gradient problem.
In our model, we use a single layer of bi-directional GRU to process the sequence $A$, as these units make use of the contextual information from both directions. Given the input sequence $A \in \mathbb{R}^{N \times D}$, the GRU computes hidden states $H = \{h_1, h_2, \ldots, h_N\}$, $H \in \mathbb{R}^{N \times d}$, one for every element in the sequence, from the reset gate $r_t$, the update gate $z_t$, and the candidate state $\tilde{h}_t$. Here $\sigma(\cdot)$ is the element-wise sigmoid function used in the gates, $W$, $U$, and $b$ are the trainable weights and biases, and $r_t, z_t, h_t, \tilde{h}_t \in \mathbb{R}^d$, where $d$ is the size of the hidden dimension. We consider the final hidden state, $h_N$, which encodes all the information in the sequence, as the output of this module.
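For reference, one common GRU formulation that is consistent with the symbols used above ($W$, $U$, $b$, $r_t$, $z_t$, $\tilde{h}_t$) is sketched below. It is an assumption rather than a statement of the authors' exact equations: $A_t$ denotes the $t$-th row of $A$, the subscripts on $W$, $U$, and $b$ are illustrative labels, $\odot$ is element-wise multiplication, and some references swap the roles of $z_t$ and $1 - z_t$ in the last line.

```latex
\begin{align*}
r_t &= \sigma(W_r A_t + U_r h_{t-1} + b_r) \\
z_t &= \sigma(W_z A_t + U_z h_{t-1} + b_z) \\
\tilde{h}_t &= \tanh\big(W_h A_t + U_h (r_t \odot h_{t-1}) + b_h\big) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{align*}
```

Under this convention, an update gate close to 1 carries the previous state forward unchanged, which is how information about a cue-word can persist across many time-steps.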

Classification
A single fully-connected feed-forward layer with sigmoid activation is used to compute the final output. The input to this layer is the feature vector $h_N$ from the GRU module, and the output is a probability score $y \in [0, 1]$, computed as
\[ y = \sigma(W^T h_N + b), \]
where $W \in \mathbb{R}^{d \times 1}$ are the weights of this layer and $b$ is the bias term. Binary Cross Entropy (BCE) loss between the predicted output $y$ and the ground-truth label $\hat{y}$ is used to train the model.
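The classification step and the loss can be worked through on a toy example. The sketch below uses plain Python; the three-dimensional feature vector, weights, and bias are made-up values for illustration, and the clamping constant `eps` is an implementation convenience rather than part of the method.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def classify(h_N, W, b):
    """Probability of sarcasm from the final GRU state h_N (length-d list)."""
    logit = sum(w * h for w, h in zip(W, h_N)) + b
    return sigmoid(logit)


def bce_loss(y, y_true, eps=1e-12):
    """Binary cross-entropy between prediction y and label y_true in {0, 1}."""
    y = min(max(y, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y_true * math.log(y) + (1 - y_true) * math.log(1 - y))


h_N = [0.5, -1.0, 2.0]  # toy feature vector (d = 3)
W = [0.2, 0.4, 0.1]     # toy weights
b = 0.1                 # toy bias
y = classify(h_N, W, b)
print(round(y, 4), round(bce_loss(y, 1), 4))
```

With these values the logit is approximately zero, so the predicted probability sits at the decision boundary and the loss for a sarcastic label is close to log 2; a more confident correct prediction yields a smaller loss.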

Model Interpretability
Developing models that can explain their predictions is crucial to building trust and faith in deep learning, while enabling a wide range of applications with machine intelligence at its backbone. Existing deep learning network architectures such as convolutional and recurrent neural networks are not inherently interpretable and require additional visualization techniques [33,34]. To avoid this, we employ inherently interpretable self-attention that allows the identification of elements in the input that are crucial for a given task.

Datasets
Dataset details, presented in Table 1, include the data source and the sample counts in the train and test splits. The datasets are sourced from varied online platforms, including social networks and discussion forums.

Twitter, 2013
In this dataset [14], the tweets that contain sarcasm are identified and labeled by the human annotators solely based on the contents of the tweets. These tweets do not depend on prior conversational context. Tweets with no sarcasm or those that required prior conversational context are labeled as non-sarcastic. As a pre-processing step, URLs are removed from the tweets and all mentions are replaced with @user.
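The pre-processing step described above can be sketched with two regular expressions. The exact patterns used by the authors are not given, so `URL_RE`, `MENTION_RE`, and the whitespace clean-up below are assumptions.

```python
import re

# Hypothetical patterns: anything starting with http(s):// or www. is a URL,
# and @ followed by word characters is a user mention.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
MENTION_RE = re.compile(r"@\w+")


def preprocess_tweet(text):
    """Remove URLs, replace user mentions with @user, and tidy whitespace."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("@user", text)
    return re.sub(r"\s+", " ", text).strip()


print(preprocess_tweet("@jane_doe Oh great, another Monday! https://t.co/abc123"))
```

URLs are removed before mentions are replaced so that an `@` inside a link is never rewritten.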

Dialogues, 2016
This Sarcasm Corpus V2 Dialogues dataset [35] is part of the Internet Argument Corpus [36], which includes annotated quote-response pairs for sarcasm detection. General sarcasm, hyperbole, and rhetorical questions are the three categories in this dataset. In these quote-response pairs, a quote is a dialogic parent to the response. Therefore, a response post can be mapped to the same quote post or to a post earlier in the thread. Here, the quoted text is used as context for sarcasm detection.

Twitter, 2017
In this dataset [5], tweets are collected using a Twitter bot named @onlinesarcasm. This dataset contains not only tweets and replies to these tweets but also the mood of the user at the time of tweeting. The tweets/re-tweets of the users are the content, and the replies to the tweets are the context. As with the Twitter 2013 dataset, tweets in this dataset are pre-processed by removing URLs and replacing mentions.

Reddit, 2018
The Self-Annotated Reddit Corpus, SARC 2.0 [37], contains comments from Reddit forums. The scraped sarcastic comments are self-annotated by their authors using the /s token to indicate sarcastic intent. In our experiments, we use only the original comment, without any parent or child comments. The "Main Balanced" and "Political" variants of the dataset are used in our experiments; the latter consists of comments only from the political subreddit.

Headlines, 2019
This news headlines dataset [38] is collected from two news websites: The Onion and HuffPost. The Onion publishes sarcastic versions of current events, whereas HuffPost publishes real news headlines. Headlines are used as content and the news article is used as context.

Implementation Details
We implement our model in PyTorch [39], a deep-learning framework in Python. To tokenize the input text and extract word embeddings, we use publicly available resources [26]. Specifically, we use the tokenizer and pre-trained weights from the "bert-base-uncased" model to convert words to tokens and then tokens to word embeddings. The pre-trained BERT model is trained with inputs of maximum length N = 512, truncating longer inputs and padding shorter inputs with the special token <pad>. To extract the word embeddings, we freeze the weights of this pre-trained BERT model and truncate or pad the inputs (with the <pad> token) based on their length. We take the 768-dimensional output of the final hidden layer of the BERT model for each word in the input as its word embedding. These embeddings are passed through a series of #L multi-head self-attention layers with #H heads in each layer. The output from the final self-attention layer is passed through a single bi-directional GRU layer with hidden dimension d = 512. The 512-dimensional output feature vector from the GRU layer is passed through the fully-connected layer to yield a 1-dimensional output. A sigmoid activation is applied to this output, and BCE loss is computed between the ground truth and the predicted probability score. The trainable parameters in our model are the weights of the multi-head attention, GRU, and fully-connected layers; the BERT weights remain frozen during training. We train our model, which has approximately 13 million parameters, with the Adam optimizer using a learning rate of 1e-4, a batch size of 64, and a dropout of 0.2. We use one NVIDIA Pascal Titan-X with 16 GB of memory for all our experiments. We set #H = 8 and #L = 3 in all our experiments for all the datasets.
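A minimal PyTorch sketch of the pipeline described above follows. It is an illustration under stated assumptions, not the authors' code: the class name `SarcasmDetector` is hypothetical, the GRU uses 256 units per direction so that the concatenated forward and backward states form a 512-dimensional vector, dropout is applied inside the attention layers and before the classifier, no residual connections are used, and random tensors stand in for the frozen BERT embeddings.

```python
import torch
import torch.nn as nn


class SarcasmDetector(nn.Module):
    """Stacked multi-head self-attention -> bi-GRU -> sigmoid classifier."""

    def __init__(self, embed_dim=768, num_heads=8, num_layers=3,
                 gru_hidden=256, dropout=0.2):
        super().__init__()
        self.attn_layers = nn.ModuleList([
            nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout,
                                  batch_first=True)
            for _ in range(num_layers)
        ])
        # Assumption: 256 units per direction, so the concatenated
        # forward/backward state is a 512-dimensional feature vector.
        self.gru = nn.GRU(embed_dim, gru_hidden, batch_first=True,
                          bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(2 * gru_hidden, 1)

    def forward(self, embeddings, padding_mask=None):
        # embeddings: (batch, N, embed_dim), e.g. frozen BERT outputs
        # padding_mask: (batch, N) bool, True where the token is padding
        x, attn_maps = embeddings, []
        for attn in self.attn_layers:
            x, weights = attn(x, x, x, key_padding_mask=padding_mask)
            attn_maps.append(weights)  # (batch, N, N), averaged over heads
        _, h_n = self.gru(x)           # h_n: (2, batch, gru_hidden)
        features = torch.cat([h_n[0], h_n[1]], dim=-1)
        prob = torch.sigmoid(self.fc(self.dropout(features))).squeeze(-1)
        return prob, attn_maps


model = SarcasmDetector()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.BCELoss()

# One training step on random stand-in data (batch of 4, 10 tokens each).
embeddings = torch.randn(4, 10, 768)
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
prob, attn_maps = model(embeddings)
loss = criterion(prob, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(prob.shape, len(attn_maps), attn_maps[0].shape)
```

Returning the attention maps alongside the prediction keeps the interpretability analysis a by-product of the forward pass. In this sketch the GRU also reads padded positions; handling them (for example with packed sequences) is left out for brevity.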

Evaluation Metrics
We pose sarcasm detection as a classification problem and use Precision, Recall, F1-score, and Accuracy as evaluation metrics to test the performance of the trained models. Precision is the ratio of the number of correctly predicted sarcastic sentences to the total number of predicted sarcastic sentences. Recall is the ratio of correctly predicted sarcastic sentences to the actual number of sarcastic sentences in the ground truth. F1-score is the harmonic mean of precision and recall. We use a threshold of 0.5 on the predictions from the model to compute these scores. Apart from these standard metrics, we also compute the Area Under the ROC Curve (AUC score), which is threshold independent.
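The metrics defined above can be computed directly from predicted probabilities, as in the plain-Python sketch below. The function names and toy data are illustrative; treating a score of exactly 0.5 as sarcastic is an assumption, and the AUC is computed through its rank interpretation (the probability that a random sarcastic sample scores higher than a random non-sarcastic one).

```python
def classification_metrics(y_true, y_prob, threshold=0.5):
    """Precision, recall, F1 and accuracy with sarcastic (1) as the positive class."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


def roc_auc(y_true, y_prob):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.6, 0.4, 0.7, 0.2, 0.1]
metrics = classification_metrics(y_true, y_prob)
auc = roc_auc(y_true, y_prob)
print(metrics, auc)
```

In this toy example two of the three sarcastic samples are recovered and one non-sarcastic sample is flagged, so precision, recall, and F1 all equal 2/3, while the AUC of 7/9 does not depend on the threshold.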

Results
In this section, we present the results of our experiments on multiple publicly available datasets. The results on Twitter datasets are presented in Tables 2 and 3. In the experiments with the Ghosh and Veale [5] dataset, we do not use any additional information about the user or the context tweets. Hence, for a fair comparison, we present the results on this dataset under the TTEA (Target Tweet Excluding Addressee) configuration. As evident from these tables, our multi-head self-attention model outperforms previous methods by a considerable margin. In Table 4, we present the results on the Reddit SARC 2.0 dataset, which is divided into two subsets: Main and Political. In both datasets, our proposed approach outperforms previous methods.
Apart from Twitter and Reddit data, we also experimented with data from other data sources such as Political Dialogues [35] and News Headlines [38]. In Table 5, we present the results on the Sarcasm Corpus V2 Dialogues dataset and in Table 6, we present the results on the News Headlines dataset. In both datasets, we see considerable improvements.

Ablation Study
The Sarcasm Corpus V2 Dialogues dataset [35] is used for the following ablations.

Ablation 1
We vary the number of self-attention layers and fix the number of heads per layer (#H = 8). From the results of this experiment, presented in Table 7, we observe that as the number of self-attention layers increases (#L = 0, 1, 3, 5), the performance gain from additional layers saturates. Due to memory constraints, it is not feasible to have more than five self-attention layers in the model. Nevertheless, these results show that the proposed multi-head self-attention model achieves a 2% improvement over the baseline model, in which only a single GRU layer is used without any self-attention layers.

[Results table fragment, column headers not recoverable: [37] 75.0, -, 76.0, -; ELMo-BiLSTM [6] 72.0, -, 78.0, -; ELMo-BiLSTM FULL [6] 76]

Ablation 2
We vary the number of heads per layer with a fixed number of self-attention layers (#L = 3). The results of these experiments are presented in Table 8. We observe that the performance of the model also increases with the number of heads per self-attention layer.

Ablation 3
To further show the strength of our proposed network architecture, we train our model with different word embeddings, namely Glove-6B, Glove-840B, ELMO, and FastText, and present the results in Table 9. These results show that the performance of our model is not due to the choice of word embeddings. With #H = 8 and #L = 3, the maximum possible batch size to train the model on one GPU with 16 GB of memory is 64. We set #H = 8 and #L = 3 in all our experiments for all the datasets.

Model Interpretability
Attention maps from the individual heads of the self-attention layers provide the learned attention weights for each time-step in the input. In our case, each time-step is a word, and we visualize the per-word attention weights for sample sentences with and without sarcasm from the SARC 2.0 Main dataset. The model we used for this analysis has five attention layers with eight heads per layer. Figures 2 and 3 show the attention analysis [42] for two sample sentences, with and without sarcasm, respectively. Each column in these figures corresponds to a single attention layer, and the attention weights between words in each head are represented using colored edges. The darkness of an edge indicates the strength of the attention weight. CLS and SEP are the classification and separator tokens from BERT. Figures 4 and 5 provide another visualization, a bird's-eye view of attention across all the heads and layers in the model. Here, rows correspond to the five attention layers and columns correspond to the eight heads in each layer. From both visualizations, we observe that the words receiving the most attention vary between different heads in each layer and also across layers.

Attention Analysis
For a sentence with sarcasm, Figure 2 shows that certain words receive more attention than others. For instance, words such as 'just', 'again', 'totally', and '!' have darker edges connecting them with every other word in the sentence. These are the words in the sentence that hint at sarcasm and, as expected, they receive higher attention than the others. Note that each cue word is attended to by a different head in the first three layers of self-attention. In the final two layers, we observe that the attention is spread out over every word in the sentence, indicating the redundancy of these layers in the model. The sample sentence shown in Figure 3 has no sarcasm, and thus no word is highlighted by any head in any layer. In Figure 6, we visualize the distribution of attention over the words in a sentence for six sample sentences. The attention weight for a word is computed by first considering the maximum attention it receives across layers and then averaging the weights across the multiple heads in the layer. Finally, the weights for a word are averaged over all the words in the sentence. The stronger the highlight for a word, the higher the attention weight placed on it by the model while classifying the sentence. Words from the sarcastic sentences with higher weights show that the model can detect sarcastic cues in the sentence. Examples are the words "totally", "first", and "ever" from the first sentence and "even", "until", and "already" from the third sentence. These are the words that exhibit sarcasm in the sentences, which the model can successfully identify. In all the samples that are classified as non-sarcastic, the weights for the individual words are very low in comparison to the cue-words from the sarcastic sentences. The probability of sarcasm predicted by our model for each of the sentences is shown on the right and their respective scores on the left column in Figure 6. Our model predicts high scores for sarcastic sentences and low scores for non-sarcastic sentences.
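The per-word aggregation described above admits more than one reading; the NumPy sketch below shows one plausible interpretation. It assumes the attention maps are stored as an array of shape (layers, heads, N, N), where entry `[l, h, i, j]` is the weight that word `i` places on word `j`; the function name, the random stand-in maps, and the example sentence are hypothetical.

```python
import numpy as np


def word_attention_scores(attn):
    """attn: (L, H, N, N) with attn[l, h, i, j] = weight from word i to word j.

    Returns one score per attended word j (length N).
    """
    per_head = attn.max(axis=0)       # max over layers           -> (H, N, N)
    per_pair = per_head.mean(axis=0)  # mean over heads           -> (N, N)
    return per_pair.mean(axis=0)      # mean over attending words -> (N,)


rng = np.random.default_rng(0)
L, H, N = 5, 8, 4  # five layers and eight heads, as in the analysed model
raw = rng.random(size=(L, H, N, N))
attn = raw / raw.sum(axis=-1, keepdims=True)  # rows sum to 1, like softmax
scores = word_attention_scores(attn)
words = ["this", "is", "totally", "fine"]     # hypothetical example sentence
for word, score in zip(words, scores):
    print(f"{word:>8s}  {score:.3f}")
```

With a trained model, the random `attn` array would be replaced by the stacked attention maps from a forward pass, and the resulting scores could drive the word highlighting.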

Failure Cases
In this section, we provide a brief analysis of the failure cases. We present a few samples that our model fails to classify correctly in Figure 7. From the analysis of such failure cases, we observe that our model mostly finds it difficult to classify interrogative sentences, which usually end with a "?". With no context information, we believe classifying these correctly is a challenging task not only for deep learning models but also for human annotators.
Figure 7. Sample sentences that our model fails to classify correctly. The top row shows Sarcastic sentences with predicted probability of Sarcasm less than 0.5 and the bottom row shows a Non-Sarcastic sentence with probability greater than 0.5. It can be observed from these examples that our model has difficulty in detecting sarcasm when the input sentences are questions.
Apart from these interrogative sentences, we also show a sample Non-Sarcastic sentence that our model incorrectly classifies as Sarcastic. Consider the third sample in the Non-Sarcastic part of Figure 7: the sentence ends with an exclamation mark "!", illustrating a sample that is hard to classify correctly without prior knowledge.

Conclusions
In this work, we propose a novel multi-head self-attention-based neural network architecture to detect sarcasm in a given sentence. Our proposed approach has five components: data pre-processing, multi-head self-attention module, gated recurrent unit module, classification, and model interpretability. Multi-head self-attention is used to highlight the parts of the sentence that provide crucial cues for sarcasm detection. GRUs aid in learning long-distance relationships among these highlighted words in the sentence. The output from this layer is passed through a fully-connected classification layer to obtain the final classification score. The experiments were conducted on multiple datasets from varied data sources and show significant improvement over the state-of-the-art models by all evaluation metrics. The results from ablation studies and analysis of the trained model are presented to show the importance of different components of our model. We analyze the learned attention weights to interpret our trained model and show that it can indeed identify words in the input text that provide cues for sarcasm.