Automatic Sarcasm Detection: Systematic Literature Review

Sarcasm is an integral part of human language and culture. Naturally, it has garnered great interest from researchers from varied fields of study, including Artificial Intelligence, especially Natural Language Processing. Automatic sarcasm detection has become an increasingly popular topic in the past decade. The research conducted in this paper presents, through a systematic literature review, the evolution of the automatic sarcasm detection task from its inception in 2010 to the present day. No such work has been conducted thus far, and it is essential to establish the progress that researchers have made when tackling this task and, moving forward, what the trends are. This study finds that multi-modal approaches and transformer-based architectures have become increasingly popular in recent years. Additionally, this paper presents a critique of the work carried out so far and proposes future directions of research in the field.


Introduction
Natural Language Processing (NLP) has been one of the most active fields of AI research in the past decade. Great strides have been made to bring machines closer to a human-level understanding of language and, in many instances, the results have been groundbreaking. One area that has proven particularly fruitful is sentiment analysis, the machine's ability to correctly identify the sentiment polarity of a statement or utterance [1]. Sentiment analysis is popular in both academia and industry, where it helps model trends and business strategies alike.
However, researchers have always encountered difficulties while performing sentiment analysis when figurative language forms, such as irony and sarcasm, are present. These language instances are almost always used to convey the opposite meaning of what is said. The object of this paper, sarcasm, is defined by some as "a form of irony that is used in a hurtful or critical way" [2]. Others define it "as a subtype of verbal irony distinguished by the expression of a negative and critical attitude toward a recognizable victim or group of victims" [3]. Both definitions state that sarcasm requires the presence of a victim, toward which a negative sentiment, hurtful or critical, is addressed. Additionally, sarcasm is often described as a form of irony. Irony may be included in the domain of pragmatics, which relates to the role of context in conveying meaning.
Sarcasm can make data noisy. For example, "I love traffic!" is clearly a sarcastic sentence that expresses a negative sentiment. However, if a model is not fitted to account for sarcasm, it could deem that this sentence expresses a positive sentiment. To counter this noise, researchers developed models that can correctly identify the presence of sarcasm in target utterances. As such, automatic sarcasm detection has shaped a sub-area of sentiment analysis and NLP research. This paper conducts a systematic literature review of this sub-area of research, automatic sarcasm detection, and presents its findings in the following sections. We consider that a systematic literature review will serve all researchers in the Natural Language Processing field and beyond. They will be able to quickly assess the state of the art in the sarcasm detection research field and to more easily select approaches to tackle this task and others using the findings highlighted in this paper.
The paper is structured as follows: the next section presents the details of how the review was conducted and summarizes its results. The third section presents the results of the review in depth. The discussion section offers a critique of the literature. The conclusions and future research section postulates possible directions going forward, summarizes the paper and presents final thoughts.

Materials and Methods
The research method used in this paper is a systematic literature review (SLR) [4]. An SLR enables researchers to review current trends and future directions of study, as it allows multiple studies to be reviewed in a single grouping. To conduct this SLR, the PRISMA Guidelines were followed (https://prisma-statement.org/ (accessed on 10 August 2022)), which aid authors in improving the reporting of meta-analyses and systematic reviews.
The research questions of this review are:
1. What are the main areas of improvement that automatic sarcasm detection has seen since its inception?
2. What are the trends that have shaped the automated sarcasm detection task in the past decade?
Studying the existing literature on both sarcasm as a form of figurative language and sentiment analysis as a subarea of NLP, applying the knowledge in the research field and using the research questions as a starting point, the following search terms have been identified: automatic, sarcasm, irony, figurative language, detection, recognition, NLP, machine learning, deep learning, sentiment analysis and artificial intelligence (AI).
These search terms were used to query the following scientific databases: Web of Science, Science Direct, Google Scholar, Scopus and IEEE Xplore. These databases were selected because they are among the largest available and are guaranteed to contain the noteworthy papers published on the subject of interest. Each database was searched separately between 20 May and 5 July 2021.
There were several inclusion and exclusion criteria used in this study. First, to determine the year of publishing, the starting point was set by the paper largely credited as the first to postulate and tackle the automatic sarcasm detection task. This paper was first published in 2010 [5]. There were prior attempts to detect sarcasm in text; however, they are often disregarded as being part of this area of research [6-8]. As such, this study only considered papers published from 2010 onward. Another screening criterion was domain. This study included only papers from the Computer Science domain. Therefore, papers that tackled sarcasm detection in an automatic, AI-related approach were considered. Papers from other domains, like Neuroscience, were not. The final criterion was the language the papers were published in, with only documents available in English being included. This criterion was set to avoid the complexity and confusion of translation. The papers that passed all these screening steps were then reviewed to verify their eligibility. First, the abstracts of the papers were analyzed to quickly identify if and how the papers tackled the automated sarcasm detection task. Then, the qualifying full papers were reviewed. This process was essential to ensure that only those papers that had sarcasm detection as their main scope were studied. This step eliminated those studies that used sarcasm detection alongside other topics. All the screening steps were carried out manually, and no automation tool was used during the conduct of this study.
After searching the research databases presented above with the selected search strings, 271 articles were identified. Of these, 142 were deemed irrelevant, with another 31 articles being excluded because they were not in English or their type or title was not aligned with the subject matter. After this step, the abstracts of the remaining papers were studied and 27 were excluded because they did not align with the inclusion criteria of the study. Lastly, 11 papers were excluded following the full-text review, as it was observed that sarcasm detection was not a central point of the research presented. After applying these steps, 60 papers were identified. The inclusion and exclusion criteria by which the papers were selected are presented in Figure 1. The earliest retrieved paper that is credited with tackling the automatic sarcasm detection task was published in 2010, and it was the only one published in that year. Interest in this area of research has steadily grown in the past decade, with more papers being published every year. This interest has led to the introduction of a new task in the SemEval competition, with the 2020 edition seeing an increased interest in Task 3, the sarcasm detection task. As such, a great number of papers on this topic were published in 2018 and 2020, years in which SemEval took place. However, the positive trend of research articles in the field must be noted even disregarding the SemEval competition. Ever since its inception, sarcasm detection has seen an increasing number of articles published every year, with a record 14 total articles identified for the year 2020. The trend of publication can be observed in Figure 2.

Results
The thematic analysis of the papers highlights two main topics: sarcasm dataset creation and sarcasm detection. The first topic is mainly, but not exclusively, concerned with the introduction of a new dataset or a new rule by which a dataset can be constructed. The second is mainly, but not exclusively, concerned with the task of sarcasm detection itself and uses an established dataset or creates its own dataset based on rules set by others. Some papers tackle both topics.

Sarcasm Detection Datasets
First, the datasets and dataset generation rules used by researchers will be presented and analyzed. There are two approaches used when constructing a dataset: distant supervision and manual annotation.
Distant supervision is the quicker, more efficient way to build a dataset. It enables researchers to tap into the APIs of established social networks or websites, such as Twitter, Reddit or Amazon, and to collect millions of examples without any manual labor. Examples are considered positive, i.e., sarcastic, when they meet a certain criterion: #sarcasm for Twitter or /s for Reddit. Twitter is the most popular source of data, and most dataset generation rules for it require the researcher to query the APIs for tweets that have one or more hashtags, such as #sarcasm, #irony, #sarcastic, #not or others. The data are then filtered by eliminating tweets where the hashtags are not at the end, retweets or non-English tweets [5,9-11].
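To illustrate, the filtering rules above can be sketched as follows. This is our own minimal simplification, not a reproduction of any cited pipeline: the function name, tag set and rule order are assumptions for the sake of the example.

```python
# Hypothetical sketch of distant-supervision labeling for tweets.
# Rules simplified from the surveyed papers: the sarcasm hashtag must be
# the final token, retweets are dropped, and only English tweets are kept.

SARCASM_TAGS = {"#sarcasm", "#sarcastic", "#irony", "#not"}

def keep_tweet(text: str, lang: str = "en") -> bool:
    """Return True if the tweet passes the common filtering rules."""
    tokens = text.split()
    if not tokens or lang != "en":
        return False
    if tokens[0].upper() == "RT":        # drop retweets
        return False
    # the sarcasm hashtag must be the final token; otherwise the tweet
    # is discarded as ambiguous
    return tokens[-1].lower() in SARCASM_TAGS

print(keep_tweet("I just love Mondays #sarcasm"))       # True
print(keep_tweet("#sarcasm is everywhere these days"))  # False
print(keep_tweet("RT @user great game #sarcasm"))       # False
```

Even under these simple rules, the noise discussed later in this review is visible: a genuinely sarcastic tweet without a trailing hashtag is silently assigned to the negative class.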
Manual annotation uses crowdsourcing, employing human labor. A target utterance is presented to an annotator, who must state whether it is sarcastic or not. Until recently, the annotator was never the author of the utterance, so the labeling was based on perceived sarcasm [12]. A new dataset, iSarcasm, uses the authors of the utterances as the annotators, the labeling occurring based on intended sarcasm [13]. A trend that is shaping sarcasm datasets is the use of multiple data sources. While most papers prior to 2020 almost exclusively used Twitter datasets, more recent papers have displayed interest in multiple data sources, such as Reddit, news, books or even YouTube. A distribution of the dataset sources for the identified papers is presented in Figure 3. It can be noted that Twitter is by far the most popular dataset source when it comes to automatic sarcasm detection, due to its popularity among English speakers, its simple and concise texts and the ease of extracting data through distant supervision by use of hashtags and APIs.
Recent trends for sarcasm datasets point toward specially curated sets that are manually built or labeled, rather than scraped from social media websites or forums. Researchers have opted for this approach to improve the quality of the data required to train better models. Sarcasm is known to be difficult to detect even under ideal conditions, and the noisy data usually found on the internet, for example on Twitter, no longer suffice.
Next, the datasets used by the selected papers will be presented and analyzed. For each dataset used in each paper, certain characteristics will be presented, inspired by the work of Joshi et al. [14]. There are three main metadata categories: annotation, context and dataset. The annotation category presents the method used to annotate the utterances as positive or negative. The two main techniques used are manual annotation and distant supervision. For manual annotation, a human judge analyzed the utterance and labeled it accordingly. For distant supervision, the researchers used signals, such as Twitter hashtags, to label the data. For example, a tweet is considered sarcastic if it has #sarcasm in its text. The context category is split into author and conversation context. The datasets that have author context contain information about the author of the target utterance, like past activity or profile information. Conversation context contains information about the sentences preceding the target utterance in a conversation. The dataset category presents information about data type. Short datasets are composed of small texts, such as tweets; long datasets are composed of long texts, such as news articles or Reddit conversations; and other data represent non-text data, such as images or video data. The size represents the total number of instances present in the dataset. These characteristics for each identified dataset are presented in Table 1. The findings are further discussed in Section 4 (* same datasets used for the FigLang2020 workshop).

Automatic Sarcasm Detection
The approaches used to tackle automatic sarcasm detection have evolved throughout the years. They have transitioned from being rule- or feature-based to machine learning-based and, most recently, deep learning-based. An overview of the models used in the selected papers, in chronological order starting from 2010, is shown in Table 2. From Table 2 it can be noted that automatic sarcasm detection has followed the trends that have shaped NLP research in the past decade. The first papers published in this area focused on machine learning and feature-based models. Classifiers such as Support Vector Machine [69], Logistic Regression, Decision Tree [70], Naïve Bayes and Random Forest [71] dominated the landscape in the first years of the decade. Then, a shift began toward deep neural networks, with Convolutional Neural Networks [72], Long Short-Term Memory networks [73] and other configurations shaping the progress of sarcasm detection. Recent years have seen the surge of transformers, with BERT [74] and RoBERTa [75] setting the path for new advancements in the field. Additionally, the analysis of these papers has highlighted new trends in the field, with multi-modal approaches becoming more popular alongside the integration of deep learning.
Next, the reported performance of the selected articles will be presented and analyzed. Because sarcasm detection is a classification problem, the metrics used by researchers are Precision, Accuracy, Recall, F1-score and Area Under the Curve (AUC). The performance information is found in Table 3. The findings are further discussed in Section 4.
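For reference, these metrics (apart from AUC) derive directly from the confusion-matrix counts. The sketch below computes them for a toy binary prediction; the helper name and the toy labels are our own, chosen purely for illustration.

```python
# Minimal sketch of the classification metrics used in the surveyed
# papers, computed from true/false positive and negative counts.

def metrics(y_true, y_pred):
    """Precision, Recall, F1 and Accuracy for binary labels (1 = sarcastic)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / len(y_true)
    return precision, recall, f1, accuracy

# toy example: 1 = sarcastic, 0 = not sarcastic
p, r, f1, acc = metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
print(round(p, 2), round(r, 2), round(f1, 2), round(acc, 2))
```

Note that, on the imbalanced datasets common in this field, Accuracy alone can be misleading, which is why F1-score is the metric most often reported.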

Discussion
As can be seen from Table 2, the early years of sarcasm detection were dominated by machine learning models. The first robust algorithm used for sarcasm detection, the semi-supervised sarcasm identification algorithm (SASI) for detecting sarcasm in Twitter and Amazon product reviews [5], appeared in 2010. At the time, all systems failed to correctly classify the sentiment of sarcastic sentences. This algorithm used two modules: semi-supervised pattern acquisition to identify sarcastic patterns that could be used as features in a classifier, and a classification stage to assign each sentence to a sarcastic class. The authors of [10] studied lexical and pragmatic features in tweets, using unigrams and dictionary-based features to classify sarcastic, positive and negative tweets by employing two classifiers: SVM and logistic regression.
From the analyzed papers, there are interesting approaches that must be noted. One such approach is the SCUBA framework [22]. The authors wanted to improve sarcasm detection on Twitter by integrating the past tweets of the target tweet's author. They developed several features derived from forms that sarcasm can take, such as contrast of sentiments, the form of written expression, the means of conveying emotion and others. They trained several models and chose L1-regularized logistic regression as the preferred option. The results showed that the accuracy of the predictions increased as the number of past tweets that the model had access to increased. Other researchers also accounted for context [23]. They extracted several features that could give information about said context and split these features into three categories: tweet, author and audience features. They used binary logistic regression with L2 regularization to classify the texts. The results showed that the author features were the most salient, the performance of the classifier improving almost as much as when all the features were introduced into the model. Additionally, the authors found that #sarcasm was used by tweet authors when they were not familiar with their audience and wanted to make sure their message was correctly perceived. These papers highlight that when context is accounted for, the performance of sarcasm detection models increases.
Some researchers introduced emojis into the mix [37]. The authors employed a deep learning approach and trained their own word embeddings to properly capture the salient information provided by emojis. The DeepMoji model that the authors proposed consists of an embedding layer that feeds into two BiLSTM layers, which feed into an attention layer and a final softmax activation function that makes the prediction. The results showed that the diversity of the emojis used is important to the performance of the model.
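The attention layer in such architectures pools the BiLSTM hidden states into a single vector via softmax-normalized weights before the final classifier. Below is a minimal pure-Python sketch of this pooling step only, not of the full DeepMoji model; the fixed toy context vector stands in for parameters that the real model learns end to end.

```python
import math

# Sketch of attention pooling over a sequence of hidden states, as used
# between the recurrent layers and the classifier in DeepMoji-style
# models. Scores come from a dot product with a (here fixed) context
# vector; a trained model would learn these parameters.

def attention_pool(hidden_states, context):
    """Weight each hidden state by softmax(h . context) and sum them."""
    scores = [sum(h_i * c_i for h_i, c_i in zip(h, context))
              for h in hidden_states]
    max_s = max(scores)                        # stabilize the softmax
    exps = [math.exp(s - max_s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states))
              for d in range(dim)]
    return pooled, weights

states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(states, context=[1.0, 0.0])
print(weights)  # states aligned with the context receive larger weights
```

The appeal of this mechanism for sarcasm detection is that the weights reveal which tokens (or emojis) the model considered decisive, which aids error analysis.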
Other interesting approaches integrated English with other languages, like Cantonese [20] or Hindi [56]. The Chinese model first extracted sarcasm features from Cantonese and English texts, then applied weighted random sampling to these texts, followed by bagging. A weighted vote was applied to find the best classifier. The Indian model consisted of three parts: one that uses an attention-based BiLSTM to generate context vectors for English, one that uses Hindi-SentiWordNet to generate feature vectors for Hindi, and a classifier trained on three feature sets: English, Hindi and auxiliary pragmatic features. Sarcasm detection has also been implemented to counter cyberbullying [66]. The study found similarities between ironic and sarcastic tweets. Moreover, sarcasm was found to be a strong indicator of the presence of cyberbullying, further proving the practical applicability of sarcasm detection in NLP tasks.
Oprea and Magdy [48] defined author context as the embedded representation of an author's historical Twitter posts and proposed neural models for extracting these representations. They tested two tweet datasets, one manually labeled for sarcasm and the other using tag-based distant supervision. Exclusive models in the authors' proposed architecture did not use the current tweet being classified, instead basing the prediction solely on user history. In contrast, inclusive models took into account both user history and the most recent tweet.
Multimodal approaches must also be noted, due to their increased popularity in recent years. The first multimodal approach integrated images into the sarcasm detection task [30]. The authors collected data from three social media platforms: Twitter, Instagram and Tumblr. They then employed two approaches to sarcasm detection: an SVM approach and a deep learning approach. For the SVM approach, they extracted NLP and visual semantics features. For the DL approach, they used two networks, an NLP one and a visual one, which they then fused in order to obtain the prediction. The results showed that integrating visual information improved performance for the Instagram set, while it was inconsequential for Twitter and Tumblr. Additionally, text features proved to offer little to the performance of the deep learning approach. Another approach [51] proposed a hierarchical fusion model that implemented three feature representations: image, image attribute and text. The authors treated text features, image features and image attributes as three modalities. The proposed hierarchical fusion model extracted image features and attribute features first and then used the attribute features and a bidirectional LSTM (BiLSTM) network to extract text features. After that, the model reconstructed the features of the three modalities and fused them into a single feature vector for prediction. The authors trained and tested their approach on a multi-modal dataset based on Twitter.
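At its simplest, the fusion step combines the per-modality feature vectors into one vector for the classifier. The sketch below shows plain concatenation of hypothetical text, image and attribute features; this is a deliberate simplification of the hierarchical fusion in [51], which reconstructs and weights the modalities before fusing, and all values and names here are illustrative assumptions.

```python
# Simplified sketch of multimodal feature fusion: per-modality feature
# vectors (toy values here) are concatenated into one vector that a
# downstream classifier would consume.

def fuse(*modalities):
    """Concatenate feature vectors from each modality into one vector."""
    fused = []
    for vec in modalities:
        fused.extend(vec)
    return fused

text_features = [0.2, 0.7, 0.1]   # e.g. from a BiLSTM over the tweet text
image_features = [0.9, 0.3]       # e.g. from a CNN over the image
attribute_features = [0.5]        # e.g. predicted image attributes

vector = fuse(text_features, image_features, attribute_features)
print(len(vector))  # 6
```

More elaborate schemes replace this concatenation with attention between modalities, as in the BERT-based approach discussed next, but the principle of reducing several modalities to one prediction-ready vector is the same.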
A multimodal approach [52] based on BERT for text preprocessing was also proposed. The study was conducted on Twitter data that had both text and images. The model integrated three components: text, hashtag and image. It made use of both inter-modality attention, between image and text, and intra-modality attention, within the text.
One interesting observation is that the best performing solutions in SemEval 2020 used ensemble methods and/or implemented data augmentation. Therefore, semi-supervised techniques in conjunction with transformer-based architectures could attain superior results over other approaches and should be favored going forward.
There are multiple observations to be made from Table 1. First, the variability in dataset size must be noted. With the exception of the FigLang2020 workshop, most papers use datasets of different sizes. Even the papers that try to use established sarcasm datasets, such as the Riloff tweet dataset, encounter difficulties. Because these datasets were constructed through the Twitter API and only tweet IDs were given, the datasets deteriorate as time passes. Due to Twitter's changes of policy, users deleting their tweets or other events, the datasets become smaller and information is lost. This size variability makes performance benchmarking more difficult because parity is lost. For some papers, the dataset size information is missing altogether. Better reporting of the dataset used and its characteristics should be adopted by future researchers.
Next, it can be seen that most datasets are annotated through distant supervision. Forty-six of the unique datasets identified are constructed this way, more than double the number of manually annotated ones, i.e., twenty. Research has shown that manual annotation is superior to distant supervision, and future research should focus more on building and working with manually annotated datasets. Context is also found to be lacking, especially in older datasets. Only 10 unique datasets have author context and 18 have conversation context. Again, future research should focus on building and working with datasets that include context information if better performing solutions are to be developed. On the topic of data type, most datasets are composed of short texts (44), fewer are composed of long texts (19) and only 6 include non-text data. Multi-modal approaches have generated increased interest in recent years, and more datasets that incorporate different types of data, like MUStARD, should be developed.
There is valuable information that can be extracted from Table 3. At first glance, the performance boost of deep learning can be observed. Solutions proposed from 2016 onward see an increase in metric scores. However, the scores do not tell the whole story. As seen in Table 2, the great variability of datasets, or in the size of the same dataset, makes performance benchmarking a difficult task. Past solutions were able to be trained on complete datasets and achieve performance equal or superior to more recent implementations, especially where the proposed approach is data hungry, as transformer-based architectures are. The performance of modern solutions is, however, superior to past approaches. This difference is highlighted in the FigLang 2020 workshop, where the winning proposal netted excellent results, far superior to any past solution.
However, this study has identified some key issues with the automatic sarcasm detection task. These issues spring from the datasets and dataset creation rules. First, for distant supervision, the dataset ends up being noisy. The assumption is made that sarcasm is present only in instances that carry a certain identifier, such as #sarcasm or /s. This is untrue, and sarcastic instances end up in the negative class only because they lack the identifier, leading to false negatives. Furthermore, this process only captures one type of sarcasm, one that is specific to a clearly established setting, a Twitter or a Reddit thread. This limits the ability of models to identify other flavors of sarcasm and therefore also leads to false negatives.
Manual annotation methods have also proven to create less than ideal datasets. One crucial problem is perceived vs. intended sarcasm. For almost all datasets built this way, the annotator is different from the author of the target utterance. This can lead to a low agreement rate between the author and the annotator and can produce both false positives and false negatives. Training a model on these sets is akin to training it on perceived sarcasm. To counter this, iSarcasm has the authors of a target utterance annotate it. Therefore, the utterance is correctly labeled by its author.
However, the problem of perceived vs. intended sarcasm does not go away. Training a model on this dataset will simply shift the perception to intended sarcasm. Research has shown that different cultures perceive sarcasm differently, and second-language speakers have difficulties understanding sarcasm [76,77]. As such, a dataset that is skewed toward either side can prove detrimental to universal sarcasm detection, if such a task can even be performed.
Furthermore, sarcasm detection has traditionally proven to be a very difficult task, even for humans. By relying heavily on a single source, Twitter, the different flavors and facets of sarcasm are lost. Recent approaches, such as the one by Castro et al. [49], must be encouraged. The dataset that the authors propose is collected from sitcoms and has text, audio and video data. However, it is not without fault. The nature of sitcoms is to exaggerate situations and purposefully make jokes; therefore, the sarcasm present in them tends to differ from that used daily by humans. As such, better sources of sarcasm data must be found or created.
As previously stated, the models used in the automatic sarcasm detection task have evolved throughout the years. NLP trends have shaped the task, with the large-scale adoption of deep learning starting in 2016 and transformers shaping the landscape in 2020 and 2021. For example, all papers published from 2020 onward have included transformers. Their impact cannot be overstated, and the automatic sarcasm detection task has benefited greatly from their introduction. However, it can be argued that the problem has been wrongly defined. If the goal is to help corporations identify sarcastic tweets and respond correctly, then the path might be right. However, if the goal of NLP research is to build models that can replicate human-level ability, then sarcasm detection still has a long way to go.
Recent trends in the automatic detection of sarcasm point towards multi-modal, deep learning approaches. In recent years, researchers have shifted their focus to implementing transformer-based architectures in order to tackle this task. As seen in other areas of NLP, transformers have brought a new era of performance and have quickly achieved the state of the art. However, because multiple, heterogeneous datasets are used for the sarcasm detection task, benchmarking these models remains difficult. Going forward, a handful of datasets could be identified that would serve benchmarking purposes and help the development of the research field. Both MUStARD and iSarcasm could be great starting points for selecting such datasets.
The systematic literature review allowed us to cover a broad field of research and to extract valuable information that was presented and discussed in the previous sections. However, this study has some limitations. Future researchers could use the same methodology and query more databases. Additionally, they could introduce more search terms in their queries.
This study also disregarded any papers that were not published in English and, therefore, ignored important research on sarcasm detection in other languages. Being time-bound by nature, this study can be replicated in the future to assess the progress that has been made in the field.

Conclusions and Future Research
Stemming from the findings presented in the previous section, a few directions for future research will be presented. First and, we believe, most importantly, researchers should investigate better ways to construct their datasets. Twitter can be a good source of data, but it must not be the only one. Researchers fight an uphill battle in this regard, with few networks providing access to their data, but a more varied approach must be considered. Multimodal datasets can also be further studied, as they tend to capture more of the nuances of sarcasm. Therefore, the following questions are asked:
a. What does automatic sarcasm detection want to achieve?
b. What are the best data to train models on to replicate human-level ability?
Second, as sarcasm has proven to be heavily dependent on context, sarcasm context detection could be explored. Researchers could develop models that correctly identify whether a certain context is appropriate for sarcasm and, therefore, develop speech systems that could use it correctly. This could lead to more natural human-machine communication, as sarcasm is an integral part of human culture. This leads to the following question:
c. What is the correct context in which to use sarcasm?
Third, future researchers could explore the relation between different languages and cultures regarding sarcasm. They could develop models that can correctly translate or interpret a sarcastic remark from one language to another and identify whether a certain context is appropriate for sarcasm in multiple languages. Interesting systems that can improve our understanding of sarcasm can also be developed. Such systems could transform normal utterances into sarcastic ones or vice versa, akin to a sarcasm translator that could decode the intent of someone like Chandler, a highly sarcastic character from the "Friends" TV show. Such a system could lead to a better understanding of sarcasm and its application in a machine environment. This leads to the following question:
d. How can machines generate sarcasm?

After analyzing the selected papers, the two research questions considered in this paper can be answered as follows: the main area in which automatic sarcasm detection has seen improvement is the models. They have evolved in tandem with all of NLP research, from machine learning to transformers. Additionally, recent trends in the field have been identified, for both datasets and methods. However, one area that has seen slow improvement is the selection of data for the task. It has remained mostly unchanged for the past decade, and this has proven to be a problem. A re-evaluation of the task could be carried out by researchers and future avenues could be explored, especially regarding the data selection process.
However, this should not discourage future researchers. Sarcasm is a beautiful characteristic of human language and culture, and its application to a machine environment was never going to be easy. This review serves as an assessment of the work carried out so far and a peek into a future where humans and machines can get along even better than today.

Figure 1. Inclusion and Exclusion Criteria, following the PRISMA Guidelines.

Figure 2. Publication trend by Year for Studies on Automatic Sarcasm Detection in English. Source: Own work. Note: 2021 until end of May.

Figure 3. Dataset source distribution for the identified papers. Source: Own work.

Table 1. Dataset information for the selected papers.

Table 2. Overview of the models used in the selected papers. Source: Own work.

Table 3. Reported performance for each identified paper.