A Multi-View Interactive Approach for Multimodal Sarcasm Detection in Social Internet of Things with Knowledge Enhancement

Abstract: Multimodal sarcasm detection is a developing research field in the social Internet of Things and a foundation for artificial intelligence and human psychology research. Sarcastic comments posted on social media often imply people's real attitudes toward the events they comment on, reflecting their current emotional and psychological state. Additionally, the limited memory of Internet of Things mobile devices poses challenges for deploying sarcasm detection models, and an abundance of parameters also increases a model's inference time. Social networking platforms such as Twitter and WeChat have generated a large amount of multimodal data. Compared to unimodal data, multimodal data can provide more comprehensive information. Therefore, when studying sarcasm detection in the social Internet of Things, it is necessary to consider both inter-modal interaction and the number of model parameters. In this paper, we propose a lightweight multimodal interaction model with knowledge enhancement based on deep learning. By integrating visual commonsense knowledge into the sarcasm detection model, we enrich the semantic information of the image and text modal representations. Additionally, we develop a multi-view interaction method to facilitate interaction between modalities from different modal perspectives. The experimental results indicate that the proposed model outperforms the unimodal baselines and achieves performance comparable to the multimodal baselines with a small number of parameters.


Introduction
There is a close relationship between human activities and sarcastic communication, which plays a vital role in daily life. Sarcasm is a unique form of expression; the Oxford Dictionary defines it as "a way of using words that are the opposite of what you mean in order to be unpleasant to someone or to make fun of them". Given a sample consisting of a sentence W and an image I, multimodal sarcasm detection aims to predict the sample's label from the set {Sarcasm, Non-Sarcasm}. From the Internet of Things perspective, identifying sarcastic information can help machines in many ways, such as building more anthropomorphic human-computer interaction applications on the social IoT; this is also why sarcasm detection has attracted attention. The social Internet of Things is a "social network of intelligent objects" paradigm based on social relationships between objects. It can capture the Internet of Things and discover services and resources through social relationships [1,2]. Sarcasm detection can be applied in social Internet of Things applications such as sentiment analysis, opinion mining, and dialogue generation [3-7]. Sarcasm may mislead the predictions of sentiment analysis and opinion-mining methods, and sarcastic expression can influence a dialogue system's performance in social robots. Another common application is multimodal data analysis on mobile social platforms such as WeChat, Facebook, and Twitter. Early sarcasm detection research mainly utilized machine learning methods to mine information, such as viewpoints or emotional orientation, from text for text classification [8,9]. With the advancements in mobile social media technology, people can share their daily lives or comment on current hot events through multimodal data composed of videos, images, and text on platforms such as Twitter and Weibo, resulting in a substantial increase in the volume of available multimodal data. Sarcasm detection has therefore gradually shifted from unimodal to multimodal research [10]. This paper focuses on multimodal sarcasm detection over images and text on Twitter. For image-text sarcasm detection, we need to recognize that images and text express different psychological states of users. Therefore, multimodal sarcasm detection requires simultaneously extracting semantic information from images and texts and considering their incongruity.
For multimodal sarcasm detection with images and text, some researchers use concatenation operations [11], attention mechanisms, or graph neural networks to fuse multimodal data [12,13]. Although these studies have achieved excellent performance, they still have the following drawbacks:

1. They overlook the role of commonsense knowledge in supplementing image and text features. For example, subtitles and background descriptions of images can serve as visual commonsense knowledge to enhance unimodal feature representations. Intuitively, visual commonsense knowledge provides additional semantic information.

2. They lack effective modal interaction approaches. Current research on multimodal interaction is overly simple and needs to view the multimodal sarcasm detection task from multiple perspectives.

3. They do not take into account the impact of model size on mobile devices. Processing multimodal data in particular requires extracting each modality's features and typically demands a larger modal fusion model.
However, most existing methods for sarcasm detection extract semantic representations for each modality separately and subsequently fuse the multimodal features through complex networks. The sarcastic information embedded in text and images is mutually complementary, together expressing the user's current intentions and feelings. In previous literature, each modal feature is modeled separately, and relatively little attention is paid to the interaction between modalities. Consequently, for sarcasm detection, it is essential to construct a multimodal interaction model that integrates visual commonsense knowledge.
For the above reasons, this paper focuses on multimodal sarcasm detection suitable for mobile devices with limited computing resources. We propose a multi-view interaction model with knowledge enhancement (MVK) for multimodal sarcasm detection. MVK mainly consists of two parts: interactive learning and feature fusion. Firstly, we employ the pre-trained model ResNet [14] to obtain each image's attribute labels, which effectively represent the object and background information of the image. These attribute labels serve as visual commonsense knowledge; from the perspective of multi-view learning, they can also be regarded as the image attribute view. We then extract image, text, and visual commonsense knowledge representations. In the interactive learning stage, text and image features are learned through the self-influence and mutual influence of text and images under the guidance of visual commonsense knowledge. Specifically, we concatenate the knowledge features with the text and image features, respectively. Through an attention mechanism, we obtain the importance of the knowledge for each modality and enhance the single-modality representations. Subsequently, an inter-modal interaction module lets image and text information interact to learn inter-modal mutual and incongruous information. Within this module, the attention weight matrix from the text perspective is employed to extract image features, whereas the attention weight matrix from the image perspective is employed to extract text features. We optimize the attention weight matrix of each modality through interactive learning. Finally, in the feature fusion stage, a late-attention mechanism fuses the representation vectors of each modality. The constructed multimodal sarcasm detection model also has few parameters. The experiments demonstrate that the presented approach performs excellently in multimodal sarcasm detection.
The main contributions of this paper can be stated as follows:

• We employ ResNet to acquire attribute information of images as visual commonsense knowledge and leverage this knowledge to enhance the representation of each modality.

• The proposed method enhances multimodal information interaction. The model can learn inter-modal differences and mutual information through different modal views, improving the model's robustness.

• A series of experiments and analyses indicate that the presented model can effectively utilize text and image information to improve the performance of multimodal sarcasm detection. Concurrently, the constructed model has fewer parameters, reducing memory pressure and computational resource demands.
The rest of the paper is arranged as follows. Section 2 introduces related work. Section 3 provides a detailed description of the proposed method; the experimental settings and results are presented in Sections 4 and 5. Section 6 concludes the paper.

Unimodal Sarcasm Detection
Early sarcasm detection primarily concentrated on the text modality [15-17]. On the basis of textual data, early works extract paragraph-level, sentence-level, or word-level fine-grained features. Research on text sarcasm detection falls mainly into three categories: rule-based, machine learning-based, and deep learning-based. Riloff et al. [18] propose a rule-based algorithm that iteratively expands positive and negative phrases and then uses these learned words to predict irony labels, but it lacks sufficient adaptability. With the widespread use of machine learning algorithms, researchers began to extract manual textual features and design machine learning algorithms for sarcasm detection. Ghosh et al. [19] use an SVM as their classifier to solve this problem. These methods rely on the quality of the features and do not handle data with similar features well. Since deep learning architectures can extract features automatically, a large body of deep learning work has been applied to sarcasm detection. Poria et al. [20] leverage a mixed CNN-SVM to encode emotional and personal features for sarcasm detection. Xiong et al. [21] use an attention mechanism to capture word differences to detect sarcastic discourse. Ilic et al. [22] employ ELMo to extract a word's character-level representation to capture contextual sarcasm. These deep learning-based methods have achieved excellent performance. However, due to the increase in multimodal data, researchers are paying more attention to multimodal sarcasm detection.

Multimodal Sarcasm Detection
Applications such as social networks and news websites have generated abundant multimodal data. As a result, extensive multimodal investigations have been conducted, for example, sentiment analysis [23-27], image-text retrieval [28], reason extraction [29,30], and sarcasm detection [31]. Unlike text-based sarcasm detection, multimodal sarcasm detection aims to identify sarcastic expressions implied in multimodal data. Schifanella et al. [11] first address multimodal sarcasm detection on social platforms through manually designed text and image features; manual features incur higher costs. Chauhan et al. [32] introduce a multitasking framework to identify irony and emotions. Furthermore, Cai et al. [12] create a new text-image sarcasm dataset collected from Twitter and propose an attention-based fusion model for multimodal sarcasm recognition that performs a simple late fusion. Xu et al. [33] exploit relational and decompositional networks to model semantic correlation in sarcasm recognition. These early studies with simple fusion networks make it difficult to extract mutual information between modalities. The HKE model [34] proposes the joint use of cross-modal graphs and syntactic analysis; these studies employ adjective-noun pairs to enrich multimodal information. Pan et al. [35] propose inter-modal and common attention mechanisms for sarcasm detection. The recently proposed DIP framework [36] leverages a contrastive loss to capture sarcastic information from factual and emotional perspectives. Liang et al. [13] focus on learning incongruous relationships through a cross-modal graph convolutional network, building intra-modal and cross-modal graphs for each multimodal sample. Furthermore, they [37] present a cross-modal graph neural network in which the edge weights come from SenticNet to capture inter-modal inconsistency. Jiang et al. embed sentiment words into multimodal vectors [38]. These graph neural networks achieve excellent performance but incur considerable computational cost.
These studies achieve high classification accuracy by learning the inconsistent relationships between modalities through attention mechanisms or graph neural networks. Although the above deep learning-based methods avoid the high cost of manually designing features, there is still room for improvement in utilizing commonsense knowledge. Meanwhile, they focus on improving classification performance while ignoring model size. This paper proposes a multimodal interaction model based on knowledge augmentation to utilize visual knowledge information. The model has fewer parameters, which is beneficial for deployment on mobile devices.

Methodology
The approach presented in this paper consists of three components: a feature extraction module, a multi-view interaction module, and a late fusion module. Pre-trained models extract the multimodal features in the feature extraction module. The multi-view interaction module lets the multimodal messages interact under the guidance of visual commonsense knowledge. Lastly, the late fusion module leverages an attention mechanism to fuse the multimodal representations. The overall architecture is illustrated in Figure 1. For the image features, we employ the pre-trained DenseNet-121 [39]. Each image is divided into 14 × 14 regions, and each region is fed into DenseNet to obtain the local feature C_i^image. We then average all region features to obtain the image representation v^image = (1/n) Σ_i C_i^image, where n = 196 is the number of regions. C_i^image is used for late attention fusion, and v^image is used for multimodal interaction.
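As a minimal sketch of the region-averaging step above: the region count (14 × 14 = 196) follows the text, while the 1024-dimensional feature size is an assumption for illustration, and random vectors stand in for real DenseNet-121 region features.

```python
import numpy as np

# Sketch of the image branch described above. The region count (14 x 14 = 196)
# follows the text; the 1024-d feature size is an assumption, and random
# vectors stand in for real DenseNet-121 region features.
n_regions, feat_dim = 196, 1024
rng = np.random.default_rng(0)
C_image = rng.standard_normal((n_regions, feat_dim))  # C_i^image, one row per region

# v^image: mean over all region features, used for multimodal interaction
v_image = C_image.mean(axis=0)
```

The per-region features C_i^image are kept alongside v^image because the late fusion module attends over them later.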

Text Feature
In this section, we utilize the pre-trained model BERT-base [40] to extract the raw text features. For each sample, we extract the word-level features C_i^text, where w_i denotes the i-th word in the text; the maximum text length is set to 75, i.e., i ∈ [1, 75]. The output of BERT's last layer serves as the text's sentence-level feature v^text. C_i^text is used for late attention fusion, and v^text is used for multimodal interaction.

Knowledge Feature
Previous studies have revealed that introducing extra knowledge can enrich the semantic information of images or texts and enhance the robustness of the model. We utilize a ResNet model trained on the COCO dataset [41] to predict the attribute labels of each image in the evaluated dataset; that is, we acquire words that describe the content and background of the image. These words serve as the visual commonsense knowledge that establishes connections between text and images. Subsequently, we utilize BERT to extract the representation of each word, where k_i denotes a knowledge word and i ∈ [1, 5]. Attention weighting is then applied to these representations to obtain the knowledge features: C_i^knowledge is used for late attention fusion and v^knowledge is used for multimodal interaction.
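A hedged sketch of the attribute-label branch: five label embeddings (random stand-ins for BERT outputs) are attention-weighted into a single knowledge vector. The scoring vector `w` is an assumed learnable parameter; the paper does not specify the exact scoring form.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-d array
    e = np.exp(x - x.max())
    return e / e.sum()

# Five attribute-word embeddings k_1..k_5; random stand-ins for BERT outputs.
# The scoring vector `w` is an assumed learnable parameter (not from the paper).
d = 512
rng = np.random.default_rng(1)
C_knowledge = rng.standard_normal((5, d))  # one embedding per attribute word k_i
w = rng.standard_normal(d)                 # assumed attention scoring parameter

alpha = softmax(C_knowledge @ w)           # one weight per label, sums to 1
v_knowledge = alpha @ C_knowledge          # weighted sum: knowledge feature v^knowledge
```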

Multi-View Interaction Module
In this section, we elaborate on the interaction process of the multimodal data, which is enhanced with visual commonsense knowledge. Before the data are fed into the multi-view interaction module, the text, image, and knowledge features are first aligned in dimension through a fully connected layer. As displayed in Figure 1, the multi-view interaction module has N layers. In the i-th layer, the text and image features are first concatenated with the visual commonsense knowledge separately. Then, we utilize scaled dot-product attention to enhance the text and image representations with visual commonsense knowledge, where d = 512, and attw_i^T and attw_i^I are the attention weights for the text and image modalities, respectively. W_i^T and W_i^I are trained parameters. The symbols • and ⊗ denote the dot product and matrix product, respectively. F_i^T and F_i^I are the text and image modal representations with knowledge enhancement through the attention mechanism. We then utilize the attention weights to facilitate information interaction between text and images, yielding f_i^T and f_i^I. After that, F_i^T is concatenated with f_i^T, and F_i^I is concatenated with f_i^I; they then pass through a projection layer consisting of two linear layers, a LeakyReLU activation function, and a LayerNorm layer.
The current layer's output serves as the next layer's input.
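The cross-view step above can be sketched as follows: the attention weights computed from the text view are applied to image features, and vice versa. The sequence lengths (75 tokens, 196 regions) and d = 512 follow the text, while the random matrices are stand-ins for the knowledge-enhanced projections that the paper parameterizes with trained matrices.

```python
import numpy as np

def scaled_dot_attention_weights(q, k):
    """Row-wise softmax(q k^T / sqrt(d)), as in standard scaled dot-product attention."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Random stand-ins for one interaction layer's knowledge-enhanced inputs.
d = 512
rng = np.random.default_rng(2)
T = rng.standard_normal((75, d))    # text token features
I = rng.standard_normal((196, d))   # image region features

attw_T = scaled_dot_attention_weights(T, I)  # text-view weights over image regions
attw_I = scaled_dot_attention_weights(I, T)  # image-view weights over text tokens

f_T = attw_T @ I  # text-view weights extract image features
f_I = attw_I @ T  # image-view weights extract text features
```

The cross-application of the two weight matrices is what lets each modality select the other's features from its own perspective.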

Late Fusion Module
Finally, we adopt an attention mechanism to fuse the representation vectors of the text, image, and visual commonsense knowledge. The representations obtained from the multi-view interaction module are concatenated with the raw representations and fed into a fully connected layer to calculate the fused vector, where W_mn and b_mn are trained parameters, m, n ∈ ϕ, ϕ = {text, image, knowledge}, C_i^m denotes the i-th raw representation vector of modality m, and S_m is the sequence length.
Here, W_m and b_m are trained parameters, and v_p is the fused vector used in the classification layer.
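A hedged sketch of the late attention fusion: per-modality vectors are scored against a shared query and combined into a single fused vector v_p. The scoring query `q` and the use of one vector per modality are illustrative assumptions; the paper fuses sequences of raw and interacted representations.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-d array
    e = np.exp(x - x.max())
    return e / e.sum()

# One representation vector per modality (text, image, knowledge); random
# stand-ins. The fusion query `q` is an assumed learnable parameter.
d = 512
rng = np.random.default_rng(3)
modal_vectors = rng.standard_normal((3, d))
q = rng.standard_normal(d)

alpha = softmax(modal_vectors @ q)  # one attention weight per modality
v_p = alpha @ modal_vectors         # fused vector v_p for the classifier
```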

Classification and Loss Function
In this paper, we utilize a fully connected layer as the classification layer, followed by a sigmoid activation function. Binary cross-entropy loss is applied to optimize the model.
The training, validation, and testing process of the model is shown in Figure 2.
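The classification head described above can be sketched in a few lines; the weights are random stand-ins, and d = 512 matches the fused vector's assumed dimension.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(y_true, y_prob, eps=1e-7):
    """Binary cross-entropy averaged over samples; probabilities are clipped for stability."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

# Fully connected layer + sigmoid on the fused vector v_p; random stand-ins.
d = 512
rng = np.random.default_rng(4)
v_p = rng.standard_normal(d)        # fused vector from the late fusion module
W, b = rng.standard_normal(d), 0.0  # classification layer parameters

y_prob = sigmoid(v_p @ W + b)       # predicted probability of the sarcasm label
loss = bce_loss(np.array([1.0]), np.array([y_prob]))
```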

Experiment Setup

Dataset and Evaluation Metrics
In this paper, we use Cai et al.'s [12] publicly available multimodal sarcasm dataset collected from the Twitter platform. The dataset comprises images and corresponding text and contains 24,635 samples, divided into training, validation, and testing sets. Every sample is annotated with a sarcasm or non-sarcasm label, i.e., 1 or 0. Table 1 presents detailed statistics of the dataset. Following previous works, Accuracy (Acc), Precision (Pre), Recall (Rec), and F1 score (F1) are applied to evaluate model performance.
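For reference, the four evaluation metrics above can be computed from binary predictions as follows; this is a plain-numpy sketch, and sklearn's metric functions give the same results.

```python
import numpy as np

def prf_acc(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (positive class = 1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    acc = float(np.mean(y_pred == y_true))
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return acc, pre, rec, f1
```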

Hyper-Parameters Setting
In our experiments, the pre-trained models BERT and DenseNet-121 are employed to extract multimodal features. The pre-trained models' parameters are frozen when extracting raw features. The optimizer is Adam. The important hyper-parameters are shown in Table 2.

Compared Baselines
The compared baselines include models that use text, image, and text-image modalities as inputs.
TextCNN [42]: uses a 1D convolutional neural network to extract text features with few parameters.
TextCNN-RNN [42]: adds a recurrent neural network on top of TextCNN to further extract text features.
BERT [40]: BERT-base-uncased is utilized to extract textual features and is fine-tuned for the downstream sarcasm task.
Image [43]: purely employs the image vectors after the pooling layer of DenseNet to predict sarcasm detection results.
ViT [43]: like BERT, the vision pre-trained model ViT extracts the image representation by inserting the [CLS] token for image sarcasm detection.
HFN [12]: utilizes GloVe and ResNet as feature extractors, then adopts a simple attention fusion layer to fuse the text, image, and attribute modalities.
Attr-BERT [35]: presents a self-attention structure focusing on intra-modal information based on the BERT architecture.
HKE [34]: exploits atomic-level and composition-level congruity detection models based on a graph neural network with cross-modality attention.

Main Experiment Result and Analysis
The main results of our proposed MVK model and the compared baselines are presented in Table 3. The key evaluation indicator, F1, of the proposed model is 83.92%, with an accuracy (Acc) of 86.68%; precision (Pre) and recall (Rec) also improve. Among the compared baselines, using only the text modality yields better results than using only the image modality, indicating that text contains more information than images. However, unimodal models cannot sufficiently utilize the complementarity of multimodal data. Current research focuses on building multimodal fusion networks with multimodal data as input. For example, HFN and InCrossMGs use attention mechanisms and cross-modal graph convolutional networks to fuse text and image features directly. Attr-BERT, on the other hand, directly concatenates text and image data, using the BERT structure to extract multimodal features. However, these approaches ignore the interactivity between modalities and do not leverage visual commonsense knowledge to enrich the multimodal representations. We introduce visual commonsense knowledge into the model to deliberately enhance the representation vectors of each modality. Simultaneously, our model uses multiple views to interact inter-modal information dynamically over multiple turns. Finally, we fuse the multimodal representations through attention fusion. The experimental results show that our multi-view interaction model with knowledge enhancement can effectively improve the performance of multimodal sarcasm detection. In this section, we also quantify the total number of parameters of both the proposed model and the comparative baselines, as shown in Table 3. The comparative baselines employ multimodal data as input. In contrast to the HKE model, our proposed model exhibits comparable performance but contains significantly fewer parameters. This observation suggests that our approach, when deployed on mobile devices, can effectively mitigate memory constraints. Furthermore, the reduced parameter count decreases the computational overhead.

Ablation Study
To explore the role of each component in multimodal sarcasm detection, we conduct a series of ablation experiments and provide a detailed analysis. The ablation experiments remove the multi-view interaction module and the late fusion module, respectively. Table 4 reports the results. (Legend: w/o V denotes not employing multi-view interaction; w/o L denotes not employing late attention fusion and instead concatenating the multimodal representations after multi-view interaction; w/o denotes employing neither interaction nor fusion.) From Table 4, it is evident that the model's performance decreases significantly when neither component is used, indicating that in multimodal sarcasm detection, knowledge enhancement, multi-view interaction, and attention fusion are crucial for learning the relationships between multimodal data. Specifically, when the multi-view interaction module is removed (w/o V), the results decline noticeably, which indicates that interacting information through different modal perspectives improves the model's ability to learn dependency relationships between modalities. When the attention fusion is removed, the model's performance is slightly worse, indicating that a simple fusion method cannot make the model focus on the important information in the multimodal representations. Notably, the model achieves excellent performance in multimodal sarcasm detection when these two components are employed together, especially when visual commonsense knowledge is used to enrich the semantic information of the representations.

Impact of Multi-View Layers
To investigate the impact of the number of multi-view layers, we run experiments with the number of layers varying from 1 to 10. Figure 3 shows the results for Acc and F1. The model performs best with two multi-view layers. With only one layer, insufficient interaction between text and image information prevents the model from obtaining satisfactory features. As the number of layers increases further, performance gradually declines, indicating that excessive depth weakens the model's learning ability; the likely reason is that the growth in parameters leads to overfitting. Even when we increase this module's dropout rate, the model's performance remains constrained.

Case Study
We conduct a case study with attention visualization on a sarcastic sample and a non-sarcastic sample from the test set to analyze how our proposed model learns multimodal information. The results are exhibited in Figure 4. For multimodal sarcasm detection, models need to learn the inter-modal mutual and incongruous information; our architecture learns this through multi-view interaction with knowledge enhancement and attention fusion. As shown in Figure 4, for sample (a) with the sarcasm label, the image's attention heat map shows that the model focuses on the lower middle of the image, i.e., the lying dog, and the text's heat map shows that the model focuses on the words 'wasting time' and 'fun'. For sample (b) with the non-sarcasm label, the image's attention heat map focuses on the wild lizard, and the text's heat map focuses on the words 'browser', 'fronoze', and 'video'. We conclude that, with the help of visual commonsense knowledge and multi-view interaction, our model can effectively extract text and image features, enhance modal interaction, and concentrate on the important objects and words in the multimodal data.

Error Analysis and Limitation
By observing the incorrectly predicted samples in the test set, we note that a notable proportion contain images with little information: only a few words in the image, blurry visuals, or a single scene, as exhibited in Figure 5. Such images make the model fail to focus on essential positions and prevent it from extracting adequate visual information, so the model relies heavily on the text modality alone. Subsequent work can alleviate these issues by extracting words from images and using techniques such as image defogging to enhance visual quality. The research limitations are the inability to extract sufficient information from blurred or information-poor images and inadequate adaptability when a modality is absent.

Conclusions
The rapid evolution of social networks has led to the generation of vast amounts of multimodal data, much of it published from mobile devices. Mobile devices in the Internet of Things typically have limited memory and computing power. Therefore, this paper proposes a multi-view interaction model based on knowledge enhancement for multimodal sarcasm detection in the social Internet of Things. The model enhances the representation of each modality by leveraging visual commonsense knowledge. Furthermore, it utilizes a multi-turn multi-view interaction module to extract the mutual information and incongruous relationships between text and image. The experimental results validate the effectiveness of the proposed model. In addition, the model has fewer parameters and does not require a large amount of memory. The lightweight model also has a lower inference time, which can facilitate the development of the social Internet of Things. Nevertheless, parameter sharing or multi-teacher distillation can be considered in future work to compress the model further and make it more practical. In addition, considering the impact of poor image quality on model performance, future work can utilize more powerful feature extractors, such as ViT and CLIP. Optical Character Recognition (OCR) can also be employed to capture the textual information of simple images to enrich the image information.

Figure 1 .
Figure 1. The overall architecture of the multi-view interaction model with knowledge enhancement (MVK). The model consists of three modules: (a) the feature extraction module, (b) the multi-view interaction module, and (c) the late fusion module. In module (a), DenseNet and BERT are the feature extractors used to extract features at two levels. Module (b) has N layers, and each layer performs an inter-modal interaction based on knowledge augmentation.

Figure 2 .
Figure 2. The model's training, validation, and testing process. The test set result is based on the model with the minimum validation loss. The visualization data are from the test data.

Figure 3 .
Figure 3. Impact of multi-view interaction layers.

Table 1 .
The statistics of the multimodal dataset.

Table 3 .
Main results of the experiment.