Mathematics
  • Article
  • Open Access

30 November 2022

Hateful Memes Detection Based on Multi-Task Learning

1 Engineering Research Center of Cyberspace, Yunnan University, Kunming 650091, China
2 School of Software, Yunnan University, Kunming 650091, China
3 Yunnan Key Laboratory of Statistical Modeling and Data Analysis, School of Mathematics and Statistics, Yunnan University, Kunming 650091, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Advances in Artificial Intelligence: Models, Optimization, and Machine Learning, 2nd Edition

Abstract

With the popularity of posting memes on social platforms, the severe negative impact of hateful memes is growing. As existing detection models have lower detection accuracy than humans, hateful memes detection remains a challenge for statistical learning and artificial intelligence. This paper proposes a multi-task learning method consisting of a primary multimodal task and two unimodal auxiliary tasks to address this issue. We introduced a self-supervised generation strategy in the auxiliary tasks to generate unimodal auxiliary labels automatically. Meanwhile, we used BERT and RESNET as the backbones for text and image classification, respectively, and then fused them with a late fusion method. In the training phase, the backward guidance technique and the adaptive weight adjustment strategy were used to capture the consistency and variability between different modalities, numerically improving the hateful memes detection accuracy as well as the generalization and robustness of the model. Experiments conducted on the Facebook AI multimodal hateful memes dataset show that our model outperformed the compared models in prediction accuracy.

1. Introduction

A meme is an element of a cultural or behavioral system transmitted from one person to another through imitation or other non-genetic means. Memes come in various types and formats, including but not limited to images, videos, and posts, and they are increasingly influential on social platforms. The vast number of memes on the Internet constitutes a conspicuous problem: memes not only express people's natural emotions but may also cause emotional harm. The most popular form of meme is an image containing text, which is the type we are interested in. Usually, an ordinary sentence or a picture carries no special emotional meaning on its own, but the combination of the two can become meaningful. Hateful memes thus emerge and are becoming an increasingly serious problem in modern society. People with malicious motives use such memes, with misleading content, hateful speech, and harmful images, to attack vulnerable or targeted groups.
Nowadays, social giants such as Facebook, Twitter, and Weibo are engaged in identifying hateful memes and removing thousands of them to protect users. However, it is impossible for humans to manually inspect every meme at the scale of the Internet. Researchers have explored statistical tools [1,2] and machine learning techniques [3,4] with optimization algorithms [5] to address this issue. The probability upper bounds of the generalization errors of simple models are well studied [6,7], but statisticians are still struggling to explain the generalization ability of large artificial neural networks [8]. Meanwhile, machines cannot understand contextual information as humans do, so detecting hateful memes remains a challenging problem for statistical learning and artificial intelligence. Owing to the development of sentiment analysis (hate being one emotion) and artificial intelligence, we can build our research on the work of previous researchers [9,10,11,12]. However, the available sentiment analysis methods have limited usefulness in practice, because hate is not always as easy to identify as other emotions and these methods do not explain generalization ability statistically. Most early studies on hateful memes focused on unimodal hateful text detection, classifying hateful, abusive, or offensive texts against individuals or groups according to gender, nationality, or sexual orientation [13,14]. These studies on hate detection are enlightening, but they cannot handle hateful memes detection, which combines visual and textual elements. In addition, some hateful attacks against specific groups are very subtle. To further improve the accuracy of detecting hateful memes, we have to extend these methods to multimodal learning.
Baltrušaitis et al. [15] argued that the main difficulties and challenges in multimodal learning are representation, translation, alignment, fusion, and co-learning, with representation learning arguably having the most critical impact. According to the type of guidance used in representation learning, existing methods can be divided into forward guidance and backward guidance. Forward guidance projects unimodal representations together into a shared subspace [16], with an interaction module for obtaining information from different modalities [10,17,18,19]. However, the uniformity of multimodal labels makes it difficult to capture information within a single modality. Backward guidance adds extra regularization terms to the optimization objectives [20] to guide feature learning by gradient descent and thus learn the variability across modalities [21,22]; we prefer this approach.
Multi-task learning is a machine learning paradigm that learns multiple related tasks jointly and leverages the useful information contained in them [23]. It can learn multiple related tasks simultaneously and maximize the use of information from each modality in multimodal data, so it can be further used to enhance the accuracy of hateful memes detection. Usually, multi-task learning is designed with a primary classification task and some auxiliary tasks to enhance the feature learning capability. However, this requires independent labels for the auxiliary tasks, which are time-consuming and labor-intensive to obtain by manual labeling [22]. Yu et al. [24] designed a self-supervised unimodal label generation module to overcome this problem. This method automatically obtains appropriate labels without requiring access to any further data.
Thus, we propose a new idea for detecting hateful memes: a multi-task learning method that balances the unilateral information extracted from each modality separately against the fuzzy information from the multimodal fusion, without introducing further data or manual labels, and that reduces the generalization errors. We constructed a primary task to learn multimodal features and classify hateful memes. Meanwhile, two auxiliary tasks were used in the training phase to learn unimodal features and classify the hatefulness of the text and the image. Moreover, we used two self-supervised label generation modules to generate unimodal labels for the auxiliary tasks automatically. Finally, we applied our method to the Facebook AI hateful memes dataset [25] and achieved competitive results. In contrast with previous works, the main contributions of this work are as follows:
  • A new artificial intelligence model is proposed for hateful memes detection. It effectively improves the detection accuracy, outperforming the compared models.
  • The multi-task strategy and adaptive weight adjustment strategy used in our model captured the consistency and variability between different modalities and numerically improved the generalization and robustness of the model.
  • Our auxiliary tasks using self-supervised unimodal auxiliary label generation module enhanced the feature learning capability without human-defined labels or additional data.
The remaining part of this paper is organized as follows. Section 2 introduces related works. Section 3 shows our hateful memes detection model’s framework and algorithm. Next, experiments with real data and their results are presented in Section 4. Section 5 summarizes this work.

3. Method

This paper aims to design a model that balances the unilateral information extracted from each modality separately against the fuzzy information from the multimodal fusion, without introducing further data or manual labels, while reducing the generalization errors. Nowadays, modern methods for predicting and understanding data are rooted in both statistical and computational thinking, and algorithmics are put on an equal footing with intuition, properties, and the abstract arguments behind them [55]. We therefore propose a new hateful memes detection method combining statistical theory with modern neural networks and optimization algorithms, and we describe it in detail in this section.
First, we introduce the setup of our model to illustrate its inputs and outputs. Next, we construct the multi-task learning model with a primary multimodal task and two unimodal auxiliary tasks to capture the consistency and variability between different modalities. As the dataset only provides manually annotated labels (y_m) for the primary task, we adopt a self-supervised method [24] to generate the unimodal labels (y_u). We then design an adaptive weight in the objective function to optimize this model and reduce the generalization errors. In the following, we call multimodal labels m-labels and unimodal labels u-labels, where u ∈ {t, v}.

3.1. Setup

Hateful memes detection is a binary classification task that uses text and image signals to judge whether a meme is hateful. Our model takes the processed text I_t and image I_v as inputs and outputs the hateful intensity ŷ_m ∈ ℝ. In addition to the primary multimodal classification output ŷ_m, two unimodal auxiliary task outputs ŷ_t and ŷ_v are also produced to improve accuracy in the training phase. Obviously, ŷ_m is the final result we are interested in.

3.2. Architecture

We designed a multi-task learning model that can generate auxiliary labels in a self-supervised way to detect multimodal hateful memes, as shown in Figure 3. The network consists of a primary multimodal task using BERT and RESNET to extract features and two unimodal auxiliary tasks that share the bottom feature-learning network via hard parameter sharing.
Figure 3. The architecture of our method. y_m is the manually annotated multimodal label in the dataset, and y_t, y_v are the auxiliary labels generated by the self-supervised label generation module for the unimodal text and image auxiliary tasks, respectively. ŷ_m is the predicted output of the primary multimodal task, and ŷ_t, ŷ_v are the predicted outputs of the unimodal text and image auxiliary tasks, respectively.
The primary task is a multimodal classification network, which consists of three steps: feature extraction, feature fusion, and classification output. Pre-trained models have performed very well in recent years, so we used two pre-trained models as the backbones of the two unimodal branches in the hateful memes detection task.
For text processing, we use the pre-trained twelve-layer BERT [37] to extract the text feature F_t:
F_t = \mathrm{BERT}(I_t; \theta_t^{\mathrm{bert}}),
where I_t is the text input and θ_t^{bert} denotes all the parameters of the BERT model we used.
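As a concrete illustration, a minimal sketch of this step using the Hugging Face transformers implementation of the twelve-layer bert-base model is given below; taking the pooled [CLS] output as F_t is our assumption, since the exact pooled representation is not specified here.
```python
# Sketch of text feature extraction; assumes the Hugging Face `transformers`
# package and uses the pooled [CLS] output of bert-base as F_t (an assumption).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")  # parameters theta_t^bert
bert.eval()

def extract_text_feature(text: str) -> torch.Tensor:
    """Return F_t for one meme's text I_t."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.pooler_output.squeeze(0)  # 768-dimensional text feature
```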
For image processing, we use the pre-trained RESNET101 [40] to extract the image feature F_v:
F_v = \mathrm{RESNET}(I_v; \theta_v^{\mathrm{resnet}}),
where I_v is the image input and θ_v^{resnet} denotes all the parameters of the RESNET we used.
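Analogously, the image feature can be obtained by removing the classification head of a pretrained ResNet-101; the sketch below, based on torchvision, uses the 2048-dimensional globally pooled vector as F_v, which again is our assumption about the exact feature taken.
```python
# Sketch of image feature extraction with torchvision's pretrained ResNet-101;
# the final fully connected layer is replaced so the 2048-d pooled vector is F_v.
import torch
import torch.nn as nn
from PIL import Image
from torchvision import models, transforms

resnet = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
resnet.fc = nn.Identity()  # keep the global average-pooled feature
resnet.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_image_feature(path: str) -> torch.Tensor:
    """Return F_v for one meme image I_v."""
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return resnet(image).squeeze(0)  # 2048-dimensional image feature
```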
Then, the text and image representations are concatenated as F_m = [F_t; F_v] and projected onto a low-dimensional space:
F_m^{*} = \sigma(W_1^m F_m + b_1^m),
where W_1^m and b_1^m are the parameters of the first linear layer in the primary multimodal task and σ is the activation function.
After that, we use the fused representation obtained from the linear layer and activation function to detect whether the meme is hateful:
\hat{y}_m = W_2^m F_m^{*} + b_2^m,
where W_2^m \in \mathbb{R}^{d_m \times 1} and b_2^m are the parameters of the second linear layer in the primary multimodal task.
The auxiliary tasks are two unimodal classification tasks that detect the presence of hateful sentiment in text and images, respectively. We project the unimodal features into a new feature space, which reduces the impact of the dimensional difference between different modalities. Moreover, the text and image auxiliary classification tasks share modal features with the primary multimodal classification task.
F_u^{*} = \sigma(W_1^u F_u + b_1^u),
where u ∈ {t, v}, and W_1^u and b_1^u are the parameters of the first linear layer in the unimodal auxiliary task.
Then, the results of unimodal auxiliary tasks are obtained by
\hat{y}_u = W_2^u F_u^{*} + b_2^u,
where u ∈ {t, v}, and W_2^u and b_2^u are the parameters of the second linear layer in the unimodal auxiliary task.
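Putting the projection layers and the output layers together, the task-specific heads can be sketched as a single PyTorch module as follows; the feature dimensions (768 for BERT, 2048 for ResNet-101) and the shared hidden size of 128 are illustrative assumptions, not the reported configuration.
```python
# Sketch of the primary and auxiliary heads described by the equations above.
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    def __init__(self, d_t: int = 768, d_v: int = 2048, d_hidden: int = 128):
        super().__init__()
        self.proj_m = nn.Linear(d_t + d_v, d_hidden)  # W_1^m, b_1^m
        self.out_m = nn.Linear(d_hidden, 1)           # W_2^m, b_2^m
        self.proj_t = nn.Linear(d_t, d_hidden)        # W_1^t, b_1^t
        self.out_t = nn.Linear(d_hidden, 1)           # W_2^t, b_2^t
        self.proj_v = nn.Linear(d_v, d_hidden)        # W_1^v, b_1^v
        self.out_v = nn.Linear(d_hidden, 1)           # W_2^v, b_2^v
        self.act = nn.ReLU()                          # sigma

    def forward(self, F_t: torch.Tensor, F_v: torch.Tensor):
        F_m = torch.cat([F_t, F_v], dim=-1)           # F_m = [F_t; F_v]
        F_m_star = self.act(self.proj_m(F_m))
        F_t_star = self.act(self.proj_t(F_t))
        F_v_star = self.act(self.proj_v(F_v))
        y_m = self.out_m(F_m_star).squeeze(-1)        # primary prediction
        y_t = self.out_t(F_t_star).squeeze(-1)        # text auxiliary prediction
        y_v = self.out_v(F_v_star).squeeze(-1)        # image auxiliary prediction
        return y_m, y_t, y_v, F_m_star, F_t_star, F_v_star
```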

3.3. Unimodal Label Generation Module

Since we need corresponding labels to guide the training of the two unimodal auxiliary tasks, and manual labeling is too costly, we adopt a self-supervised label generation strategy to obtain the u-labels. We call this module the “Unimodal Label Generation Module” (ULGM), that is
y_u = \mathrm{ULGM}(y_m, F_m^{*}, F_u^{*}),
where u ∈ {t, v}.
The ULGM generates labels for the unimodal auxiliary tasks based on the multimodal labels and the features of each modality. The module has no trainable parameters, which makes it a stand-alone component with no impact on the multi-task network. Based on the fact that unimodal labels are closely related to multimodal labels, the module calculates an offset value from the distances between each modal representation and the centers of the hateful and non-hateful classes.
Here, we calculate relative distances rather than absolute distance values, which overcomes the error introduced by different modal features lying in different feature spaces. First, we keep the center of the hateful class (C_k^h) and the center of the not-hateful class (C_k^n) fixed for the different modal features during the training phase. The hateful and not-hateful class centers are defined as:
C_k^h = \frac{\sum_{j=1}^{N} I(y_k^j > c) \cdot F_{kj}^{g}}{\sum_{j=1}^{N} I(y_k^j > c)}, \qquad C_k^n = \frac{\sum_{j=1}^{N} I(y_k^j < c) \cdot F_{kj}^{g}}{\sum_{j=1}^{N} I(y_k^j < c)},
where k ∈ {m, t, v}, N is the sample size of the training set, I(·) is an indicator function, F_kj^g is the global representation of the j-th sample in modality k, and c is a threshold value, which we set to 0.5 in our experiments.
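A minimal sketch of this class-center computation is shown below; it assumes the global representations F_k^g and the labels of all training samples for one modality are available as tensors, with c = 0.5.
```python
# Sketch of the hateful / not-hateful class centers for one modality k,
# assuming F_g holds the global representations of all N training samples.
import torch

def class_centers(F_g: torch.Tensor, y: torch.Tensor, c: float = 0.5):
    """F_g: (N, d_k) representations, y: (N,) labels; returns (C_k^h, C_k^n)."""
    hateful = y > c
    C_h = F_g[hateful].mean(dim=0)    # center of the hateful class
    C_n = F_g[~hateful].mean(dim=0)   # center of the not-hateful class
    return C_h, C_n
```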
Then we use the L2 norm to calculate the distance between features and the hateful/not-hateful class centers, that is
D_k^h = \frac{\| F_k^{*} - C_k^h \|_2^2}{d_k}, \qquad D_k^n = \frac{\| F_k^{*} - C_k^n \|_2^2}{d_k},
where k ∈ {m, t, v} and d_k is a scaling factor representing the feature dimension of modality k.
After the above calculations, we can compute the relative distance α_k between the modality representation and the hateful/not-hateful class centers as
\alpha_k = \frac{D_k^n - D_k^h}{D_k^h + \epsilon},
where k ∈ {m, t, v} and ε is a very small constant used to avoid division by zero.
Obviously, α_k is positively related to y_k, so the ratio relationship between y_u and y_m can be summarized as:
\frac{y_u}{y_m} \approx \frac{\hat{y}_u}{\hat{y}_m} \approx \frac{\alpha_u}{\alpha_m} \;\Longrightarrow\; y_u = \frac{\alpha_u \cdot y_m}{\alpha_m}.
To avoid the “zero value problem”, the difference relationship between y_u and y_m should also be considered, which means:
(y_u - y_m) \approx (\hat{y}_u - \hat{y}_m) \approx (\alpha_u - \alpha_m) \;\Longrightarrow\; y_u = y_m + \alpha_u - \alpha_m.
By an equal-weight summation of Equations (4) and (5), we obtain the unimodal supervision values as follows:
y_u = \frac{1}{2}\left( \frac{y_m \cdot \alpha_u}{\alpha_m} + y_m + \alpha_u - \alpha_m \right) = y_m + \frac{\alpha_u - \alpha_m}{2} \cdot \frac{y_m + \alpha_m}{\alpha_m} = y_m + \delta_{um},
where u ∈ {t, v} and δ_{um} is the offset of the unimodal supervision value from the given multimodal label.
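The whole ULGM computation then reduces to a few lines; the sketch below follows the equations above, with a small ε added to the denominators to avoid division by zero (the extra ε in the offset term is our own safeguard, not part of the stated formula).
```python
# Sketch of the ULGM: generate the unimodal label y_u = y_m + delta_um from the
# multimodal label and the fused / unimodal representations of a batch.
import torch

def relative_distance(F_star, C_h, C_n, eps=1e-8):
    """alpha_k for a batch of representations F_star of one modality."""
    d_k = F_star.shape[-1]
    D_h = (F_star - C_h).pow(2).sum(dim=-1) / d_k
    D_n = (F_star - C_n).pow(2).sum(dim=-1) / d_k
    return (D_n - D_h) / (D_h + eps)

def ulgm(y_m, F_m_star, F_u_star, centers_m, centers_u, eps=1e-8):
    """centers_* are (C^h, C^n) pairs; y_m is the batch of multimodal labels."""
    alpha_m = relative_distance(F_m_star, *centers_m, eps)
    alpha_u = relative_distance(F_u_star, *centers_u, eps)
    delta_um = (alpha_u - alpha_m) / 2 * (y_m + alpha_m) / (alpha_m + eps)
    return y_m + delta_um  # generated unimodal label y_u
```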

3.4. Optimization Objectives

In a binary classification task, since there are only positive and negative cases whose probabilities sum to 1, it is not necessary to predict a vector; a single probability suffices. We choose the binary cross-entropy loss as the base optimization objective, and the loss function is written in the simplified form:
\mathrm{loss}_k = -\left[ y_k \cdot \log(\hat{y}_k) + (1 - y_k) \cdot \log(1 - \hat{y}_k) \right],
where k ∈ {m, t, v}.
As the hateful memes data are complicated, with two modalities, we designed a multi-task learning scheme for the statistical inference. When we optimize the model, the extracted information may be fuzzy if we pay too much attention to the multimodal part; however, if we pay too much attention to the unimodal parts, the extracted information may be too unilateral and weaken the primary task. In addition, the gradient magnitudes of the backpropagated losses of the several tasks may differ. When backpropagating to the shared bottom layers, a task with a small gradient magnitude contributes less to the parameter updates, so the shared bottom does not learn enough for that task. Of course, we could simply introduce static weights to balance the gradients of the different tasks, but this does not work well: a fixed small weight assigned to a task whose gradient magnitude is large at the beginning of training would still limit that task at the end of training, leaving it insufficiently learned and increasing the generalization errors [56,57]. Meanwhile, the information carried by different samples may have different intensities. Suppose the difference between the multimodal label y_m^{(i)} and the generated unimodal label y_u^{(i)} is large. In that case, the results from the different modalities diverge, and we should impose a larger weight on this sample to learn more from it. Therefore, a data-driven weight should be imposed on different samples so that the objective function can be adaptively adjusted to balance the learning process.
Thus, we use the absolute difference between the generated unimodal label and the existing multimodal label, |y_u^{(i)} - y_m|, as the measure for weight adjustment. As we want larger adjustments for samples with large distances and only slight adjustments for samples with small distances, an 'S'-shaped function may be preferred, such as tanh(·), Elliott(·), arctan(·), or logit(·). We chose tanh(·) here because its rapid change gives more adjustment to the samples with large distances, so the weight of the i-th sample for auxiliary task u can be expressed as ω_u^i = tanh(|y_u^{(i)} - y_m|). The optimization objective is then
L = \frac{1}{N} \sum_{j=1}^{N} \left( \mathrm{loss}_m^j + \sum_{u \in \{t, v\}} \omega_u^j \, \mathrm{loss}_u^j \right),
where N is the sample size, loss_m^j is the binary cross-entropy loss between the multimodal label and the multimodal prediction of the j-th sample, and loss_u^j is the binary cross-entropy loss between the self-supervised generated unimodal label and the unimodal prediction of the j-th sample.
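A sketch of this adaptively weighted objective is given below. It assumes the model outputs raw logits and therefore uses PyTorch's binary cross-entropy with logits; clamping the generated labels to [0, 1] before using them as targets is our own safeguard assumption, and all names are illustrative.
```python
# Sketch of the overall objective L: BCE for the primary task plus adaptively
# weighted BCE for the two auxiliary tasks (omega_u = tanh(|y_u - y_m|)).
import torch
import torch.nn.functional as F

def multitask_loss(y_hat_m, y_hat_t, y_hat_v, y_m, y_t, y_v):
    loss_m = F.binary_cross_entropy_with_logits(y_hat_m, y_m, reduction="none")
    total = loss_m
    for y_hat_u, y_u in ((y_hat_t, y_t), (y_hat_v, y_v)):
        w_u = torch.tanh((y_u - y_m).abs())            # per-sample omega_u
        target = y_u.clamp(0.0, 1.0)                   # safeguard (assumption)
        loss_u = F.binary_cross_entropy_with_logits(y_hat_u, target,
                                                    reduction="none")
        total = total + w_u * loss_u
    return total.mean()                                # average over the batch
```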
Because the modal representations change dynamically during training, the generated auxiliary labels are unstable. To mitigate this disadvantage, a momentum update strategy is introduced:
y_u^{(i)} =
\begin{cases}
y_m, & i = 1, \\
\dfrac{i-1}{i+1}\, y_u^{(i-1)} + \dfrac{2}{i+1}\, y_u^{(i)}, & i > 1,
\end{cases}
where u ∈ {t, v}, i denotes the i-th training epoch [58], and the y_u^{(i)} on the right-hand side is the label newly generated at epoch i.
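The momentum update can be implemented by storing the label kept in the previous epoch; a minimal sketch under that assumption follows.
```python
# Sketch of the momentum update for the generated unimodal labels: the label
# kept at epoch i mixes the stored label from epoch i-1 with the newly
# generated one, with weights (i-1)/(i+1) and 2/(i+1).
def momentum_update(y_prev, y_new, epoch: int, y_m):
    """y_prev: label stored at epoch i-1; y_new: label generated at epoch i."""
    if epoch == 1:
        return y_m  # initialize with the multimodal label
    return (epoch - 1) / (epoch + 1) * y_prev + 2 / (epoch + 1) * y_new
```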
Finally, supervised by the m-labels in the dataset and the u-labels generated by the self-supervised module, the final result ŷ_m for detecting whether each meme is hateful can be obtained. Overall, the entire training algorithm of our model (Algorithm 1) is summarized as follows:
Algorithm 1: The algorithm of our model in the training stage [24].
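Algorithm 1 is presented as a figure in the published article; to make the training stage concrete, a minimal PyTorch-style sketch of one training epoch is given below. It reuses the illustrative helpers sketched earlier (MultiTaskHeads, ulgm, momentum_update, multitask_loss), assumes precomputed BERT/RESNET features and fixed class centers passed in from outside, and is not the authors' exact implementation.
```python
# Sketch of one training epoch of the multi-task model, tying together the
# illustrative helpers defined above.
import torch

def train_one_epoch(heads, loader, optimizer, centers, stored_labels, epoch):
    heads.train()
    for batch_idx, (F_t, F_v, y_m) in enumerate(loader):
        y_m = y_m.float()
        # forward pass: primary prediction, auxiliary predictions, projections
        y_hat_m, y_hat_t, y_hat_v, F_m_s, F_t_s, F_v_s = heads(F_t, F_v)

        # self-supervised unimodal labels, stabilized by the momentum update
        with torch.no_grad():
            y_t_new = ulgm(y_m, F_m_s, F_t_s, centers["m"], centers["t"])
            y_v_new = ulgm(y_m, F_m_s, F_v_s, centers["m"], centers["v"])
            y_t = momentum_update(stored_labels["t"][batch_idx], y_t_new, epoch, y_m)
            y_v = momentum_update(stored_labels["v"][batch_idx], y_v_new, epoch, y_m)
            stored_labels["t"][batch_idx], stored_labels["v"][batch_idx] = y_t, y_v

        # adaptively weighted multi-task loss and parameter update
        loss = multitask_loss(y_hat_m, y_hat_t, y_hat_v, y_m, y_t, y_v)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```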

4. Experiments

4.1. Dataset

To validate the performance of our model, we chose the hateful memes dataset from the “Hateful Memes Challenge” [25] published by Facebook AI as our experimental dataset. It contains over 10,000 strictly labeled memes, where each meme is manually labeled as hateful or not according to a strict definition. The researchers carefully designed each meme and confounded the hateful memes with benign memes through methods such as “benign confounders”, as shown in Figure 4. These subtle designs make each meme challenging to detect accurately with unimodal methods and require reasoning about both text and image to obtain accurate detection results.
Figure 4. Example pictures in the experimental dataset. The memes in the first column are all hateful memes, the second column replaces only their images to make them not hateful, and the third column replaces only their text to make them not hateful.

4.2. Compared Models

We compared our model with the advanced unimodal and multimodal models described in [59]. All models can be classified into two categories: unimodal models and multimodal models.
Unimodal models include image and text classification models. The image classification models are Image-Grid and Image-Region, which use different features: the features of Image-Grid are ResNet-152 [40] convolutional features based on res-5c with average pooling, while the features of Image-Region come from the fc6 layer of a Faster-RCNN [60] based on ResNeXt-152. The text classification model is the twelve-layer BERT.
Multimodal models include Late Fusion, Concat BERT, MMBT-Grid, MMBT-Region, ViLBERT, and VisualBERT. Late Fusion is a model that fuses the outputs of the unimodal text model BERT and the unimodal image model ResNet-152 by simply averaging them. Concat BERT is a model that concatenates the unimodal image model ResNet-152 features with the unimodal text model BERT features. MMBT-Grid and MMBT-Region are both supervised multimodal transformer models, the former using Image-Grid features and the latter using Image-Region features. VisualBERT [49] is a single-stream model in which the text and image features are fused at the beginning of the model. ViLBERT [50] is a dual-stream model, in which text and image features are first passed through two separate encoding modules and then fused through a co-attention mechanism. Both ViLBERT and VisualBERT can be pretrained on unimodal or multimodal datasets. We use VisualBERT and ViLBERT with unimodal pretraining; VisualBERT COCO is VisualBERT pretrained on the multimodal dataset COCO [61], and ViLBERT CC is ViLBERT pretrained on the multimodal dataset Conceptual Captions [62].

4.3. Results

We compared the results of our model with those of the unimodal and multimodal models on the hateful memes dataset. The activation function in our model was ReLU, and the threshold value used to calculate the hateful/not-hateful class centers was set to 0.5. The results of the compared models on the dataset were taken from [59]. The performance of the unimodal models is generally less satisfactory, and the unimodal text model outperformed the unimodal image model, reflecting the fact that the text features may contain more information. The multimodal models outperformed the unimodal models. We also found that the fusion method affects performance: models using early fusion outperformed those using late fusion. Regarding pretraining, there was little difference between the multimodally pretrained and the unimodally pretrained models.
In contrast to the models mentioned above, our model used a late fusion method and two unimodal pre-trained models. Although late fusion generally performed worse than early fusion, our model outperformed those early fusion models, thanks to the additional auxiliary learning; this validates the idea that adding multi-task learning to hateful memes detection can improve the accuracy of the task. Moreover, fusing different unimodal pre-trained models with our method may be helpful for similar tasks in future studies. The prediction accuracy results of these models are presented in Table 1.
Table 1. The prediction accuracy of different models on the “Hateful Memes Challenge” data set.

4.4. Ablation Study

We added self-supervised multi-task learning with generated auxiliary labels to the task of hateful memes detection, which greatly improved its accuracy. However, we wanted to further investigate the effect of each unimodal auxiliary task on the overall model. Therefore, we set up an experiment that adds each unimodal auxiliary task separately and compares the results in Table 2.
Table 2. The prediction accuracy of the multi-task learning models with the addition of different unimodal auxiliary tasks.
These results indicate that the accuracy of the multi-task model with only the unimodal textual auxiliary task or only the unimodal visual auxiliary task is very similar for hateful memes detection. Furthermore, both results were also very close to that of the multimodal-only model, which shows that the accuracy of detecting hateful memes can hardly be improved by adding a single unimodal auxiliary task alone. In contrast, the multi-task learning model was greatly enhanced by adding both the unimodal textual auxiliary task and the unimodal visual auxiliary task. Moreover, all the cases optimized with equal weights (ω_u^j = 1) performed worse than the same model using the adaptive weight adjustment strategy. In conclusion, the multi-task learning and the adaptive weight adjustment strategy helped improve the testing accuracy and reduce the generalization errors.

5. Conclusions

Our research aims to improve the accuracy and reduce the generalization errors of detecting hateful memes, which are widely available on the Internet and have severe negative impacts. For this purpose, we selected the multimodal dataset of hateful memes published by Facebook AI as our experimental dataset, and we designed a multi-task learning model that generates auxiliary labels in a self-supervised manner. The text classification model BERT and the image classification model RESNET were selected as the backbones, and a late fusion method was used. In the multi-task learning network, we added two unimodal auxiliary learning tasks, the textual and the visual auxiliary task, to the primary classification task. To solve the problem of missing labels for the unimodal auxiliary tasks and the high cost of manual labeling, we chose a self-supervised label generation strategy for the auxiliary tasks. In the optimization phase, we added a data-driven adaptive weight adjustment strategy to balance the learning process and reduce the generalization errors. By comparing our multi-task learning model with various advanced models for the detection of hateful memes, we found that our model achieved more accurate results.
In the ablation experiments, we also found that it is difficult to improve the accuracy of the final classification by simply adding a single unimodal auxiliary task to the multi-task learning network; both the text and image auxiliary tasks should be introduced to achieve better results. In addition to its good performance, our method can easily be extended to fuse other unimodal models to solve similar problems. Although our experiments achieved good results, there is still much room for improvement: our model and existing multimodal models are still far from reaching human accuracy (84.7%) on this task. We are trying to improve the accuracy of hateful memes detection from other perspectives, one being the adaptability of the backbone model and the multi-task learning network, and another being the feature fusion methods.

Author Contributions

Conceptualization, Y.Z. and S.Y.; software, Z.M. and S.G.; validation, Y.Z.; formal analysis, Z.M., L.W. and S.G.; investigation, Y.Z. and L.W.; resources, Y.Z., L.W. and S.Y.; writing—original draft preparation, Z.M. and Y.Z.; writing—review and editing, Y.Z. and S.G.; visualization, Z.M.; supervision, Y.Z. and S.Y.; project administration, Y.Z., L.W. and S.Y.; funding acquisition, Y.Z. and S.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 61863036), the China Postdoctoral Science Foundation (No. 2021M702778), the Fundamental Research Funds for the Central Universities (No. 2042022KF0021), and the Fundamental Research Plan of “Release Management Service” in Yunnan Province: Research on Multi-source Data Platform and Situation Awareness Application for Cross-border Cyberspace Security (No. 202001BB050076).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The datasets generated and analysed are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Devroye, L.; Györfi, L.; Lugosi, G. A Probabilistic Theory of Pattern Recognition; Springer Science & Business Media: New York, NY, USA, 2013; Volume 31. [Google Scholar]
  2. Fan, J.; Li, R.; Zhang, C.H.; Zou, H. Statistical Foundations of Data Science; Chapman and Hall/CRC: New York, NY, USA, 2020. [Google Scholar]
  3. Hastie, T.; Tibshirani, R.; Friedman, J.H.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: Berlin/Heidelberg, Germany, 2009; Volume 2. [Google Scholar]
  4. Mohri, M.; Rostamizadeh, A.; Talwalkar, A. Foundations of Machine Learning; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  5. Bertsekas, D.P. Nonlinear programming. J. Oper. Res. Soc. 1997, 48, 334. [Google Scholar] [CrossRef]
  6. Tewari, A.; Bartlett, P.L. On the Consistency of Multiclass Classification Methods. J. Mach. Learn. Res. 2007, 8, 1007–1025. [Google Scholar]
  7. Zhang, T. Statistical analysis of some multi-category large margin classification methods. J. Mach. Learn. Res. 2004, 5, 1225–1251. [Google Scholar]
  8. Vapnik, V.N. An overview of statistical learning theory. IEEE Trans. Neural Netw. 1999, 10, 988–999. [Google Scholar] [CrossRef] [PubMed]
  9. Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.P. Tensor fusion network for multimodal sentiment analysis. arXiv 2017, arXiv:1707.07250. [Google Scholar]
  10. Tsai, Y.H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.P.; Salakhutdinov, R. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; NIH Public Access: Bethesda, MD, USA, 2019; Volume 2019, p. 6558. [Google Scholar]
  11. Poria, S.; Hazarika, D.; Majumder, N.; Mihalcea, R. Beneath the tip of the iceberg: Current challenges and new directions in sentiment analysis research. IEEE Trans. Affect. Comput. 2020, 1. [Google Scholar] [CrossRef]
  12. Bartlett, P.L.; Jordan, M.I.; McAuliffe, J.D. Convexity, classification, and risk bounds. J. Am. Stat. Assoc. 2006, 101, 138–156. [Google Scholar] [CrossRef]
  13. i Orts, Ò.G. Multilingual detection of hate speech against immigrants and women in Twitter at SemEval-2019 task 5: Frequency analysis interpolation for hate in speech detection. In Proceedings of the 13th International Workshop on Semantic Evaluation, Minneapolis, MN, USA, 6–7 June 2019; pp. 460–463. [Google Scholar]
  14. Burnap, P.; Williams, M.L. Hate speech, machine classification and statistical modelling of information flows on Twitter: Interpretation and communication for policy decision making. In Proceedings of the Internet, Policy & Politics Conference, Oxford, UK, 26 September 2014. [Google Scholar]
  15. Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443. [Google Scholar] [CrossRef]
  16. Guo, W.; Wang, J.; Wang, S. Deep multimodal representation learning: A survey. IEEE Access 2019, 7, 63373–63394. [Google Scholar] [CrossRef]
  17. Zadeh, A.; Liang, P.P.; Mazumder, N.; Poria, S.; Cambria, E.; Morency, L.P. Memory fusion network for multi-view sequential learning. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  18. Sun, Z.; Sarma, P.; Sethares, W.; Liang, Y. Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 8992–8999. [Google Scholar]
  19. Rahman, W.; Hasan, M.K.; Lee, S.; Zadeh, A.; Mao, C.; Morency, L.P.; Hoque, E. Integrating multimodal information in large pretrained transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), online, 5–10 July 2020; NIH Public Access: Bethesda, MD, USA, 2020; Volume 2020, p. 2359. [Google Scholar]
  20. Wang, S.; Zhang, H.; Wang, H. Object co-segmentation via weakly supervised data fusion. Comput. Vis. Image Underst. 2017, 155, 43–54. [Google Scholar] [CrossRef]
  21. Hazarika, D.; Zimmermann, R.; Poria, S. Misa: Modality-invariant and-specific representations for multimodal sentiment analysis. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1122–1131. [Google Scholar]
  22. Yu, W.; Xu, H.; Meng, F.; Zhu, Y.; Ma, Y.; Wu, J.; Zou, J.; Yang, K. Ch-sims: A chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 3718–3727. [Google Scholar]
  23. Zhang, Y.; Yang, Q. A survey on multi-task learning. IEEE Trans. Knowl. Data Eng. 2021, 34, 5586–5609. [Google Scholar] [CrossRef]
  24. Yu, W.; Xu, H.; Yuan, Z.; Wu, J. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 10790–10797. [Google Scholar]
  25. Kiela, D.; Firooz, H.; Mohan, A.; Goswami, V.; Singh, A.; Ringshia, P.; Testuggine, D. The hateful memes challenge: Detecting hate speech in multimodal memes. Adv. Neural Inf. Process. Syst. 2020, 33, 2611–2624. [Google Scholar]
  26. Zadeh, A.B.; Liang, P.P.; Poria, S.; Cambria, E.; Morency, L.P. Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; Volume 1: Long Papers, pp. 2236–2246. [Google Scholar]
  27. Gomez, R.; Gibert, J.; Gomez, L.; Karatzas, D. Exploring hate speech detection in multimodal publications. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 1470–1478. [Google Scholar]
  28. Warner, W.; Hirschberg, J. Detecting hate speech on the world wide web. In Proceedings of the Second Workshop on Language in Social Media, Montréal, QC, Canada, 7 June 2012; pp. 19–26. [Google Scholar]
  29. Djuric, N.; Zhou, J.; Morris, R.; Grbovic, M.; Radosavljevic, V.; Bhamidipati, N. Hate speech detection with comment embeddings. In Proceedings of the 24th International Conference on World Wide Web, Florence, Italy, 18–22 May 2015; pp. 29–30. [Google Scholar]
  30. Chen, Y. Convolutional Neural Network for Sentence Classification. Master’s Thesis, University of Waterloo, Waterloo, ON, Canada, 2015. [Google Scholar]
  31. Waseem, Z.; Davidson, T.; Warmsley, D.; Weber, I. Understanding abuse: A typology of abusive language detection subtasks. arXiv 2017, arXiv:1705.09899. [Google Scholar]
  32. Benikova, D.; Wojatzki, M.; Zesch, T. What does this imply? Examining the impact of implicitness on the perception of hate speech. In Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology, Berlin, Germany, 13–14 September 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 171–179. [Google Scholar]
  33. Wiegand, M.; Siegel, M.; Ruppenhofer, J. Overview of the germeval 2018 shared task on the identification of offensive language. In Proceedings of the 14th Conference on Natural Language Processing (KONVENS 2018), Vienna, Austria, 21 September 2018. [Google Scholar]
  34. Kumar, R.; Ojha, A.K.; Malmasi, S.; Zampieri, M. Benchmarking aggression identification in social media. In Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), Santa Fe, NM, USA, 25 August 2018; pp. 1–11. [Google Scholar]
  35. Nobata, C.; Tetreault, J.; Thomas, A.; Mehdad, Y.; Chang, Y. Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web, Montreal, QC, Canada, 11–15 May 2016; pp. 145–153. [Google Scholar]
  36. Aggarwal, P.; Horsmann, T.; Wojatzki, M.; Zesch, T. LTL-UDE at SemEval-2019 Task 6: BERT and two-vote classification for categorizing offensiveness. In Proceedings of the 13th International Workshop on Semantic Evaluation, Minneapolis, MN, USA, 6–7 June 2019; pp. 678–682. [Google Scholar]
  37. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  38. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; NIPS: La Jolla, CA, USA, 2017; Volume 30. [Google Scholar]
  39. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  40. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  41. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  42. Sabat, B.O.; Ferrer, C.C.; Giro-i Nieto, X. Hate speech in pixels: Detection of offensive memes towards automatic moderation. arXiv 2019, arXiv:1910.02334. [Google Scholar]
  43. Liu, K.; Li, Y.; Xu, N.; Natarajan, P. Learn to combine modalities in multimodal deep learning. arXiv 2018, arXiv:1805.11730. [Google Scholar]
  44. Chen, Y.C.; Li, L.; Yu, L.; El Kholy, A.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. Uniter: Universal image-text representation learning. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 104–120. [Google Scholar]
  45. Li, X.; Yin, X.; Li, C.; Zhang, P.; Hu, X.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 121–137. [Google Scholar]
  46. Su, W.; Zhu, X.; Cao, Y.; Li, B.; Lu, L.; Wei, F.; Dai, J. Vl-bert: Pre-training of generic visual-linguistic representations. arXiv 2019, arXiv:1908.08530. [Google Scholar]
  47. Aken, B.v.; Winter, B.; Löser, A.; Gers, F.A. Visbert: Hidden-state visualizations for transformers. In Proceedings of the Companion Proceedings of the Web Conference 2020, Taipei, Taiwan, 20–24 April 2020; pp. 207–211. [Google Scholar]
  48. Yu, F.; Tang, J.; Yin, W.; Sun, Y.; Tian, H.; Wu, H.; Wang, H. Ernie-vil: Knowledge enhanced vision-language representations through scene graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 3208–3216. [Google Scholar]
  49. Li, L.H.; Yatskar, M.; Yin, D.; Hsieh, C.J.; Chang, K.W. Visualbert: A simple and performant baseline for vision and language. arXiv 2019, arXiv:1908.03557. [Google Scholar]
  50. Lu, J.; Batra, D.; Parikh, D.; Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems; NIPS: La Jolla, CA, USA, 2019; Volume 32. [Google Scholar]
  51. Liu, W.; Mei, T.; Zhang, Y.; Che, C.; Luo, J. Multi-task deep visual-semantic embedding for video thumbnail selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3707–3715. [Google Scholar]
  52. Zhang, W.; Li, R.; Zeng, T.; Sun, Q.; Kumar, S.; Ye, J.; Ji, S. Deep model based transfer and multi-task learning for biological image analysis. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, 10–13 August 2015; pp. 1475–1484. [Google Scholar]
  53. Akhtar, M.S.; Chauhan, D.S.; Ghosal, D.; Poria, S.; Ekbal, A.; Bhattacharyya, P. Multi-task learning for multi-modal emotion recognition and sentiment analysis. arXiv 2019, arXiv:1905.05812. [Google Scholar]
  54. Chen, Z.; Badrinarayanan, V.; Lee, C.Y.; Rabinovich, A. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In Proceedings of the International Conference on Machine Learning (PMLR 2018), Stockholm, Sweden, 10–15 July 2018; pp. 794–803. [Google Scholar]
  55. Efron, B.; Hastie, T. Computer Age Statistical Inference, Student Edition: Algorithms, Evidence, and Data Science; Cambridge University Press: Cambridge, UK, 2021; Volume 6. [Google Scholar]
  56. Zhang, T. Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Stat. 2004, 32, 56–85. [Google Scholar] [CrossRef]
  57. Chen, D.R.; Sun, T. Consistency of multiclass empirical risk minimization methods based on convex loss. J. Mach. Learn. Res. 2006, 7, 2435–2447. [Google Scholar]
  58. Su, W.; Boyd, S.; Candes, E. A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems; NIPS: La Jolla, CA, USA, 2014; Volume 27. [Google Scholar]
  59. Sandulescu, V. Detecting hateful memes using a multimodal deep ensemble. arXiv 2020, arXiv:2012.13235. [Google Scholar]
  60. Mao, H.; Yao, S.; Tang, T.; Li, B.; Yao, J.; Wang, Y. Towards real-time object detection on embedded systems. IEEE Trans. Emerg. Top. Comput. 2016, 6, 417–431. [Google Scholar] [CrossRef]
  61. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  62. Mokady, R.; Hertz, A.; Bermano, A.H. Clipcap: Clip prefix for image captioning. arXiv 2021, arXiv:2111.09734. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
