Global Local Fusion Neural Network for Multimodal Sentiment Analysis

Abstract: With the popularity of social networking services, people are increasingly inclined to share their opinions and feelings on social networks, leading to a rapid increase in multimodal posts on various platforms. Multimodal sentiment analysis has therefore become a crucial research field for exploring users' emotions. The complex and complementary interactions between images and text greatly heighten the difficulty of sentiment analysis. Previous works performed only rough fusion operations and neglected fine fusion features for the sentiment task, and thus did not obtain sufficient interactive information. This paper proposes a global local fusion neural network (GLFN), which comprehensively considers global and local fusion features and aggregates them to analyze user sentiment. The model first extracts overall fusion features with attention modules as modality-based global features. Then, coarse-to-fine fusion learning is applied to build local fusion features effectively: a cross-modal module performs rough fusion, and fine-grained fusion captures the interaction between objects and words. Finally, we integrate all features to achieve a more reliable prediction. Extensive experimental results, comparisons, and visualizations on public datasets demonstrate the effectiveness of the proposed model for multimodal sentiment classification.


Introduction
The boom in mobile internet and smartphone access has made social networking an integral part of our daily lives. More and more people share their views and feelings through Twitter, Reddit, Weibo, and other social platforms, generating a large amount of social data. With the popularity of camera and video technology, data on these platforms have developed from a single text form to a combination of different media forms such as text, audio, and images; massive amounts of multimodal data, largely combinations of text and images, have thus been generated. Multimodal sentiment analysis aims to identify users' sentiment polarities, as well as their attitudes towards topics or events, from these different forms of data. As a core field of social media analysis, sentiment analysis has not only received extensive attention from academia [1,2] but also has broad commercial application prospects, such as personalized advertising [3], opinion mining [4], and decision making [5].
Compared with unimodal data, multimodal data contain more information and can better express users' real feelings. However, multimodal sentiment analysis is a challenging task: different modalities may carry different sentimental information, and their underlying features have different dimensions and attributes. Traditional methods for multimodal sentiment analysis are based on handcrafted features [6] in each modality, and their performance relies heavily on the quality of feature selection. Since handcrafted features are usually designed with limited human knowledge, it is difficult for them to comprehensively describe the highly abstract features of sentiment. In contrast, the model proposed in this paper adaptively models objects and scenes in images and words in sentences; finally, an integration network aggregates different levels of fusion information to comprehensively explore the multimodal sentiment. The main contributions of this paper are as follows:

1.
A global local fusion neural network (GLFN) is proposed for the multimodal sentiment analysis task; the model captures fusion information at global and local levels to deal with different types of text-image relationships in social media and integrates all fusion information to achieve a more comprehensive prediction.

2.
Symmetrical local fusion learning is introduced to effectively mine the modality-based corresponding information between image regions and text words. The fusion learning proceeds from coarse to fine: the cross-modal module aligns features between image content and words as coarse fusion, and fine-grained fusion enhances the correlation between related content to strengthen the interaction.

3.
Finally, extensive experiments are conducted on the public social media datasets MVSA [26]. Comparisons with prior models are carried out to demonstrate the effectiveness of the proposed method.
The rest of this paper is organized as follows: Section 2 gives a summary of related works on sentiment analysis. Section 3 explains the detail of the proposed approach. Experiments, including baseline comparisons, ablation studies, and visualization, are reported in Section 4. Finally, the conclusion is summarized in Section 5.

Unimodal Sentiment Analysis
Research on text sentiment analysis has been developed for many years. There are multiple levels of text sentiment analysis based on the length of the data, including document-level, sentence-level, and aspect-level sentiment analysis. The methods used to analyze these textual data can be divided into two categories: the lexicon-based approach and the machine learning approach. The lexicon-based technique is particularly well suited to sentence-level analysis. Park and Kim [27] used a dictionary-based approach to build a thesaurus lexicon for sentiment tasks; the method adopted several dictionaries to collect a thesaurus and stored co-occurring words, which improved the classification accuracy without using human resources. Pang et al. [28] first applied machine learning methods, Naive Bayes and maximum entropy, to movie review sentiment analysis. Moreover, some methods used a combination of lexicon and machine learning. Borg and Boldt [29] first applied VADER with a sentiment lexicon to provide the initial labels of emails and used a LinearSVM to train the model and predict the sentiment of a not-yet-seen email, thereby preparing specific actions for customers who may have negative reactions. Recently, deep neural networks have been widely employed in natural language processing, including sentiment analysis. An attention-based bidirectional CNN-RNN model was proposed by [30]; the model utilizes two independent BiLSTM and GRU layers to capture past and future contexts in both directions, and an attention mechanism gives weights to words for better performance.
Visual sentiment analysis primarily aims to explore the emotions associated with images. The first paper on image sentiment analysis was published in 2010 [31]; the authors built a correlation between image sentiment and visual content and performed a discriminative feature analysis to predict the image sentiment. Borth et al. [32] proposed a large-scale visual sentiment ontology (VSO) based on psychological theories and web mining, which consists of many adjective-noun pairs (ANPs), and then trained a sentiment classifier with ANP outputs for the sentiment task. Yang et al. [33] extracted emotions from images by leveraging all the related information, such as visual features, comments, and friendships; the model can distinguish the comments closely related to the images' emotional expression. Deep learning approaches have also been used in image sentiment analysis. Chen et al. [34] introduced a convolutional neural network for the classification task; they fine-tuned the model and improved the results significantly. Song et al. [35] presented a network with an attention mechanism, which integrates visual attention into the CNN sentiment classification framework in an end-to-end manner. Wu et al. [36] employed a model that fused salient object information in the image with complete image information to predict users' sentiment; the model shows that reasonably utilizing local information can improve performance.

Multimodal Sentiment Analysis and Pretrained Model
Multimodal sentiment analysis has attracted increasing attention in recent years. Baecchi et al. [37] proposed a semi-supervised model, CBOW-DA-LR, which extended the CBOW model, learning textual and visual vector representations concurrently to build a sentiment polarity classifier. Hu and Flaxman [38] used a fine-tuned Inception network and GloVe [39] word representations combined with an LSTM to extract high-level visual and textual features, and concatenated the features as the input of a dense layer for the sentiment task. Poria et al. [40] utilized a 1D CNN, an RNN, and a 2D CNN to obtain features from text, audio, and image, and applied a multiple-kernel learning classifier to fuse the multimodal information for sentiment classification. Many studies focus on the fusion method in the multimodal sentiment task. Zadeh et al. [41] proposed a tensor fusion network, which learned intra-modality and inter-modality dynamics end to end. Xu et al. [42] proposed a co-memory network with an attention mechanism, which captured the interaction of image and sentence iteratively to conduct the sentiment analysis. Jiang et al. [43] proposed the FENet model, which included interactive information fusion to learn the fusion features and specific information extraction to extract sentimental features for sentiment prediction. Hu et al. [44] introduced a two-stage attention-based fusion neural network to analyze textual-visual information for sentiment classification. Yang et al. [45] proposed a model based on a multi-view attention network, which combined object-text fusion and scene-text fusion to tackle the task. Li et al. [46] proposed a contrastive learning and multi-layer fusion network to detect sentiment. To build more effective correlations between image and text, object and scene extraction methods have been used to capture more details of the images. Zhu et al. [47] introduced an image-text interaction network to investigate the relationship between image and text for sentiment classification.
Although these studies obtained promising results, most models performed only coarse-grained fusion and did not fully exploit the relationship between image and text.
Recently, with sophisticated pre-training targets and huge parameter counts, large-scale pre-trained models have demonstrated significant performance in many fields. The early exploration of natural language processing (NLP) pre-trained models involved shallow networks, such as Word2Vec [48] and GloVe [39], which can represent the semantic meanings of words. With the development of NLP, Transformers [49] were proposed, and on top of the Transformer architecture, pre-trained models such as GPT and BERT [50] were built to tackle NLP tasks. The fine-tuned models achieved exciting results in language understanding and generation. Pre-trained models have also been used in computer vision (CV) tasks. With pre-trained ResNet [8] as a backbone, many CV tasks have advanced quickly, such as classification, object detection, and segmentation. Since pre-trained models have advanced at a breakneck pace in NLP and CV, many researchers have begun to study vision-and-language (V-L) learning to improve downstream multimodal tasks. Su et al. [51] proposed Visual-Linguistic BERT, which adopts the Transformer as the backbone and uses visual and linguistic embedding features as inputs to train the model. Tan and Bansal [52] introduced LXMERT, which contains not only a visual and a textual encoder but also a cross-modality encoder to build the relationship between images and sentences. Radford et al. [53] leveraged large-scale image-text pairs to build the CLIP model, which jointly trains a visual encoder and a textual encoder to predict the correct pairing. Singh et al. [54] proposed FLAVA, which first adopts a dual encoder to obtain unimodal vision and language representations as well as a multimodal representation. Yu et al. [55] presented a contrastive captioner, which employs contrastive loss and captioning loss for an image-text encoder-decoder and achieved state-of-the-art performance on a broad range of downstream tasks.
Even though there is a large body of research on multimodal sentiment analysis, few studies have applied pre-trained multimodal representations to this task. Considering the relationship between image and text on social media, this paper utilizes both a vision-language pre-trained model and unimodal pre-trained methods to build the network for the sentiment analysis task. Specifically, vision-language representations are applied to extract global fusion features, and unimodal representations are employed to construct a fine-grained correlation between image regions and words.

Overview
Text and image data in social media often exist at the same time. However, sometimes a single text may correspond to multiple images, which is more complicated. Considering that text and image information are equally important, this paper mainly focuses on social data in which one text corresponds to one image, and the image-text multimodal sentiment classification task is defined as follows. Given image-text pairs P = {(I_1, T_1), …, (I_i, T_i), …, (I_n, T_n)} and the corresponding label set L = {l_1, …, l_i, …, l_n}, where n is the total number of pairs in the set, I_i denotes a single image and T_i denotes the corresponding text of the pair (I_i, T_i). The sentiment label is l_i ∈ {positive, negative, neutral}. The goal of the multimodal method is to predict the sentiment polarity correctly. The framework of the proposed model is shown in Figure 2. GLFN is composed of three modules: global fusion learning, local fusion learning, and an integration network. Global fusion learning first extracts overall fusion features from the pre-trained image and text representations with an attention mechanism. Then, fine-grained local fusion learning explores the detailed correlation between image content and words. Finally, an integration network aggregates the sentimental information from the global fusion features and the local correlation features to produce the result.

Global Fusion Learning
The representations of image and text are obtained from the pre-trained vision-language model CLIP [53]. Since CLIP was trained on abundant image-text pair data from the internet, natural language supervision is embedded in the visual concepts learned during this process. That is, visual representations are learned from natural language information, and likewise, the textual representations implicitly encode image information. We argue that a pre-trained multimodal approach can produce shallow global fused features of image and text that already reflect the relationship between visual and textual representations. Multi-dimensional features are used to represent the global visual content I_G and the textual content T_G to retain more information.

The attention mechanism is applied to extract adequate information from the shallow global visual and textual fusion features. The attention operation first computes the scaled dot product of the query Q and keys K, and the softmax function builds a weight map that measures the critical parts of the fusion features. The output is the weighted sum of the values V. Query, keys, and values are linear projections of the input feature M = {m_1, m_2, …, m_n} ∈ R^(d_m×n):

Q = W_q M, K = W_k M, V = W_v M,

where W_q, W_k, W_v ∈ R^(d_m×d_m) are parameters learned during training. To achieve better performance, multi-head attention is employed in this paper. With h heads, Q, K, and V are each split into h parts {Q_i}, {K_i}, {V_i}, and for each head i the output O_i is calculated as

O_i = Softmax(Q_i^T K_i / √d_h) V_i^T, d_h = d_m/h.

The output of multi-head attention is then

O = W_o Concat(O_1, …, O_h),

where the parameter W_o ∈ R^(d_m×d_m) ensures that the output feature O keeps the same dimensions as the input features.
So, for given image fusion features I_G and text fusion features T_G, the global image fusion features F_IG and the global text fusion features F_TG can be represented as follows:

F_IG = MultiHead(I_G), F_TG = MultiHead(T_G),

where d_Ih = d_Im/Ih, d_Th = d_Tm/Th, and Ih and Th are the head numbers for image-based attention and text-based attention, respectively.
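The multi-head attention described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the learned projections W_q, W_k, W_v, W_o are omitted (treated as identity) so that only the split-attend-concatenate structure is shown, and features are stored row-wise (n×d_m) for readability.

```python
import math

def matmul(A, B):
    # naive matrix multiply: A is m x n, B is n x p
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def softmax(row):
    m = max(row)
    e = [math.exp(x - m) for x in row]
    s = sum(e)
    return [x / s for x in e]

def attention(Q, K, V, d_h):
    # scaled dot-product attention: softmax(Q K^T / sqrt(d_h)) V
    KT = [list(col) for col in zip(*K)]
    scores = matmul(Q, KT)
    weights = [softmax([s / math.sqrt(d_h) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head_attention(M, h):
    # M: n x d_m token features; split the feature dimension into h heads,
    # run attention per head, and concatenate the head outputs back to n x d_m
    d_m = len(M[0])
    d_h = d_m // h
    heads = []
    for i in range(h):
        lo, hi = i * d_h, (i + 1) * d_h
        Mi = [row[lo:hi] for row in M]          # this head's slice of the features
        heads.append(attention(Mi, Mi, Mi, d_h))
    return [sum((heads[i][t] for i in range(h)), []) for t in range(len(M))]
```

In the paper's formulation the same operation is applied separately to I_G and T_G with their own head counts Ih and Th.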

Local Fusion Learning
There are two modules in local fusion learning: a coarse cross-modal module and a fine-grained interaction module. The coarse cross-modal module builds rough interactions between image content and sentences, and the fine-grained interaction module further explores the correlation between image regions and words. The structure of local fusion learning is shown in Figure 3.

Cross-Modal Module
For the input image, we consider the image content to be a combination of scenes and objects, with objects and scenes of equal importance; this information can be extracted from pre-trained models. Object representations V_o and their corresponding regions are obtained by Faster R-CNN [56], trained on the Visual Genome dataset [11]; the top k region proposals are selected for each image as the image object content representations. Scene representations V_s are detected with a VGG model trained on the Places365 dataset [12]. For the input sentence, we apply GloVe [39] to represent each word, and a bidirectional LSTM summarizes the context information of the sentence:

h_i = BiLSTM(GloVe(t_i)), i = 1, …, c,

where t_i is the ith word. To calculate the weight map, linear projections map the representations into the same d-dimensional space:

R_o = W_o' V_o, R_s = W_s' V_s, R_w = W_w' [h_1, …, h_c],

where R_o ∈ R^(d×k), R_s ∈ R^(d×1), R_w ∈ R^(d×c), and c is the number of words in the sentence. For the given object features R_o and word features R_w, the object-word fusion matrix can be computed as:

W_ow = R_o^T R_w. (16)

The fusion matrix W_ow ∈ R^(k×c) represents the affinity between objects from the image and words in the sentence; specifically, W_ow,ij represents the affinity between the ith image region and the jth word. The attention weights between image objects and words are obtained with the softmax function. We obtain the word-specific image object region representation A_o and the region-specific word representation A_ow by combining the representations and the weight map:

A_o = R_w Softmax(W_ow)^T, (17)
A_ow = R_o Softmax(W_ow^T)^T, (18)

where Softmax(·) is applied row-wise. In a similar way, replacing R_o with R_s in Equation (16) gives W_sw, the affinity between the scene and the words. Using W_sw with R_s and R_w, as in Equations (17) and (18), we can calculate the word-specific image scene representation A_s and the scene-specific word representation A_sw.
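The cross-modal affinity-and-attention step can be sketched in plain Python. This is a hedged illustration of the co-attention pattern described above, not the authors' code: features are toy column-matrices, and the affinity is taken as a plain dot product (any learnable bilinear form is omitted).

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def softmax_rows(A):
    out = []
    for row in A:
        m = max(row)
        e = [math.exp(x - m) for x in row]
        s = sum(e)
        out.append([x / s for x in e])
    return out

def co_attention(R_o, R_w):
    # R_o: d x k object features, R_w: d x c word features (columns are items)
    W_ow = matmul(transpose(R_o), R_w)          # affinity matrix, k x c
    attn_w = softmax_rows(W_ow)                 # each region attends over words
    attn_o = softmax_rows(transpose(W_ow))      # each word attends over regions
    A_o = matmul(R_w, transpose(attn_w))        # word-specific region repr., d x k
    A_ow = matmul(R_o, transpose(attn_o))       # region-specific word repr., d x c
    return W_ow, A_o, A_ow
```

Replacing `R_o` with a single scene column vector gives the scene-word counterparts W_sw, A_s, and A_sw in the same way.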

Fine-Grained Fusion Module
Although the cross-modal module builds the correlation between image objects and words, it is difficult to perfectly align words and image regions because the relationship between images and sentences in social media is complex. Therefore, a fine-grained fusion module is applied to control the coarse fusion features adaptively, aiming to eliminate the noise generated in the cross-modal module and to explore deeper interactions between image regions and words. To enhance the matched pairs of fusion features and suppress the mismatched pairs, a gate first computes the matching value of each fusion feature; the gate weights then control how much of the fused cross-modal information is useful. The gate weight is high when the word and image content match well; conversely, when the image content and word are unpaired, the weight is low. The gate weights in the visual aspect are computed as follows:

G_wo = σ(MLP(A_o ⊙ R_o)),
G_ws = σ(MLP(A_s ⊙ R_s)),

where ⊙ denotes the element-wise product, σ is the sigmoid function, G_wo ∈ R^(1×k) is the text-object region matching weight in the visual aspect, and G_ws ∈ R^(1×1) is the text-scene matching weight in the visual aspect. We obtain the fine-grained fusion features from the gate weights and the fused modality-based fusion features. To preserve the modality-based information that is not intensively fused, we further integrate the fine-grained visual fusion features with the original features to obtain the object-wise fine-grained text-referred fusion O_wo and the scene-wise fine-grained text-referred fusion O_ws:

O_wo = G_wo ⊙ A_o + R_o,
O_ws = G_ws ⊙ A_s + R_s.
MLP is a two-layer perceptron operation. Symmetrically, we can calculate the weight maps and the textual-wise fine-grained object-referred features O_ow and textual-wise fine-grained scene-referred features O_sw:

G_ow = σ(MLP(A_ow ⊙ R_w)), O_ow = G_ow ⊙ A_ow + R_w,
G_sw = σ(MLP(A_sw ⊙ R_w)), O_sw = G_sw ⊙ A_sw + R_w,

where G_ow ∈ R^(1×c) and G_sw ∈ R^(1×c). We finally aggregate the different fine-grained fusion features in the spatial dimension based on the modality and obtain the visual-wise local fusion features F_IL and the textual-wise local fusion features F_TL:

F_IL = [O_wo, O_ws], F_TL = [O_ow, O_sw],

where [·, ·] denotes concatenation along the spatial dimension.
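The gated fine-grained fusion can be sketched as follows. This is a simplified illustration under stated assumptions: the paper's two-layer MLP is stood in for by an arbitrary `score_fn` that maps each item's element-wise product to a scalar, and the gate plus residual follows the O = G ⊙ A + R pattern above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(A, R, score_fn):
    # A: d x k cross-modal fused features, R: d x k original features
    # (columns are items). One scalar gate per item, computed from the
    # element-wise product A (*) R; `score_fn` stands in for the paper's MLP.
    d, k = len(A), len(A[0])
    gates = []
    for j in range(k):
        prod = [A[i][j] * R[i][j] for i in range(d)]   # element-wise product
        gates.append(sigmoid(score_fn(prod)))          # matching weight in (0, 1)
    # gated fusion plus residual: O = g (*) A + R
    O = [[gates[j] * A[i][j] + R[i][j] for j in range(k)] for i in range(d)]
    return gates, O
```

A well-matched column (large positive score) passes most of its fused feature through, while a mismatched column is damped toward the original unimodal feature.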

Integration Network
The network aims to integrate the global and local fusion features effectively and extract valuable information for sentiment analysis. Considering the architectures of global fusion learning and local fusion learning, we adopt different methods to deal with the features. For the global fused parts, since the attention mechanism has already been applied to extract the deeper effective information, we apply one-dimensional max-pooling to the global fusion features F_IG and F_TG on the spatial dimension, producing the features related to sentiment:

F_IGP = MaxPool(F_IG), F_TGP = MaxPool(F_TG).

For the local fused features, we employ an attention mechanism on the spatial dimension to aggregate the local image features F_IL and local text features F_TL with the built spatial-wise attention weight maps:

att_I = Softmax(MLP(F_IL)), att_T = Softmax(MLP(F_TL)),

where att_I ∈ R^(1×k), att_T ∈ R^(1×c). The final local fusion features can be expressed as follows:

F_ILA = F_IL att_I^T, F_TLA = F_TL att_T^T.

After obtaining the final global and local fused representations, we concatenate them as the input of a two-layer perceptron for the sentiment classification task:

ŷ = Softmax(MLP([F_IGP, F_TGP, F_ILA, F_TLA])). (38)

The model is trained by minimizing the cross-entropy loss with the Adam optimizer.
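The integration step (max-pool the global features, attention-pool the local features, then concatenate) can be sketched in plain Python. This is a structural sketch only: the MLP that scores spatial positions is replaced by externally supplied `scores`, an assumption made to keep the example self-contained.

```python
import math

def max_pool_spatial(F):
    # F: d x n features; 1-D max-pooling over the spatial (column) dimension
    return [max(row) for row in F]

def spatial_attention(F, scores):
    # F: d x n local features; `scores` (length n) stands in for MLP(F)
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    att = [x / sum(e) for x in e]               # softmax over spatial positions
    pooled = [sum(att[j] * row[j] for j in range(len(att))) for row in F]
    return att, pooled

def integrate(F_IG, F_TG, F_IL, F_TL, il_scores, tl_scores):
    # concatenate pooled global and attended local features as classifier input
    g_i = max_pool_spatial(F_IG)
    g_t = max_pool_spatial(F_TG)
    _, l_i = spatial_attention(F_IL, il_scores)
    _, l_t = spatial_attention(F_TL, tl_scores)
    return g_i + g_t + l_i + l_t
```

The returned vector corresponds to the concatenation fed into the final two-layer perceptron for classification.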

Experiments
This section describes the experimental results of the proposed model on two open datasets. This section consists of four parts: experimental data and model setup, baseline and comparison, ablation studies, and visualization.

Dataset and Setup
Niu et al. [26] established the public multimodal sentiment analysis datasets collected from Twitter, MVSA-Single and MVSA-Multiple. There are 5129 image-text pairs in MVSA-Single; a single annotator labeled the image and the text each with one sentiment polarity: positive, neutral, or negative. MVSA-Multiple contains 19,600 image-text pairs; three annotators labeled the image and text independently, with the judgment of each annotator unaffected by the others. For a fair comparison, we preprocess the two datasets following the previous methods [26,42,43]: first, we remove the pairs whose image and text labels conflict; when one label is positive (negative) and the other corresponding label is neutral, the label of the pair is regarded as positive (negative). As a result, we obtain a new MVSA-Single with 4511 image-text pairs and an MVSA-Multiple with 17,024 image-text pairs for the experiment.
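The label-merging rule described above can be written down directly. This is a sketch of the common MVSA preprocessing convention, not the authors' released code:

```python
def merge_pair_label(text_label, image_label):
    """Combine the text and image labels of one post:
    - identical labels are kept;
    - neutral paired with positive/negative takes the non-neutral label;
    - conflicting positive/negative pairs are discarded (None)."""
    if text_label == image_label:
        return text_label
    if 'neutral' in (text_label, image_label):
        return text_label if image_label == 'neutral' else image_label
    return None  # one positive and one negative: drop the pair

def preprocess(pairs):
    # pairs: list of (text_label, image_label); keep only consistent ones
    kept = []
    for t, i in pairs:
        label = merge_pair_label(t, i)
        if label is not None:
            kept.append(label)
    return kept
```

Applying this rule is what shrinks MVSA-Single from 5129 to 4511 pairs and MVSA-Multiple from 19,600 to 17,024 pairs.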
In this study, the training, validation, and test sets are split with a ratio of 8:1:1. We utilize Adam as the optimizer. The initial learning rate is 5 × 10^−5, and exponential decay with gamma equal to 0.5 is applied every five epochs. The batch size is 32 for MVSA-Single and 128 for MVSA-Multiple.
The maximum number of words per sentence is 50, the head number of the global fusion attention mechanism is 8, and we select the top 3 image region features as the local image representation. The dimension for projecting the local features into the same space is 512. The model is implemented in PyTorch.
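The 8:1:1 split described above can be sketched as follows; the shuffling policy and seed are assumptions for reproducibility, not details taken from the paper.

```python
import random

def split_dataset(items, seed=0, ratios=(0.8, 0.1, 0.1)):
    # shuffle, then split into train/validation/test with an 8:1:1 ratio
    items = list(items)
    random.Random(seed).shuffle(items)          # fixed seed for reproducibility
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```

For example, the 4511 pairs of the preprocessed MVSA-Single would yield roughly 3608/451/452 train/validation/test pairs under this scheme.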

Baseline and Comparison
We list six studies utilizing deep learning methods for the MVSA datasets as follows:

1.
MultiSentiNet [13]: employed CNN to obtain objects and scene deep semantic features of the image and utilized visual feature guided attention LSTM to extract important word features; all these features are aggregated for the sentiment analysis.

2.
Co-Memory [42]: proposed an iterative co-memory model with an attention mechanism by considering the relationship between image and text; the network explored the interaction between visual and textual information multiple times to analyze users' sentiment.

3.
FENet [43]: introduced an interactive information fusion mechanism, which learned the deep modality-specific fusion representation and built an information extraction network to extract information more effectively for the multimodal sentiment task.

4.
MVAN [45]: proposed a multi-view attention-based interaction model, which built an iterative scene-text co-memory network, as well as an iterative object-text co-memory network, to obtain semantic image-text features for the sentiment analysis.

5.
CLMLF [46]: introduced contrastive learning with a multi-layer transformer-based fusion method. Two contrastive learning tasks, label-based contrastive learning and data-based contrastive learning, are proposed for training to help the model learn the common features for sentiment analysis.

6.
ITIN [47]: developed an image-text interaction network to align the information between image region and words and preserved the valid region-word pairs with a cross-modal gating module for effective fusion features. The unimodal features are combined with cross-modal fusion features for the sentiment classification.
Following previous studies, we use the accuracy rate and F1-score as the experimental evaluation metrics. The calculation is as follows:

Accuracy = (TP + TN)/(TP + TN + FP + FN),
Precision = TP/(TP + FP), Recall = TP/(TP + FN),
F1 = 2 × Precision × Recall/(Precision + Recall),

where TP is the number of samples correctly predicted as positive, FP is the number of samples wrongly predicted as positive while the label is negative, TN is the number of samples correctly marked as negative, and FN is the number of samples with a negative prediction and a positive label.
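These metric definitions translate directly into code. Note that MVSA is a three-class task, so published results typically average the per-class F1 (e.g., weighted or macro); the sketch below implements the binary form matching the definitions above.

```python
def accuracy(tp, fp, tn, fn):
    # fraction of all samples that were classified correctly
    return (tp + tn) / (tp + fp + tn + fn)

def f1_score(tp, fp, fn):
    # F1 is the harmonic mean of precision and recall
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For instance, with TP = 8, FP = 2, TN = 7, FN = 3, accuracy is 15/20 = 0.75 and F1 is 16/21 ≈ 0.762.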
The experimental comparisons between the proposed model and the baselines are shown in Table 1; the baseline results were retrieved from the published papers. The results show that MultiSentiNet performs worst among the models: even though it considered the influence of visual information on the text, it ignored the influence of textual information on the image, and the interaction between image and text is shallow. Co-Memory considered the mutual influence and proposed an interactive fusion module to analyze sentiment, performing better than MultiSentiNet, but its coarse-grained attention mechanism may cause information loss and still leaves room for improvement. FENet applied a symmetric interactive information fusion module and an information extraction module to obtain informative representations, separating the analysis into two sections and making the model more effective than the previous models. Since MVAN proposed a multi-view attention model to build the correlation between objects/scenes and words and integrated the features to predict results, it is slightly superior to FENet. CLMLF applied label-based and data-based contrastive learning for the sentiment task, which learned more sentiment-related features and outperformed MVAN. The ITIN model considered the relationship between image regions and words and built deep region-based fusion features for classification, achieving the best results among the baselines. Our model is competitive with the baseline models on the MVSA datasets: it extracts coarse global fusion features to obtain whole fusion information for image-text pairs and explores fine-grained local fusion features for detailed correlation between corresponding regions from images and words in sentences, thereby utilizing the interaction comprehensively for multimodal sentiment analysis.
As a result, on the MVSA-Single dataset, the proposed model outperforms the strongest existing baselines, ITIN and CLMLF, by 2.02% in accuracy and 1.45% in F1-score. On the MVSA-Multiple dataset, the model improves accuracy by 2.35% and is superior to the other models in F1-score.

Ablation Studies
To verify the effectiveness of each proposed module of GLFN, we carry out ablation experiments on the MVSA datasets. We remove textual global fusion learning (wo-TGF), visual global fusion learning (wo-IGF), textual local fusion learning (wo-TLF), and visual local fusion learning (wo-ILF) from GLFN to evaluate the influence of each part. In addition, we further investigate the importance of the two components of local fusion learning, removing the cross-modal fusion module (wo-CM) and the fine-grained fusion module (wo-FF) independently. The results of the ablation experiments are reported in Table 2. From the results, we can observe that the full version of GLFN achieves the best results; removing any part of the model degrades performance, indicating that all these parts contribute to the sentiment task. For global fusion learning, textual fusion features are more important than visual features for the prediction. The importance of the visual and textual local fusion features differs across the datasets, depending strongly on the relationship between images and texts. The additional investigation of local fusion shows that the cross-modal fusion module is more important than the fine-grained fusion module, and the fine-grained fusion further enhances the performance of the model. Compared with the other parts, the textual global fusion features are the most effective for the prediction.
We also consider the number of region proposals in the image as a hyperparameter that affects the interaction between image and text and the performance of sentiment analysis, and we conduct experiments with different top k region proposals where objects exist. Figure 4a,b show the experimental results for accuracy and F1-score; MVSA-S represents the results obtained on MVSA-Single, and MVSA-M denotes the results on MVSA-Multiple. The figures show that when k = 3, both the accuracy and F1-score reach their maximum values on the two datasets. This indicates that the correlation between image regions and words can be discovered within three region proposals, and the local learning is most effective when the number of region proposals equals three. Therefore, for all experiments reported in this paper, the region number is set to three. Here, k = 0 denotes that local fusion learning is removed from the model, which significantly degrades performance; this result indicates the effectiveness of local fusion learning.

Visualization
We conduct a visualization experiment on part of the MVSA-Single dataset to demonstrate the effectiveness of global fusion learning. The image's global fusion representation and its initial representation are visualized by dimensionality reduction: we use the t-SNE algorithm to reduce the dimensions of the visual features and project the representations into a two-dimensional space, as shown in Figure 5. Figure 5a is the visualization of the original visual features, and Figure 5b is the visualization of the attention-based visual fusion features. The three marker symbols in the figure represent the three sentiment labels: positive, neutral, and negative. From the figure, we can see that after the attention mechanism of fusion learning, the separation between the different categories is more prominent; that is, the intra-class distance decreases and the inter-class spacing increases. This shows that global fusion learning can distinguish the valid information that helps improve the performance of sentiment analysis. We also visualize the attention weights of the words and image regions in the local fusion learning part. Figure 6 illustrates two examples: the upper part shows region and scene representations shaded in green, and the bottom row shows the words shaded in red, where a darker color represents greater attention and vice versa. We can see that the contributions of image regions and words differ in Figure 6a. The area of the sky contributes most to the sentiment analysis, and the words 'Gorgeous Milky Way' are more effective than the others for the prediction. In addition, the road region receives little attention in this image-text pair, since both its content and emotion correlate weakly with the rest of the pair.
In Figure 6b, the region of the cars has the most significant contribution, the area of the scene gives little contribution to the sentiment analysis, and the corresponding text is a little complex; the word 'disgraceful' and phrase 'nearly run people off' are the crucial parts for the analysis.

Conclusions
Social media multimodal sentiment analysis is a challenging task. We have proposed a global local fusion neural network (GLFN) for the sentiment prediction task. The model considers the relationship between image and text, combining the general fusion information extracted by global fusion learning and the local fine-grained fusion information obtained by local fusion learning to explore essential features related to sentiment. Specifically, the pre-trained vision-language model provides the input of global fusion learning to obtain comprehensive overall fusion features. In local fusion learning, scene and object representations construct a deep correlation with words as fine-grained fusion features for specific relation discovery between image and sentence. The integration network aggregates visual and textual context features as integrated fusion information for effective sentiment prediction. Experimental results and comparisons demonstrate that our model significantly improves sentiment classification performance on the multimodal datasets. Even though we obtained promising results, the model has a limitation: in some posts, the image and text are unrelated, so the sentiment expression relies on independent unimodal features, which may limit the performance of GLFN. In future work, we plan to improve the integration network by exploring the complicated relationship between images and texts and building more effective features that consider the ratio of fused to independent information in sentiment. Furthermore, we want to extend the model to multimodal aspect-based sentiment analysis, since local fusion learning can align image and text information and explore the detailed correlation between corresponding pairs.

Data Availability Statement:
The data presented in this study are from public datasets that can be downloaded from https://mcrlab.net/research/mvsa-sentiment-analysis-on-multi-view-socialdata/ (accessed on 1 January 2016).

Conflicts of Interest:
The authors declare no conflict of interest regarding the publication of this paper.