# Tree-Based Mix-Order Polynomial Fusion Network for Multimodal Sentiment Analysis


## Abstract


## 1. Introduction

- We propose the mixed-order polynomial tensor pooling (MOPTP) block, which adaptively activates the most discriminative sentiment properties among representation subspaces of varying orders. Compared with existing fixed-order methods, the proposed mixed-order model effectively integrates multiple local sentiment properties into a more discriminative one, leading to relatively global optimal performance.
- We propose a novel tree-based multimodal sentiment learning architecture (TMOPFN) that allows multiple sentiment analysis strategies to be applied on the same network layer simultaneously. Compared with existing sequential sentiment analysis models, our parallel framework naturally captures multi-level sentiment properties via its parallel procedure.
- We conduct extensive experiments on three public multimodal benchmarks to evaluate the proposed model. The empirical results demonstrate the effectiveness of the mixed-order techniques and the tree-based learning architecture. Please note that sentiment is part of emotion, i.e., emotion recognition also covers sentiment analysis; we therefore evaluate our model on two sentiment analysis benchmarks and one emotion recognition benchmark.

## 2. Related Work

## 3. Preliminaries

## 4. Methodology

#### 4.1. Mix-Order Polynomial Tensor Pooling (MOPTP)

**Algorithm 1** Mix-Order Polynomial Tensor Pooling (MOPTP)

**Input:** text ${\mathbf{x}}_{1}$, video ${\mathbf{x}}_{2}$, audio ${\mathbf{x}}_{3}$; the number of subspaces $N$; tensor rank $R$.
**Output:** the multimodal fusion representation $\mathbf{x}$.

1: ${\mathbf{x}}_{123} \leftarrow \mathrm{concat}({\mathbf{x}}_{1}, {\mathbf{x}}_{2}, {\mathbf{x}}_{3})$
2: **for** $n = 1$ **to** $N$ **do**
3:  **for** $p = 1$ **to** $n$ **do**
4:   **for** $r = 1$ **to** $R$ **do**
5:    ${\mathbf{x}}_{123}^{p(r)} \leftarrow a_{r}\,{\mathbf{w}}_{p}^{(r)} \cdot {\mathbf{x}}_{123}$
6:   **end for**
7:   ${\mathbf{x}}_{123}^{p} \leftarrow {\mathbf{x}}_{123}^{p(1)} + \cdots + {\mathbf{x}}_{123}^{p(R)}$
8:   ${\mathbf{x}}_{123}^{p} \leftarrow {\mathbf{x}}_{123}^{p} \cdot {\mathbf{x}}_{123}^{(p-1)}$
9:  **end for**
10: ${\mathbf{x}}_{123}^{n} \leftarrow a_{n}\,{\mathbf{x}}_{123}^{n}$
11: **end for**
12: $\mathbf{x} \leftarrow {\mathbf{x}}_{123}^{1} + \cdots + {\mathbf{x}}_{123}^{N}$
13: **return** $\mathbf{x}$

The output of Algorithm 1 is the multimodal fusion representation **x**. The high-order term in the $n$-th mode is indicated by $i_{n}$.
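The steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration under our own assumptions (each ${\mathbf{w}}_{p}^{(r)}$ is taken as a dense projection matrix, the product in step 8 as a Hadamard product, and a single coefficient vector `a` is reused for both the rank weights $a_r$ and the order weights $a_n$), not the authors' implementation.

```python
import numpy as np

def moptp(x1, x2, x3, weights, a, N, R):
    """Sketch of mix-order polynomial tensor pooling (Algorithm 1).

    weights[p][r]: assumed dense (h, d) projection standing in for w_p^(r);
    a: assumed coefficient vector reused as both a_r and a_n.
    """
    x123 = np.concatenate([x1, x2, x3])            # step 1: joint input of dim d
    subspaces = []
    for n in range(1, N + 1):                      # one order-n subspace per n
        term = 1.0                                 # running Hadamard product
        for p in range(n):                         # steps 3-9: n projected factors
            proj = sum(a[r] * (weights[p][r] @ x123) for r in range(R))
            term = term * proj                     # step 8: raise order by one
        subspaces.append(a[n - 1] * term)          # step 10: weight the subspace
    return sum(subspaces)                          # step 12: mix the orders
```

Each pass through the inner loop multiplies in one more projected copy of the joint input, so the $n$-th summand is a degree-$n$ polynomial in ${\mathbf{x}}_{123}$; the final sum mixes subspaces of all orders $1 \ldots N$.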

#### 4.2. Tree-Based Mix-Order Polynomial Fusion Network (TMOPFN)

`H1-1`. At the second hidden layer, MOPTP is used to model local sentiment properties within a 'window' covering 2 time steps and 2 modalities. For example, MOPTP is applied to merge the text and video representations spanning time steps `T3` and `T4`, resulting in the third hidden node of the new modality `H2-1`. Similarly, the first hidden node of `H2-2` is obtained by fusing the video and audio features of `T1` and `T2`. Please note that the first hidden layer focuses on the general sentiment properties among the three modalities, while the second hidden layer attempts to highlight the local sentiment properties of each modality pair. At the output layer, MOPTP integrates the messages of the first and second hidden layers into the final global sentiment properties.
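The window-based layer described above can be sketched as follows. The `fuse` function is only a placeholder (a real model would apply a MOPTP block), and the specific sliding scheme (non-overlapping 2-step windows over adjacent modality pairs) is our reading of the description.

```python
import numpy as np

def fuse(window):
    # Stand-in for a MOPTP block; a real model would apply
    # mix-order polynomial pooling to the window's features.
    return np.mean(window, axis=0)

def second_hidden_layer(feature_map):
    """feature_map[t][m]: feature vector at time step t for modality m
    (e.g., m = 0 text, 1 video, 2 audio). A [2 time steps x 2 modalities]
    'window' slides over the map; each window yields one hidden node,
    mirroring how H2-1 merges text/video over T3 and T4."""
    T, M = len(feature_map), len(feature_map[0])
    nodes = []
    for t in range(0, T - 1, 2):                   # non-overlapping 2-step windows
        for m in range(M - 1):                     # adjacent modality pairs
            window = [feature_map[t][m],     feature_map[t][m + 1],
                      feature_map[t + 1][m], feature_map[t + 1][m + 1]]
            nodes.append(fuse(window))
    return nodes
```

With 4 time steps and 3 modalities this yields 4 hidden nodes, each summarizing one local time-modality window of the preceding '2D map'.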

`H1-1` is captured by integrating the text and video representations of `T3` and `T4`. Then, the audio representation of the input layer is directly delivered to the same hidden layer `H1`. Furthermore, we can capture the inter-correlations between audio and `H1-1`. This strategy allows our model to highlight fine-grained, multi-stage sentiment inter-correlations among multiple modalities. Note that each modality serves as the backbone and guides the multimodal sentiment analysis process in diverse fashions, leading to comprehensive sentiment properties. Taking advantage of the tree-based structure, the original sentiment analysis framework can be extended to diverse variants that enjoy a significant improvement in expressive power.

## 5. Experiments

#### 5.1. Experiment Setups

`TMOPFN-L3-S1`, `TMOPFN-L3-S2`, and `TMOPFN-L3-S3`. Furthermore, the corresponding model complexity analysis is presented in Table 2. As for the input of the models, only tensor-based models perform multimodal fusion on tensor input, while non-tensor-based models take vectors as input. Additionally, we report the model results with the following evaluation metrics: MAE = mean absolute error, Corr = Pearson correlation, Acc-2 = accuracy across two classes, Acc-7 = accuracy across seven classes, and F1 = F1 score. For a fair comparison, we follow [24,27] in reporting the above evaluation metrics. Specifically, the binary Acc/F1 is calculated on positive/negative sentiments: when the value of the label is larger than 0, the corresponding data are annotated as 'positive sentiment'.
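The binarization rule above (label > 0 means 'positive sentiment') determines how Acc-2 and F1 are computed from the continuous labels. A sketch of that evaluation, under our own implementation choices rather than the authors' exact scripts:

```python
import numpy as np

def sentiment_metrics(labels, preds):
    """MAE plus binary Acc-2/F1 under the protocol above: a continuous
    label greater than 0 is annotated 'positive sentiment'."""
    labels, preds = np.asarray(labels), np.asarray(preds)
    mae = np.mean(np.abs(labels - preds))
    y_true, y_pred = labels > 0, preds > 0         # binarize at zero
    acc2 = np.mean(y_true == y_pred)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return mae, acc2, f1
```

For example, `sentiment_metrics([1.2, -0.5, 2.0, -1.0], [0.8, 0.3, 1.5, -2.0])` yields MAE 0.675, Acc-2 0.75, and F1 0.8.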

'TMOPFN-L3-S2, 2 subspaces' exceeds the previous best MFM on the 'Acc-7' metric with an improvement of $3.2\%$. The overall best performance on 'MAE' is achieved by 'TMOPFN-L3-S1, 2 subspaces'. This implies that the tree-based structure is able to highlight multi-level sentiment properties via its parallel procedure. It is important to note that our 'TMOPFN-L1, 2 subspaces' achieves the best performance on 'Corr', 'Acc-2', and 'F1', which implies the superior expressive capability of MOPTP in capturing the more discriminative sentiment properties among multiple representation subspaces. The results for different numbers of modalities are shown in Table 4. Furthermore, we investigate the performance of transformer-based and graph-based models, which have recently exhibited great expressive power in multimodal learning. As shown in Table 5 and Table 6, our model still obtains state-of-the-art or competitive performance, implying the superiority of our work. In particular, TMOPFN exceeds the other models by a significant margin of 5.4 on 'Acc-7', demonstrating that TMOPFN can effectively investigate discriminative sentiment properties with the help of MOPTP and the tree-based framework. Additionally, we report UA+WA (Table 7) and the variance and standard deviation of TMOPFN (Table 8). The results demonstrate the robustness of the proposed model.

#### 5.2. Experimental Results

'TMOPFN-L1, 1 subspace', where the multimodal representation is non-temporal, obtained by averaging along the temporal domain. The '1 subspace' setting indicates that TMOPFN is associated with only a single fixed-order representation subspace, whose order $P$ ranges from 1 to 9. As shown in Figure 6, 'TMOPFN-L1, 1 subspace' obtains competitive performance with respect to the specified orders. We observe that 'TMOPFN-L1, 1 subspace' reaches its peak value at order 4 on the CMU-MOSI dataset. On the IEMOCAP dataset, it is important to note that the best performance for the 'neutral' emotion is obtained at order 3, while the other emotions reach their highest point at orders varying from 6 to 8. Indeed, 'neutral' is a relatively straightforward emotion and may only require relatively simple or general properties searched among multiple modalities in the low-order feature space; i.e., the low-order multimodal fusion space comprises low-level interactions that capture the general properties of 'neutral', leading to good recognition performance. This is quite analogous to the behavior of a shallow network. In contrast, 'angry', 'happy', and 'sad' are relatively strong emotions that may require the sophisticated properties of a high-order space. In other words, the high-order polynomial space consists of many more polynomial terms, and each term corresponds to a local inter-correlation among multiple modalities. An operation block associated with a high-order representation space can then be employed to recognize the strong emotions, allowing us to effectively integrate comprehensive local interactions into a sophisticated global one.
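The claim that a high-order polynomial space contains many more terms can be made concrete by counting: the number of distinct degree-$P$ monomials over $d$ fused features is $\binom{d+P-1}{P}$. The counting identity is standard; the framing of each monomial as one 'local inter-correlation' follows the discussion above, and the feature dimension used below is illustrative.

```python
from math import comb

def num_interaction_terms(d, P):
    """Distinct degree-P monomials over d fused features: multisets of
    size P drawn from d variables, i.e., C(d + P - 1, P). Each monomial
    corresponds to one local inter-correlation among the modalities."""
    return comb(d + P - 1, P)

# The term count grows quickly with the order P, which is why a
# high-order subspace can encode the more sophisticated properties
# that strong emotions appear to require.
counts = {P: num_interaction_terms(8, P) for P in (1, 3, 6)}
```

With $d = 8$ fused features, order 1 yields 8 terms, order 3 yields 120, and order 6 yields 1716.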

`TMOPFN-L1`, where the non-temporal multimodal representation is retrieved by averaging along the temporal domain. The number of subspaces ranges from 1 to 9, where the setting 'MOPTP, subspace = 2' means that the corresponding MOPTP includes 1-order and 2-order representation subspaces.

`TMOPFN-L1` is able to achieve fairly good performance with respect to the representation subspaces. Compared to the performance shown in Figure 6, MOPTP outperforms the fixed-order manner, which indicates the necessity and effectiveness of activating the discriminative sentiment properties among representation subspaces of varying orders. In particular, our model reaches its peak value at 3 and 2 subspaces for the 'neutral' and 'happy' emotions, while the other emotions maximize performance at 7 and 8 subspaces on IEMOCAP. Actually, 'angry' and 'sad' are relatively negative emotions comprising more complex sentiment properties; we therefore need to exploit a sophisticated sentiment message covering comprehensive sentiment granularity within the receptive 'window'. The relatively low-order sentiment subspaces consist of coarse-grained sentiment characteristics, while the high-order subspaces comprise fine-grained sentiment properties. Notably, for the 'happy' emotion, the setting 'MOPTP, subspace = 2' (Figure 7) exceeds the 1-order and 2-order fixed settings (Figure 6). Moreover, the fixed-order manner reaches its highest point at order 5, while MOPTP maximizes performance at only 2 subspaces. Since 'happy' is a positive emotion, it may only require the explicit or general sentiment properties highlighted from the 'window'. Compared to the fixed-order manner, MOPTP focuses on highlighting the more discriminative sentiment properties between the 1-order and 2-order subspaces. Note that these two low-order subspaces include distinct general sentiment properties, which may contribute to a more comprehensive general message and bring forth improved expressive power. Hence, MOPTP is able to exceed the fixed-order manner with only the integration of low-order representation subspaces.
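The comparison above, 'MOPTP, subspace = 2' mixing the 1-order and 2-order subspaces versus a single fixed order, can be sketched as a gated sum over order-wise terms. The Hadamard power standing in for an order-$n$ subspace and the fixed `gates` are our simplifications; in MOPTP the activation weights are learned.

```python
import numpy as np

def fixed_order(x, order):
    # Stand-in for a single order-n representation subspace.
    return x ** order

def mixed_order(x, num_subspaces, gates):
    """Gated sum over subspaces of order 1..num_subspaces, so
    num_subspaces=2 mixes the two low-order 'general' subspaces
    rather than committing to a single fixed order."""
    return sum(g * fixed_order(x, n)
               for n, g in zip(range(1, num_subspaces + 1), gates))
```

For example, with equal gates of 0.5, `mixed_order` on `[1.0, 2.0]` returns the average of the 1-order term `[1, 2]` and the 2-order term `[1, 4]`.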

`HPFN-L1`, `HPFN-L2`, `HPFN-L3`, and `HPFN-L4`. Furthermore, our TMOPFN variants also serve as test cases: `TMOPFN-L3-S1`, `TMOPFN-L3-S2`, and `TMOPFN-L3-S3`. In Table 9, we observe that the two-layer and three-layer frameworks achieve better overall performance than both the one-layer and four-layer cases. Thanks to the hierarchical architecture and recursive learning strategy, the local inter-correlations of the previous layer can easily be transmitted to the next layer. Therefore, a relatively deeper model comprises comprehensive sentiment properties and naturally outperforms the shallow (one-layer) case. However, the four-layer network contains too much redundancy, which may obscure the core message. In particular, `TMOPFN-L3-S3` reaches the best precision on almost all metrics. Compared to the sequential hierarchical architecture of HPFN, TMOPFN can leverage multiple fusion strategies on the same network layer simultaneously, leading to multi-level sentiment properties. Compared to the other tree-based networks, `TMOPFN-L3-S3` employs the segmentation fusion strategy to analyze the data. Consequently, each modality can serve as the backbone and guide the fusion procedure in turn, contributing to an exhaustive multimodal fusion representation.

`HPFN-L2`-based and `TMOPFN-L3`-based. In Table 9, we find that `TMOPFN-L3-S3` achieves the best performance on almost all metrics. Compared to `TMOPFN-L3-S3`, `HPFN-L2-S1` and `HPFN-L2-S2` only directly absorb the original modality from the preceding layer, where no single modality serves as the backbone guiding the multimodal fusion process. Additionally, `TMOPFN-L3-S1` only pays close attention to the text modality and overlooks the importance of the audio and video modalities. Actually, employing the audio or video modality to guide the fusion procedure allows for much more complementarity and consistency among multiple modalities. Essentially, `TMOPFN-L3-S3` involves more 'windows', and each 'window' focuses on exploiting the corresponding sentiment properties from the same preceding '2D map', leading to multi-level multimodal properties. In conclusion, adopting the skip operation helps incorporate additional knowledge captured from the preceding layer, allowing for superior expressive power.

`TMOPFN-L2` is applied with a scanning 'window' at the input layer, where the size is $[4\times 2]$ and the stride is set to 2 along the temporal domain. Initially, we employ MOPTP to process the modality representations within the scanning 'window' along the temporal dimension. As shown in Figure 8, compared to the weight-shared cases, the non-weight-shared model receives comparable or relatively higher performance. Actually, the sharing strategy may strip away the dynamic dependency of multimodal interactions along the temporal domain; hence, sharing the same MOPTP across various small 'windows' may not bring forth additional improvement in task performance. As for the non-weight-shared models, it is interesting to find a general upward trend with the size of the 'window'. This may imply that a large 'window' comprises more common patterns among multiple modalities, as well as intra-modal and inter-modal temporal consistency. Therefore, employing the non-sharing strategy may further exploit the latent temporal consistency and common associations among multiple modalities and boost task performance.
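The scanning-'window' setup can be sketched as follows. We read the $[4\times 2]$ size as 4 time steps by 2 modalities and show only the temporal slicing, with per-window fusion ops as placeholders; giving each window its own op corresponds to the non-weight-shared strategy discussed above.

```python
import numpy as np

def temporal_windows(seq, size=4, stride=2):
    """Slice a (T, d) modality sequence into scanning 'windows' of
    `size` time steps, moving by `stride` along the temporal domain."""
    T = seq.shape[0]
    return [seq[t:t + size] for t in range(0, T - size + 1, stride)]

def non_weight_shared_fusion(seq, fusers, size=4, stride=2):
    """Apply a distinct fusion op to each window (non-weight-shared),
    preserving the dynamic temporal dependency that a single shared
    op might strip away."""
    windows = temporal_windows(seq, size, stride)
    return [f(w) for f, w in zip(fusers, windows)]
```

With $T = 8$ time steps, size 4, and stride 2, this produces three overlapping windows starting at steps 0, 2, and 4, each fused by its own operator.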

`MOPFN-L1`, `MOPFN-L2`, `MOPFN-L2-S1`, `MOPFN-L2-S2`, `MOPFN-L3`, and `MOPFN-L4`. We also validate MOPTP on the newly proposed tree-based multimodal fusion frameworks: `TMOPFN-L3-S1`, `TMOPFN-L3-S2`, and `TMOPFN-L3-S3`.

`TMOPFN-L3-S1`, `TMOPFN-L3-S2`, and `TMOPFN-L3-S3`). Indeed, 'Happy' and 'Neutral' are relatively positive emotions and may require the comprehensive explicit properties highlighted by the multimodal information. Note that the tree-based architecture allows us to capture multi-level sentiment properties from the same '2D feature map' simultaneously. Therefore, associating MOPTP with TMOPFN can further boost the expressive capability of MOPTP, leading to good performance on the positive emotion analysis tasks.

## 6. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## References

- Shoumy, N.J.; Ang, L.M.; Seng, K.P.; Rahaman, D.M.; Zia, T. Multimodal big data affective analytics: A comprehensive survey using text, audio, visual and physiological signals. J. Netw. Comput. Appl. **2020**, 149, 102447.
- Yu, Y.; Kim, Y.J. Attention-LSTM-attention model for speech emotion recognition and analysis of IEMOCAP database. Electronics **2020**, 9, 713.
- Rahman, W.; Hasan, M.K.; Lee, S.; Zadeh, A.; Mao, C.; Morency, L.P.; Hoque, E. Integrating multimodal information in large pretrained transformers. In Proceedings of the Conference Association for Computational Linguistics, Online, 6–8 July 2020; Volume 2020, p. 2359. Available online: https://aclanthology.org/2020.acl-main.214/ (accessed on 20 December 2022).
- Yadav, A.; Vishwakarma, D.K. Sentiment analysis using deep learning architectures: A review. Artif. Intell. Rev. **2020**, 53, 4335–4385.
- Peng, Y.; Qin, F.; Kong, W.; Ge, Y.; Nie, F.; Cichocki, A. GFIL: A unified framework for the importance analysis of features, frequency bands and channels in EEG-based emotion recognition. IEEE Trans. Cogn. Dev. Syst. **2021**, 14, 935–947.
- Lai, Z.; Wang, Y.; Feng, R.; Hu, X.; Xu, H. Multi-Feature Fusion Based Deepfake Face Forgery Video Detection. Systems **2022**, 10, 31.
- Shen, F.; Peng, Y.; Dai, G.; Lu, B.; Kong, W. Coupled Projection Transfer Metric Learning for Cross-Session Emotion Recognition from EEG. Systems **2022**, 10, 47.
- Zadeh, A.; Liang, P.P.; Mazumder, N.; Poria, S.; Cambria, E.; Morency, L.P. Memory fusion network for multi-view sequential learning. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 2–3 February 2018.
- Chandrasekaran, G.; Antoanela, N.; Andrei, G.; Monica, C.; Hemanth, J. Visual Sentiment Analysis Using Deep Learning Models with Social Media Data. Appl. Sci. **2022**, 12, 1030.
- Atmaja, B.T.; Sasou, A. Sentiment Analysis and Emotion Recognition from Speech Using Universal Speech Representations. Sensors **2022**, 22, 6369.
- Ma, F.; Zhang, W.; Li, Y.; Huang, S.L.; Zhang, L. Learning better representations for audio-visual emotion recognition with common information. Appl. Sci. **2020**, 10, 7239.
- Atmaja, B.T.; Sasou, A.; Akagi, M. Survey on bimodal speech emotion recognition from acoustic and linguistic information fusion. Speech Commun. **2022**, 140, 11–28.
- Liang, P.P.; Liu, Z.; Zadeh, A.B.; Morency, L.P. Multimodal language analysis with recurrent multistage fusion. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 150–161.
- Boehm, K.M.; Khosravi, P.; Vanguri, R.; Gao, J.; Shah, S.P. Harnessing multimodal data integration to advance precision oncology. Nat. Rev. Cancer **2022**, 22, 114–126.
- Liang, P.P.; Lim, Y.C.; Tsai, Y.H.H.; Salakhutdinov, R.; Morency, L.P. Strong and simple baselines for multimodal utterance embeddings. arXiv **2019**, arXiv:1906.02125.
- Bayoudh, K.; Knani, R.; Hamdaoui, F.; Mtibaa, A. A survey on deep multimodal learning for computer vision: Advances, trends, applications, and datasets. Vis. Comput. **2022**, 38, 2939–2970.
- Poria, S.; Cambria, E.; Bajpai, R.; Hussain, A. A review of affective computing: From unimodal analysis to multimodal fusion. Inf. Fusion **2017**, 37, 98–125.
- Sharma, K.; Giannakos, M. Multimodal data capabilities for learning: What can multimodal data tell us about learning? Br. J. Educ. Technol. **2020**, 51, 1450–1484.
- Zhang, J.; Yin, Z.; Chen, P.; Nichele, S. Emotion recognition using multi-modal data and machine learning techniques: A tutorial and review. Inf. Fusion **2020**, 59, 103–126.
- Mai, S.; Hu, H.; Xu, J.; Xing, S. Multi-fusion residual memory network for multimodal human sentiment comprehension. IEEE Trans. Affect. Comput. **2020**, 13, 320–334.
- Li, Q.; Gkoumas, D.; Lioma, C.; Melucci, M. Quantum-inspired multimodal fusion for video sentiment analysis. Inf. Fusion **2021**, 65, 58–71.
- Li, W.; Zhu, L.; Shi, Y.; Guo, K.; Cambria, E. User reviews: Sentiment analysis using lexicon integrated two-channel CNN–LSTM family models. Appl. Soft Comput. **2020**, 94, 106435.
- Chen, L.; Li, S.; Bai, Q.; Yang, J.; Jiang, S.; Miao, Y. Review of image classification algorithms based on convolutional neural networks. Remote Sens. **2021**, 13, 4712.
- Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.P. Tensor fusion network for multimodal sentiment analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 9–11 September 2017; pp. 1103–1114.
- Zhang, Y.; Cheng, C.; Wang, S.; Xia, T. Emotion recognition using heterogeneous convolutional neural networks combined with multimodal factorized bilinear pooling. Biomed. Signal Process. Control **2022**, 77, 103877.
- Wang, J.; Ji, Y.; Sun, J.; Yang, Y.; Sakai, T. MIRTT: Learning Multimodal Interaction Representations from Trilinear Transformers for Visual Question Answering. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 16–20 November 2021; pp. 2280–2292.
- Liu, Z.; Shen, Y.; Lakshminarasimhan, V.B.; Liang, P.P.; Zadeh, A.B.; Morency, L.P. Efficient low-rank multimodal fusion with modality-specific factors. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; pp. 2247–2256.
- Choi, D.Y.; Kim, D.H.; Song, B.C. Multimodal attention network for continuous-time emotion recognition using video and EEG signals. IEEE Access **2020**, 8, 203814–203826.
- Hou, M.; Tang, J.; Zhang, J.; Kong, W.; Zhao, Q. Deep multimodal multilinear fusion with high-order polynomial pooling. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 12136–12145.
- Huan, R.H.; Shu, J.; Bao, S.L.; Liang, R.H.; Chen, P.; Chi, K.K. Video multimodal emotion recognition based on Bi-GRU and attention fusion. Multimed. Tools Appl. **2021**, 80, 8213–8240.
- Van Houdt, G.; Mosquera, C.; Nápoles, G. A review on the long short-term memory model. Artif. Intell. Rev. **2020**, 53, 5929–5955.
- Khalid, H.; Gorji, A.; Bourdoux, A.; Pollin, S.; Sahli, H. Multi-view CNN-LSTM architecture for radar-based human activity recognition. IEEE Access **2022**, 10, 24509–24519.
- Poria, S.; Cambria, E.; Hazarika, D.; Majumder, N.; Zadeh, A.; Morency, L.P. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 873–883.
- Zadeh, A.; Liang, P.P.; Poria, S.; Vij, P.; Cambria, E.; Morency, L.P. Multi-attention recurrent network for human communication comprehension. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 2–3 February 2018.
- Tsai, Y.H.; Liang, P.P.; Zadeh, A.; Morency, L.; Salakhutdinov, R. Learning Factorized Multimodal Representations. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
- Sahay, S.; Okur, E.; Kumar, S.H.; Nachman, L. Low Rank Fusion based Transformers for Multimodal Sequences. arXiv **2020**, arXiv:2007.02038.
- Huang, F.; Wei, K.; Weng, J.; Li, Z. Attention-based modality-gated networks for image-text sentiment analysis. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) **2020**, 16, 1–19.
- Mai, S.; Xing, S.; He, J.; Zeng, Y.; Hu, H. Analyzing Unaligned Multimodal Sequence via Graph Convolution and Graph Pooling Fusion. arXiv **2020**, arXiv:2011.13572.
- Yang, J.; Wang, Y.; Yi, R.; Zhu, Y.; Rehman, A.; Zadeh, A.; Poria, S.; Morency, L.P. MTGAT: Multimodal Temporal Graph Attention Networks for Unaligned Human Multimodal Language Sequences. arXiv **2020**, arXiv:2010.11985.
- Chen, J.; Zhang, A. HGMF: Heterogeneous Graph-based Fusion for Multimodal Data with Incompleteness. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, CA, USA, 6–10 July 2020; pp. 1295–1305.
- Hong, D.; Kolda, T.G.; Duersch, J.A. Generalized canonical polyadic tensor decomposition. SIAM Rev. **2020**, 62, 133–163.
- Little, A.; Xie, Y.; Sun, Q. An analysis of classical multidimensional scaling with applications to clustering. Inf. Inference J. IMA **2022**.
- Reyes, J.A.; Stoudenmire, E.M. Multi-scale tensor network architecture for machine learning. Mach. Learn. Sci. Technol. **2021**, 2, 035036.
- Phan, A.H.; Cichocki, A.; Uschmajew, A.; Tichavskỳ, P.; Luta, G.; Mandic, D.P. Tensor networks for latent variable analysis: Novel algorithms for tensor train approximation. IEEE Trans. Neural Netw. Learn. Syst. **2020**, 31, 4622–4636.
- Asante-Mensah, M.G.; Ahmadi-Asl, S.; Cichocki, A. Matrix and tensor completion using tensor ring decomposition with sparse representation. Mach. Learn. Sci. Technol. **2021**, 2, 035008.
- Zhao, M.; Li, W.; Li, L.; Ma, P.; Cai, Z.; Tao, R. Three-order tensor creation and tucker decomposition for infrared small-target detection. IEEE Trans. Geosci. Remote Sens. **2021**, 60, 1–16.
- Zadeh, A.; Zellers, R.; Pincus, E.; Morency, L.P. MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv **2016**, arXiv:1606.06259.
- Zadeh, A.B.; Liang, P.P.; Poria, S.; Cambria, E.; Morency, L.P. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; Volume 1, Long Papers, pp. 2236–2246.
- Zhang, H. The Prosody of Fluent Repetitions in Spontaneous Speech. In Proceedings of the 10th International Conference on Speech Prosody 2020, Hong Kong, China, 25–28 May 2020; pp. 759–763.
- Kamyab, M.; Liu, G.; Adjeisah, M. Attention-based CNN and Bi-LSTM model based on TF-IDF and glove word embedding for sentiment analysis. Appl. Sci. **2021**, 11, 11255.
- Khalane, A.; Shaikh, T. Context-Aware Multimodal Emotion Recognition. In Proceedings of the International Conference on Information Technology and Applications; Springer: Berlin/Heidelberg, Germany, 2022; pp. 51–61.
- Melinte, D.O.; Vladareanu, L. Facial expressions recognition for human–robot interaction using deep convolutional neural networks with rectified adam optimizer. Sensors **2020**, 20, 2393.
- Hashemi, A.; Dowlatshahi, M.B. MLCR: A fast multi-label feature selection method based on K-means and L2-norm. In Proceedings of the 2020 25th International Computer Conference, Computer Society of Iran (CSICC), Tehran, Iran, 1–2 January 2020; pp. 1–7.
- Xia, H.; Yang, Y.; Pan, X.; Zhang, Z.; An, W. Sentiment analysis for online reviews using conditional random fields and support vector machines. Electron. Commer. Res. **2020**, 20, 343–360.
- Zhang, C.; Yang, Z.; He, X.; Deng, L. Multimodal intelligence: Representation learning, information fusion, and applications. IEEE J. Sel. Top. Signal Process. **2020**, 14, 478–493.
- Lian, Z.; Liu, B.; Tao, J. CTNet: Conversational Transformer Network for Emotion Recognition. IEEE/ACM Trans. Audio, Speech, Lang. Process. **2021**, 29, 985–1000.

**Figure 1.** The relatively low-order representation subspaces consist of coarse-grained sentiment characteristics, while the relatively high-order subspaces include fine-grained properties. Focusing only on a single fixed-order representation subspace (1-order, 2-order, or 3-order) may result in locally optimal performance of the sentiment analysis model.

**Figure 2.** The scheme of the mix-order polynomial tensor pooling (MOPTP) block for fusing the ${\mathbf{x}}_{1}$, ${\mathbf{x}}_{2}$, and ${\mathbf{x}}_{3}$ modalities. First, ${\mathbf{x}}_{12\cdots M}=[{\mathbf{x}}_{1},{\mathbf{x}}_{2},{\mathbf{x}}_{3}]$ is used to construct the $N$-order multimodal representation subspace ${\mathcal{X}}^{N}$ via the tensor product, e.g., ${\mathcal{X}}^{3}=\mathbf{x}\otimes\mathbf{x}\otimes\mathbf{x}$, and ${\mathcal{W}}^{N}$ indicates the $N$-order weight tensor. Furthermore, MOPTP employs a pooling function to adaptively activate the more discriminative sentiment properties among representation subspaces of varying orders (${\mathcal{X}}^{1}$, ${\mathcal{X}}^{2}$, and ${\mathcal{X}}^{3}$).

**Figure 3.** TMOPFN-L3-S1. At the first hidden layer H1, a MOPTP block is used to analyze the receptive "window" that covers features across all four time steps and three modalities, leading to the new modality representation 'H1-1'. At the second hidden layer H2, MOPTP is used to model local interactions in "windows" covering two time steps and two modalities, yielding the new modality representations 'H2-1', 'H2-2', and 'H2-3'. The tree-based structure naturally allows multiple sentiment analysis strategies to be applied on the same network layer simultaneously, resulting in a significant improvement in expressive power.

**Figure 4.**TMOPFN-L3-S2. Compared to TMOPFN-L3-S1, TMOPFN-L3-S2 introduces the skip operation to the fusion framework, allowing for the additional information captured from the preceding layer.

**Figure 5.** TMOPFN-L3-S3. Taking advantage of the segmentation fusion strategy and the skip operation, each modality can serve as the backbone and guide the multimodal fusion process in diverse fashions. This contributes to more fine-grained and explicit correlations among multiple modalities.

**Figure 6.** Results of the effect of the order of the multimodal sentiment subspace on IEMOCAP and CMU-MOSI. The fusion operation is associated with a single fixed-order sentiment subspace.

**Figure 7.**Results of the effect of mixed-order sentiment subspaces on IEMOCAP and CMU-MOSI. The fusion operation is associated with multiple mixed-order sentiment subspaces (MOPTP).

**Figure 8.**Results of the effect of window-size in temporal domain on CMU-MOSI. The validation framework is TMOPFN-L2.

**Figure 9.**Results on predictions of ‘PTP’ and ‘MOPTP’ on IEMOCAP. Specifically, the ‘n subspaces’ attached to the blue bar indicates a MOPTP with n high-order subspaces.

**Figure 12.** Results on predictions w.r.t. the activation method on CMU-MOSI. The MOPTP is associated with two high-order subspaces.

**Table 1.**Specifications of network architecture of TMOPFN and MOPFN. MOPFN indicates that we replace the ’PTP’ with ’MOPTP’ in the original HPFN. [-] indicates the configuration of a specific layer. ${\mathtt{MOPTP}}_{m}^{k}$ denotes the ‘m’th fused feature node in the layer ‘k’.

Model | Description of Layer-Wise Configuration |
---|---|

MOPFN-L1 | [${\mathtt{MOPTP}}_{1}^{o}(\mathtt{a},\mathtt{v},\mathtt{l})$] |

MOPFN-L2 | [${\mathtt{MOPTP}}_{1}^{h1}(\mathtt{a},\mathtt{v})$, ${\mathtt{MOPTP}}_{2}^{h1}(\mathtt{v},\mathtt{l})$, ${\mathtt{MOPTP}}_{3}^{h1}(\mathtt{a},\mathtt{l})$] – [${\mathtt{MOPTP}}_{1}^{o}({\mathtt{MOPTP}}_{1}^{h1},{\mathtt{MOPTP}}_{2}^{h1},{\mathtt{MOPTP}}_{3}^{h1})$] |

MOPFN-L2-S1 | [${\mathtt{MOPTP}}_{1}^{h1}(\mathtt{a},\mathtt{v},\mathtt{l})$] – [${\mathtt{MOPTP}}_{1}^{o}({\mathtt{MOPTP}}_{1}^{h1},\mathtt{a},\mathtt{v},\mathtt{l})$] |

MOPFN-L2-S2 | [${\mathtt{MOPTP}}_{1}^{h1}(\mathtt{a},\mathtt{v})$, ${\mathtt{MOPTP}}_{2}^{h1}(\mathtt{v},\mathtt{l})$, ${\mathtt{MOPTP}}_{3}^{h1}(\mathtt{a},\mathtt{l})$] – [${\mathtt{MOPTP}}_{1}^{o}({\mathtt{MOPTP}}_{1}^{h1},{\mathtt{MOPTP}}_{2}^{h1},{\mathtt{MOPTP}}_{3}^{h1},\mathtt{a},\mathtt{v},\mathtt{l})$] |

MOPFN-L3 | [${\mathtt{MOPTP}}_{1}^{h1}(\mathtt{a},\mathtt{v})$, ${\mathtt{MOPTP}}_{2}^{h1}(\mathtt{v},\mathtt{l})$, ${\mathtt{MOPTP}}_{3}^{h1}(\mathtt{a},\mathtt{l})$] – [${\mathtt{MOPTP}}_{1}^{h2}({\mathtt{MOPTP}}_{1}^{h1},{\mathtt{MOPTP}}_{2}^{h1})$, ${\mathtt{MOPTP}}_{2}^{h2}({\mathtt{MOPTP}}_{1}^{h1},{\mathtt{MOPTP}}_{3}^{h1})$, ${\mathtt{MOPTP}}_{3}^{h2}({\mathtt{MOPTP}}_{2}^{h1},{\mathtt{MOPTP}}_{3}^{h1})$] – [${\mathtt{MOPTP}}_{1}^{o}({\mathtt{MOPTP}}_{1}^{h2},{\mathtt{MOPTP}}_{2}^{h2},{\mathtt{MOPTP}}_{3}^{h2})$] |

TMOPFN-L3-S1 | [${\mathtt{MOPTP}}_{1}^{h1}(\mathtt{a},\mathtt{v},\mathtt{l})$] – [${\mathtt{MOPTP}}_{1}^{h2}({\mathtt{MOPTP}}_{1}^{h1},\mathtt{l})$, ${\mathtt{MOPTP}}_{2}^{h2}(\mathtt{l},\mathtt{a})$, ${\mathtt{MOPTP}}_{3}^{h2}(\mathtt{a},\mathtt{v})$, ${\mathtt{MOPTP}}_{4}^{h2}(\mathtt{v},\mathtt{l})$] – [${\mathtt{MOPTP}}_{1}^{o}({\mathtt{MOPTP}}_{1}^{h2},{\mathtt{MOPTP}}_{2}^{h2},{\mathtt{MOPTP}}_{3}^{h2},{\mathtt{MOPTP}}_{4}^{h2})$] |

TMOPFN-L3-S2 | [${\mathtt{MOPTP}}_{1}^{h1}(\mathtt{a},\mathtt{v})$, ${\mathtt{MOPTP}}_{2}^{h1}(\mathtt{v},\mathtt{l})$, ${\mathtt{MOPTP}}_{3}^{h1}(\mathtt{a},\mathtt{l})$] – [${\mathtt{MOPTP}}_{1}^{h2}(\mathtt{a},\mathtt{v},\mathtt{l})$] – [${\mathtt{MOPTP}}_{1}^{o}({\mathtt{MOPTP}}_{1}^{h1},{\mathtt{MOPTP}}_{2}^{h1},{\mathtt{MOPTP}}_{3}^{h1},{\mathtt{MOPTP}}_{3}^{h2})$] |

TMOPFN-L3-S3 | [${\mathtt{MOPTP}}_{1}^{h1}(\mathtt{a},\mathtt{l})$, ${\mathtt{MOPTP}}_{2}^{h1}(\mathtt{v},\mathtt{a})$, ${\mathtt{MOPTP}}_{3}^{h1}(\mathtt{v},\mathtt{l})$] – [${\mathtt{MOPTP}}_{1}^{h2}({\mathtt{MOPTP}}_{1}^{h1},\mathtt{v})$, ${\mathtt{MOPTP}}_{2}^{h2}({\mathtt{MOPTP}}_{2}^{h1},\mathtt{l})$, ${\mathtt{MOPTP}}_{3}^{h2}({\mathtt{MOPTP}}_{3}^{h1},\mathtt{a})$] – [${\mathtt{MOPTP}}_{1}^{o}({\mathtt{MOPTP}}_{1}^{h2},{\mathtt{MOPTP}}_{2}^{h2},{\mathtt{MOPTP}}_{3}^{h2})$] |

MOPFN-L4 | [${\mathtt{MOPTP}}_{1}^{h1}(\mathtt{a},\mathtt{v})$, ${\mathtt{MOPTP}}_{2}^{h1}(\mathtt{v},\mathtt{l})$, ${\mathtt{MOPTP}}_{3}^{h1}(\mathtt{a},\mathtt{l})$] – [${\mathtt{MOPTP}}_{1}^{h2}({\mathtt{MOPTP}}_{1}^{h1},{\mathtt{MOPTP}}_{2}^{h1})$, ${\mathtt{MOPTP}}_{2}^{h2}({\mathtt{MOPTP}}_{1}^{h1},{\mathtt{MOPTP}}_{3}^{h1})$, ${\mathtt{MOPTP}}_{3}^{h2}({\mathtt{MOPTP}}_{2}^{h1},{\mathtt{MOPTP}}_{3}^{h1})$] – [${\mathtt{MOPTP}}_{1}^{h3}({\mathtt{MOPTP}}_{1}^{h2},{\mathtt{MOPTP}}_{2}^{h2})$, ${\mathtt{MOPTP}}_{2}^{h3}({\mathtt{MOPTP}}_{1}^{h2},{\mathtt{MOPTP}}_{3}^{h2})$, ${\mathtt{MOPTP}}_{3}^{h3}({\mathtt{MOPTP}}_{2}^{h2},{\mathtt{MOPTP}}_{3}^{h2})$] – [${\mathtt{MOPTP}}_{1}^{o}({\mathtt{MOPTP}}_{1}^{h3},{\mathtt{MOPTP}}_{2}^{h3},{\mathtt{MOPTP}}_{3}^{h3})$] |
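The layer-wise configurations in Table 1 can be read as trees of fusion ‘windows’. As a minimal sketch (not the authors’ code, and the node names are purely illustrative labels), the MOPFN-L2 row could be represented as a list of layers, each holding its MOPTP windows and their inputs:

```python
# A hypothetical representation of the MOPFN-L2 configuration from Table 1:
# each layer is a list of (window name, fused inputs) pairs, where
# a = audio, v = video, l = text (language).
mopfn_l2 = [
    # layer h1: three pairwise MOPTP windows over the raw modalities
    [("MOPTP_1^h1", ("a", "v")),
     ("MOPTP_2^h1", ("v", "l")),
     ("MOPTP_3^h1", ("a", "l"))],
    # output layer o: a single window fusing the three h1 nodes
    [("MOPTP_1^o", ("MOPTP_1^h1", "MOPTP_2^h1", "MOPTP_3^h1"))],
]

# The number of windows per layer decreases with depth (N_1 = 3 > N_2 = 1),
# matching the N_1 > N_2 > ... > N_L pattern noted in Table 2's caption.
n_windows = [len(layer) for layer in mopfn_l2]
print(n_windows)  # [3, 1]
```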

**Table 2.**Comparison of model complexity for TFN, LMF and our TMOPFN. Specifically, ${I}_{y}$ is the length of the output feature, M is the number of modalities, R is the tensor rank, [T, S] is the local ‘window’ size with $S\le M$, and ${I}_{t,m}$ is the dimension of the message from modality m at time t. The number of parameters of the L-layer TMOPFN grows linearly with the total number of MOPTP ‘windows’ ${\sum}_{l=1}^{L}{N}_{l}$, where ${N}_{l}$ is the number of ‘windows’ at layer $l\in \left[L\right]$. In practice, ${N}_{l}$ is small and decreases with depth, i.e., ${N}_{1}>{N}_{2}>\cdots >{N}_{L}$. Overall, the parameter count of TMOPFN is comparable to or slightly larger than that of LMF, but significantly smaller than that of TFN.

Models | Param |
---|---|

TFN [non-temporal] | $\mathcal{O}({I}_{y}{\prod}_{m=1}^{M}{I}_{m})$ |

LMF [non-temporal] | $\mathcal{O}({I}_{y}R({\sum}_{m=1}^{M}{I}_{m}))$ |

MOPTP [temporal] | $\mathcal{O}({I}_{y}R({\sum}_{t=1}^{T}{\sum}_{m=1}^{S}{I}_{t,m}))$ |

TMOPFN (L layers) [temporal] | $\mathcal{O}({I}_{y}R({\sum}_{l=1}^{L}{N}_{l})({\sum}_{t=1}^{T}{\sum}_{m=1}^{S}{I}_{t,m}))$ |
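To make the scalings in the table above concrete, the sketch below evaluates the parameter-count formulas for a hypothetical three-modality setting. All dimensions here are illustrative assumptions, not values from the paper:

```python
from math import prod

# Hypothetical dimensions (assumptions for illustration only):
I_y, R = 128, 4              # output feature length, tensor rank
I_m = [300, 35, 74]          # per-modality input dims (e.g., text, video, audio)
T, S = 2, 2                  # local window spans T time steps and S modalities
I_tm = 64                    # dim of each windowed message I_{t,m} (uniform here)

# TFN: O(I_y * prod_m I_m) -- the full outer product is exponential in M.
tfn = I_y * prod(I_m)
# LMF: O(I_y * R * sum_m I_m) -- low-rank factorization is linear in M.
lmf = I_y * R * sum(I_m)
# MOPTP: O(I_y * R * sum_t sum_m I_{t,m}) -- linear in the local window size.
moptp = I_y * R * T * S * I_tm

print(tfn, lmf, moptp)
```

Even with these modest dimensions, the full-tensor TFN cost dwarfs the low-rank alternatives, which is the point of Table 2.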

**Table 3.**Comparison with baseline models on CMU-MOSI, IEMOCAP and CMU-MOSEI.

Models | Input | Parameters (M) | CMU-MOSI | IEMOCAP | CMU-MOSEI | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

MAE | Corr | Acc-2 | F1 | Acc-7 | F1-Happy | F1-Sad | F1-Angry | F1-Neutral | MAE | Corr | Acc-2 | F1 | Acc-7 | |||

SVM [54] | (A+V+T) | - | 1.864 | 0.057 | 50.2 | 50.1 | 17.5 | 81.5 | 78.8 | 82.4 | 64.9 | 0.77 | 0.46 | 73.9 | 73.6 | 39.9 |

DF [55] | (A+V+T) | - | 1.143 | 0.518 | 72.3 | 72.1 | 26.8 | 81.0 | 81.2 | 65.4 | 44.0 | 0.72 | 0.51 | 74.0 | 72.5 | 43.5 |

BC-LSTM [33] | (A+V+T) | 1.7 M | 1.079 | 0.581 | 73.9 | 73.9 | 28.7 | 81.7 | 81.7 | 84.2 | 64.1 | 0.72 | 0.51 | 75.8 | 75.5 | 44.6 |

MV-LSTM [32] | (A+V+T) | - | 1.019 | 0.601 | 73.9 | 74.0 | 33.2 | 81.3 | 74.0 | 84.3 | 66.7 | 0.72 | 0.52 | 76.4 | 76.4 | 43.5 |

MARN [34] | (A+V+T) | - | 0.968 | 0.625 | 77.1 | 77.0 | 34.7 | 83.6 | 81.2 | 84.2 | 65.9 | 0.73 | 0.51 | 75.9 | 75.8 | 43.2 |

MFN [8] | (A+V+T) | 0.5 M | 0.965 | 0.632 | 77.4 | 77.3 | 34.1 | 84.0 | 82.1 | 83.7 | 69.2 | 0.72 | 0.52 | 76.0 | 76.0 | 43.2 |

TFN [24] | (A+V+T) | 12.5 M | 0.970 | 0.633 | 73.9 | 73.4 | 32.1 | 83.6 | 82.8 | 84.2 | 65.4 | 0.72 | 0.52 | 74.8 | 75.4 | 44.7 |

LMF [27] | (A+V+T) | 1.1 M | 0.912 | 0.668 | 76.4 | 75.7 | 32.8 | 85.8 | 85.9 | 89.0 | 71.7 | - | - | - | - | - |

MFM [35] | (A+V+T) | - | 0.951 | 0.662 | 78.1 | 78.1 | 36.2 | 85.8 | 86.1 | 86.7 | 68.1 | - | - | - | - | - |

LMF-MulT [36] | (A+V+T) | - | 1.016 | 0.647 | 77.9 | 77.9 | 32.4 | 84.1 | 83.4 | 86.2 | 70.8 | - | - | - | - | - |

HPFN-L1, 1 subspace (P = [8]) | (A+V+T) | 0.09 M | 0.968 | 0.648 | 77.2 | 77.2 | 36.9 | 85.7 | 86.5 | 87.9 | 71.8 | 0.71 | 0.53 | 75 | 75 | 45.2 |

TMOPFN-L1, 2 subspaces (P = [1, 2]) | (A+V+T) | 0.09 M | 0.938 | 0.678 | 79.6 | 79.6 | 37.9 | 86.0 | 86.6 | 88.6 | 72.5 | 0.71 | 0.55 | 75.3 | 75.5 | 45.4 |

TMOPFN-L2, 2 subspaces (P = [1, 2]) | (A+V+T) | 0.11 M | 0.943 | 0.659 | 78.6 | 78.7 | 38.3 | 87.4 | 86.8 | 90.2 | 72.6 | 0.71 | 0.55 | 75.9 | 75.9 | 45.3 |

TMOPFN-L3-S3, 2 subspaces (P = [1, 2]) | (A+V+T) | 0.12 M | 0.949 | 0.652 | 77.6 | 77.7 | 37.3 | 85.8 | 87.4 | 88.8 | 73.1 | 0.71 | 0.55 | 75.9 | 75.7 | 45.3 |

HPFN (previous-version) [29] | (A+V+T) | 0.11 M | 0.945 | 0.672 | 77.5 | 77.4 | 36.9 | 86.2 | 86.6 | 88.8 | 72.5 | - | - | - | - | - |

TMOPFN | (A+V+T) | 0.12 M | 0.908 | 0.678 | 79.6 | 79.6 | 39.4 | 88.2 | 87.4 | 90.2 | 73.9 | 0.70 | 0.55 | 76.1 | 76.1 | 45.6 |

${\triangle}_{SOTA}$ | - | - | $\downarrow 0.004$ | $\uparrow 0.01$ | $\uparrow 1.5$ | $\uparrow 1.5$ | $\uparrow 3.2$ | $\uparrow 2.4$ | $\uparrow 1.3$ | $\uparrow 1.2$ | $\uparrow 2.2$ | $\downarrow 0.01$ | $\uparrow 0.03$ | - | - | $\uparrow 0.9$ |

**Table 4.**Results for the number of input modalities on CMU-MOSI and CMU-MOSEI, and for emotion recognition on IEMOCAP.

Models | Input | CMU-MOSI | IEMOCAP | CMU-MOSEI | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

MAE | Corr | Acc-2 | F1 | Acc-7 | F1-Happy | F1-Sad | F1-Angry | F1-Neutral | MAE | Corr | Acc-2 | F1 | Acc-7 | ||

TMOPFN | (A) | 1.373 | 0.296 | 59.5 | 59.5 | 20.6 | 68.2 | 67.9 | 71.2 | 64.4 | 0.99 | 0.30 | 61.2 | 61.2 | 22.6 |

TMOPFN | (V) | 1.348 | 0.227 | 60.3 | 60.1 | 22.0 | 67.7 | 66.5 | 70.3 | 63.9 | 1.09 | 0.28 | 60.1 | 60.4 | 22.2 |

TMOPFN | (T) | 0.996 | 0.594 | 73.9 | 73.6 | 33.5 | 78.1 | 79.3 | 81.9 | 68.7 | 0.81 | 0.42 | 71.8 | 71.9 | 36.6 |

TMOPFN | (A+V) | 1.169 | 0.305 | 60.9 | 60.6 | 23.7 | 69.9 | 68.8 | 72.7 | 66.3 | 0.91 | 0.33 | 63.4 | 63.9 | 25.6 |

TMOPFN | (A+T) | 0.982 | 0.612 | 74.4 | 74.8 | 34.6 | 80.7 | 81.2 | 82.4 | 70.7 | 0.78 | 0.45 | 72.6 | 72.8 | 38.2 |

TMOPFN | (V+T) | 0.990 | 0.606 | 74.1 | 74.2 | 34.0 | 80.1 | 81.0 | 82.2 | 69.8 | 0.80 | 0.44 | 72.3 | 72.1 | 37.9 |

TMOPFN | (A+V+T) | 0.908 | 0.678 | 79.6 | 79.6 | 39.4 | 88.2 | 87.4 | 90.2 | 73.9 | 0.70 | 0.55 | 76.1 | 76.1 | 45.6 |

**Table 5.**Comparison with recent state-of-the-art models on CMU-MOSI.

Models | Input | CMU-MOSI | ||||
---|---|---|---|---|---|---|

MAE | Corr | Acc-2 | F1 | Acc-7 | ||

Multimodal-Graph [38] | (A+V+T) | 0.923 | 0.680 | 80.1 | 80.0 | 31.9 |

LMF-MulT [36] | (A+V+T) | 0.941 | 0.671 | 78.1 | 78.1 | 34.2 |

TMOPFN | (A+V+T) | 0.908 | 0.678 | 79.6 | 79.6 | 39.4 |

**Table 6.**Comparison with recent state-of-the-art models on IEMOCAP (per-emotion F1).

Models | Input | IEMOCAP | |||
---|---|---|---|---|---|

F1-Happy | F1-Sad | F1-Angry | F1-Neutral | ||

HGMF [40] | (A+V+T) | 88.1 | 84.7 | 87.9 | - |

MTGAT [39] | (A+V+T) | 87.8 | 86.5 | 87.1 | 72.9 |

CTNet [56] | (A+V+T) | 83.0 | 86.3 | 80.2 | 83.9 |

LMF-MulT [36] | (A+V+T) | 84.2 | 83.8 | 86.1 | 71.2 |

TMOPFN | (A+V+T) | 88.2 | 87.4 | 90.2 | 73.9 |

**Table 7.**The UA (unweighted accuracy) and WA (weighted accuracy) of TMOPFN on CMU-MOSI, CMU-MOSEI and IEMOCAP.

Metric | CMU-MOSI | IEMOCAP | CMU-MOSEI | |||
---|---|---|---|---|---|---|

- | Happy | Sad | Angry | Neutral | - | |

UA | 79.9 | 88.13 | 87.32 | 90.45 | 73.18 | 75.6 |

WA | 77.4 | 87.29 | 86.14 | 88.84 | 71.63 | 73.2 |
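The UA and WA metrics in Table 7 follow their standard definitions: UA is the unweighted (macro) average of per-class recalls, while WA weights each class by its frequency, i.e., the overall fraction of correct predictions. A minimal sketch with hypothetical toy labels (not the paper’s data):

```python
def ua_wa(y_true, y_pred):
    """UA: mean of per-class recalls (each class counts equally).
    WA: overall accuracy (classes weighted by their frequency)."""
    classes = sorted(set(y_true))
    per_class = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        per_class.append(sum(y_pred[i] == c for i in idx) / len(idx))
    ua = sum(per_class) / len(classes)
    wa = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)
    return ua, wa

# Toy example with an imbalanced class (hypothetical labels):
y_true = ["happy"] * 8 + ["sad"] * 2
y_pred = ["happy"] * 6 + ["sad"] * 2 + ["sad"] * 2  # 6/8 happy, 2/2 sad correct
ua, wa = ua_wa(y_true, y_pred)
print(ua, wa)  # 0.875 0.8
```

With imbalanced classes the two metrics diverge, which is why Table 7 reports both.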

**Table 8.**The mean, variance and standard deviation of TMOPFN on CMU-MOSI, CMU-MOSEI and IEMOCAP. Note that each metric is computed over 5 runs (5-fold cross-validation).

Models | Value | CMU-MOSI | IEMOCAP | CMU-MOSEI | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

MAE | Corr | Acc-2 | F1 | Acc-7 | F1-Happy | F1-Sad | F1-Angry | F1-Neutral | MAE | Corr | Acc-2 | F1 | Acc-7 | ||

TMOPFN | mean | 0.908 | 0.678 | 79.6 | 79.6 | 39.4 | 88.2 | 87.4 | 90.2 | 73.9 | 0.70 | 0.55 | 76.1 | 76.1 | 45.6 |

TMOPFN | variance | $2.08\times {10}^{-4}$ | $6.07\times {10}^{-5}$ | $2.93\times {10}^{-4}$ | $4.5\times {10}^{-3}$ | $7.08\times {10}^{-4}$ | $4.9\times {10}^{-4}$ | $7.67\times {10}^{-5}$ | $9.97\times {10}^{-5}$ | $2.74\times {10}^{-4}$ | $4.33\times {10}^{-7}$ | $5.47\times {10}^{-7}$ | $3.47\times {10}^{-5}$ | $1.49\times {10}^{-4}$ | $9.12\times {10}^{-5}$ |

TMOPFN | standard deviation | $1.44\times {10}^{-2}$ | $7.79\times {10}^{-3}$ | $1.71\times {10}^{-2}$ | $6.71\times {10}^{-2}$ | $2.8\times {10}^{-2}$ | $2.2\times {10}^{-2}$ | $8.75\times {10}^{-3}$ | $9.98\times {10}^{-3}$ | $1.65\times {10}^{-2}$ | $6.58\times {10}^{-4}$ | $7.39\times {10}^{-4}$ | $5.89\times {10}^{-3}$ | $1.22\times {10}^{-2}$ | $9.54\times {10}^{-3}$ |
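The statistics in Table 8 are summaries over the 5 cross-validation runs; the standard deviation is simply the square root of the reported variance. A sketch with hypothetical fold scores (the paper does not publish per-fold numbers):

```python
import statistics

# Hypothetical Acc-2 scores from 5 folds (assumed values for illustration):
fold_scores = [79.4, 79.8, 79.5, 79.7, 79.6]

mean = statistics.mean(fold_scores)
var = statistics.pvariance(fold_scores)  # population variance over the 5 folds
std = statistics.pstdev(fold_scores)     # std = sqrt(variance)

print(mean, var, std)
assert abs(std - var ** 0.5) < 1e-12     # consistency check, as in Table 8
```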

**Table 9.**Results of TMOPFN and HPFN on non-temporal multimodal features, examining the effect of depth and dense connectivity.

Models | IEMOCAP | CMU-MOSI | CMU-MOSEI | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

F1-Happy | F1-Sad | F1-Angry | F1-Neutral | MAE | Corr | Acc-2 | F1 | Acc-7 | MAE | Corr | Acc-2 | F1 | Acc-7 | |

HPFN-L1, P = [2] | 85.7 | 86.2 | 87.8 | 71.9 | 0.973 | 0.635 | 77.1 | 77.0 | 35.9 | 0.71 | 0.54 | 75.0 | 75.3 | 45.2 |

HPFN-L2, P = [2, 2] | 86.2 | 86.6 | 88.8 | 72.5 | 0.958 | 0.652 | 77.1 | 77.1 | 36.3 | 0.71 | 0.54 | 75.4 | 75.5 | 45.5 |

HPFN-L2-S1, P = [2, 2] | 86.2 | 86.7 | 88.9 | 72.6 | 0.959 | 0.654 | 77.3 | 77.2 | 36.5 | 0.71 | 0.54 | 75.1 | 75.3 | 45.3 |

HPFN-L2-S2, P = [2, 2] | 86.2 | 86.7 | 89.0 | 72.7 | 0.957 | 0.656 | 77.3 | 77.3 | 36.5 | 0.70 | 0.55 | 75.3 | 75.4 | 45.6 |

HPFN-L3, P = [2, 2, 1] | 86.1 | 86.8 | 88.3 | 72.7 | 0.960 | 0.651 | 76.8 | 76.8 | 36.0 | 0.71 | 0.55 | 75.5 | 75.3 | 45.5 |

HPFN-L4, P = [2, 2, 2, 1] | 85.8 | 86.4 | 88.1 | 72.5 | 0.992 | 0.634 | 76.6 | 76.5 | 34.6 | 0.71 | 0.55 | 75.3 | 75.1 | 45.3 |

TMOPFN-L3-S1 | 86.6 | 86.4 | 88.4 | 72.1 | 0.960 | 0.641 | 76.1 | 76.1 | 36.4 | 0.71 | 0.54 | 75.0 | 75.3 | 45.2 |

TMOPFN-L3-S2 | 85.8 | 86.2 | 88.8 | 72.7 | 0.968 | 0.648 | 76.0 | 76.0 | 35.4 | 0.71 | 0.54 | 75.0 | 75.2 | 45.1 |

TMOPFN-L3-S3 | 85.8 | 87.4 | 88.8 | 73.1 | 0.949 | 0.652 | 77.6 | 77.7 | 37.3 | 0.71 | 0.54 | 75.1 | 75.4 | 45.5 |

**Table 10.**Comparison with DenseNet-based fusion baselines on CMU-MOSI and IEMOCAP.

Models | Parameters (M) | Input | CMU-MOSI | IEMOCAP | |||||||
---|---|---|---|---|---|---|---|---|---|---|---|

MAE | Corr | Acc-2 | F1 | Acc-7 | F1-Happy | F1-Sad | F1-Angry | F1-Neutral | |||

2D-DenseNet | 6.97 M | (A+V+T) | 1.090 | 0.573 | 74.9 | 74.9 | 26.8 | 84.6 | 85.6 | 87.5 | 70.3 |

3D-DenseNet | 6.97 M | (A+V+T) | 1.054 | 0.630 | 75.9 | 76.0 | 29.0 | 82.5 | 84.4 | 88.3 | 67.7 |

HPFN (previous-version) [29] | 0.11 M | (A+V+T) | 0.945 | 0.672 | 77.5 | 77.4 | 36.9 | 86.2 | 86.6 | 88.8 | 72.5 |

TMOPFN | 0.12 M | (A+V+T) | 0.908 | 0.678 | 79.6 | 79.6 | 39.4 | 88.2 | 87.4 | 90.2 | 73.9 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Tang, J.; Hou, M.; Jin, X.; Zhang, J.; Zhao, Q.; Kong, W.
Tree-Based Mix-Order Polynomial Fusion Network for Multimodal Sentiment Analysis. *Systems* **2023**, *11*, 44.
https://doi.org/10.3390/systems11010044
