A Parallel Multi-Modal Factorized Bilinear Pooling Fusion Method Based on the Semi-Tensor Product for Emotion Recognition

Multi-modal fusion can exploit complementary information from various modalities and improve the accuracy of prediction or classification tasks. In this paper, we propose a parallel, multi-modal, factorized, bilinear pooling method based on a semi-tensor product (STP) for information fusion in emotion recognition. Initially, we apply the STP to factorize a high-dimensional weight matrix into two low-rank factor matrices without dimension matching constraints. Next, we project the multi-modal features to the low-dimensional matrices and perform multiplication based on the STP to capture the rich interactions between the features. Finally, we utilize an STP-pooling method to reduce the dimensionality to get the final features. This method can achieve the information fusion between modalities of different scales and dimensions and avoids data redundancy due to dimension matching. Experimental verification of the proposed method on the emotion-recognition task using the IEMOCAP and CMU-MOSI datasets showed a significant reduction in storage space and recognition time. The results also validate that the proposed method improves the performance and reduces both the training time and the number of parameters.

Emotion recognition is an active research topic in the field of multi-modal fusion. It aims to integrate the video, audio, and text modalities by employing fusion strategies at the feature, model, and decision levels [16]. Previous works [17,18] merged the modalities in a straightforward way; they demonstrated feature-level fusion on the emotion-recognition task but could not model the complicated relationships among modalities. Decision-level fusion [19,20] is usually implemented by combining the individual classification scores and is therefore unable to capture the mutual correlation among different modalities well. In [21], model-level fusion was performed with hidden Markov models, which facilitated the establishment of optimal connections among modalities according to the maximum entropy principle and the maximum mutual information criterion.
More recently, tensor-product representations have been used extensively for multi-modal emotion-recognition tasks because they can directly model the dynamic interactions both between modalities (inter-modality) and within each modality (intra-modality) [3,22-24]. Zadeh et al. [25] proposed the tensor fusion network (TFN), which uses the tensor cross-product to calculate the interactions between different features and learns both inter-modality and intra-modality dynamics in an end-to-end manner. It performs fusion at the feature level. Unfortunately, as the feature dimension increases, the number of parameters in the model increases exponentially, leading to high computation and memory costs. To tackle this problem, low-rank multi-modal fusion (LMF) [26], the multi-modal, factorized, bilinear pooling (MFB) model [27], the memory fusion network (MFN) [28], and the multi-modal transformer (MulT) [14] have been proposed to improve processing efficiency and performance. Sahay et al. proposed the LMF-MulT [15] method, which builds upon the MulT and applies transformers to fused multi-modal signals, aiming to capture all inter-modal signals via the LMF. It trains quickly and uses few parameters. It performs fusion at the model or decision level. However, these methods must satisfy the dimension-matching conditions of matrix multiplication. In real cases, the features of various modalities have different scales and dimensions. Consequently, the product either cannot be computed at all or requires the dimensions to be matched first, which leads to data redundancy. Most importantly, huge amounts of data require expensive hardware to store; storage thus limits the applicability of such methods on resource-constrained devices, such as mobile phones and wearable devices. There is an urgent need to reduce storage space and runtime to enable deployment on mobile and under-resourced devices.
To solve the aforementioned problems, we introduce a generalization of the conventional matrix product [29], i.e., the semi-tensor product (STP), into MFB pooling for multi-modal fusion. The STP does not depend on the dimensionality of the matrices or tensors it operates on. Due to its flexibility, the STP has been used in many fields. In compressed sensing [30], the STP has been introduced to replace the conventional matrix product in the sampling model. For visual question answering [31], the block-wise operation of the STP has been applied to multi-modal fusion. In digital watermarking, Chen et al. [32] proposed a general non-negative matrix factorization based on the STP.
In this paper, we propose a hierarchical fusion method named parallel, multi-modal, factorized bilinear pooling based on semi-tensor product (PFBP-STP). The proposed method improves the efficiency of fusion at the feature level and decision level for various text and audio/video-based tasks. More importantly, it can make information fusion between different scales and dimensions of modalities independent of the dimension matching conditions in matrix multiplication. We applied this computationally efficient and flexible method to the emotion-recognition task.
The main contributions of this paper can be summarized as follows: (1) We propose multi-modal, factorized, bilinear pooling based on the STP, which avoids the data redundancy caused by dimension matching and reduces the computational and memory costs. (2) We propose a parallel, multi-modal, factorized, bilinear pooling method based on the STP, which captures the rich interactions between the features by hierarchical fusion and realizes the arbitrary combination and fusion of three modalities. (3) We evaluate the proposed methodology experimentally on two multi-modal datasets.

Notation and Preliminaries
In this section, we briefly review a matrix product named the semi-tensor product (STP) [29]. It is a generalization of the traditional matrix product and is applicable to two matrices of arbitrary dimensions. In addition, this generalization retains all the fundamental properties of the conventional matrix product. It has therefore become a powerful and convenient mathematical tool for investigating many problems that involve matrix expressions.
We provide some basic preliminaries of the STP [33,34], which serve as the necessary theoretical basis of the proposed method.

Definition 1.
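For two matrices $X \in \mathbb{R}^{m \times n}$ and $Y \in \mathbb{R}^{p \times q}$, the left STP is denoted by $X \ltimes Y$ and can be expressed as:
$$X \ltimes Y = (X \otimes I_{t/n})(Y \otimes I_{t/p}), \tag{1}$$
where $\otimes$ denotes the Kronecker product [35], $t = \mathrm{lcm}(n, p)$ is the least common multiple of $n$ and $p$, and $I_{t/n}$ and $I_{t/p}$ are identity matrices.
In Equation (1), set $n = p$. Then $t = n$, both identity matrices reduce to $I_1$, and the left STP reverts to ordinary matrix multiplication, $X \ltimes Y = XY$.

Definition 2.
Given a non-negative matrix $Z \in \mathbb{R}_{+}^{k \times l}$, we aim to find two non-negative matrices $X \in \mathbb{R}_{+}^{m \times n}$ and $Y \in \mathbb{R}_{+}^{p \times q}$ such that
$$Z \approx X \ltimes Y, \tag{2}$$
where $t = \mathrm{lcm}(n, p)$ is the least common multiple of $n$ and $p$, $k = m \cdot t/n$, and $l = q \cdot t/p$.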
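The left STP described above, i.e., the ordinary product of $X \otimes I_{t/n}$ and $Y \otimes I_{t/p}$ with $t = \mathrm{lcm}(n, p)$, can be sketched in a few lines of NumPy. This is only an illustration under our own naming: the function `left_stp` and the example sizes are assumptions and are not taken from the paper's implementation.

```python
import math

import numpy as np


def left_stp(X, Y):
    """Left semi-tensor product of X (m x n) and Y (p x q).

    Computes (X kron I_{t/n}) @ (Y kron I_{t/p}) with t = lcm(n, p).
    The name `left_stp` is illustrative, not from the paper.
    """
    X = np.atleast_2d(np.asarray(X, dtype=float))
    Y = np.atleast_2d(np.asarray(Y, dtype=float))
    n, p = X.shape[1], Y.shape[0]
    t = n * p // math.gcd(n, p)  # least common multiple of n and p
    return np.kron(X, np.eye(t // n)) @ np.kron(Y, np.eye(t // p))


rng = np.random.default_rng(0)

# Mismatched inner dimensions: n = 3, p = 2, so t = 6.
X = rng.random((2, 3))
Y = rng.random((2, 4))
Z = left_stp(X, Y)
print(Z.shape)  # (m * t / n, q * t / p) = (4, 12)

# Matched inner dimensions (n = p): the STP equals the ordinary product.
A = rng.random((2, 3))
B = rng.random((3, 5))
print(np.allclose(left_stp(A, B), A @ B))  # True
```

The first example also illustrates the shape relation of Definition 2: a 2 × 3 factor and a 2 × 4 factor generate a 4 × 12 matrix.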

Definition 3.
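Let $X \in \mathbb{R}^{NP}$ be a row vector and $W \in \mathbb{R}^{P}$ be a column vector. We split $X$ into $P$ equal-size blocks $X_1, X_2, \ldots, X_P$, such that $X_i \in \mathbb{R}^{N}$, $i = 1, \ldots, P$. The STP of $X$ and $W$, denoted by $X \ltimes W$, is given as
$$X \ltimes W = \sum_{i=1}^{P} X_i w_i \in \mathbb{R}^{N}, \tag{3}$$
where $w_i$ is the $i$-th entry of $W$.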
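For a row vector and a column vector, the STP amounts to a weighted sum of the blocks of the row vector, with the entries of the column vector as weights. The sketch below shows this and cross-checks it against the Kronecker form of the left STP. The function name `stp_row_col` and the numbers are illustrative assumptions.

```python
import numpy as np


def stp_row_col(x, w):
    """STP of a row vector x (length N*P) with a column vector w (length P)."""
    x = np.asarray(x, dtype=float).ravel()
    w = np.asarray(w, dtype=float).ravel()
    if x.size % w.size != 0:
        raise ValueError("len(x) must be a multiple of len(w)")
    blocks = x.reshape(w.size, -1)  # row i holds the block X_i of length N
    return w @ blocks               # sum_i w_i * X_i, a vector of length N


x = np.arange(6.0)                  # N = 2, P = 3; blocks [0, 1], [2, 3], [4, 5]
w = np.array([1.0, 10.0, 100.0])
out = stp_row_col(x, w)
print(out)  # [420. 531.]

# Cross-check against the Kronecker form: t = lcm(6, 3) = 6, so x is
# multiplied by I_1 and w by I_2.
kron_form = x[None, :] @ np.kron(w[:, None], np.eye(2))
print(np.allclose(out, kron_form.ravel()))  # True
```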

Methodology
In this section, we first describe the architecture of the proposed model. Then, we introduce the concept of STP to extend the idea of bilinear pooling. Finally, we propose a parallel, multi-modal, factorized, bilinear pooling method based on the STP (PFBP-STP).

Model Architecture
Our method first obtains the unimodal representations $x_1 \in \mathbb{R}^{I}$, $x_2 \in \mathbb{R}^{J}$, and $x_3 \in \mathbb{R}^{K}$ by passing the unimodal inputs (text, video, and audio data) through three sub-embedding networks, $f_l$, $f_v$, and $f_a$, respectively. Then, because the dimensionality differs among the features, we fuse one of these modalities with each of the other two separately, using multi-modal, factorized, bilinear pooling based on the STP. Finally, we employ a decision fusion layer that combines the output features $z_{12}$ and $z_{13}$ to improve the classification accuracy for emotion recognition. The basic architecture of the PFBP-STP is shown in Figure 1.

Multi-Modal, Factorized Bilinear Pooling
In this section, we revisit the multi-modal bilinear models and the MFB pooling model. Given two feature vectors $x_1 \in \mathbb{R}^{I}$ and $x_2 \in \mathbb{R}^{J}$, the multi-modal bilinear pooling is defined as follows:
$$y_i = x_1^{T} W_i x_2, \tag{4}$$
where $W_i \in \mathbb{R}^{I \times J}$ is a projection matrix and $y_i \in \mathbb{R}$ is the output of bilinear pooling.
To obtain a $K$-dimensional output $y = [y_1, \ldots, y_K]$, a tensor $W = [W_1, \ldots, W_K] \in \mathbb{R}^{I \times J \times K}$ needs to be learned. Although multi-modal bilinear pooling can effectively capture the rich interactions between multi-modal features, the tensor $W$ is high-dimensional and introduces a large number of parameters, which leads to high computation and memory costs and a greater risk of overfitting. It is known from [36,37] that the low-rank approximation of non-negative matrix factorization can reduce the dimensionality of the original matrix, along with the computational and memory costs. The two low-rank factor matrices obtained by factorization also have good interpretability.
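Inspired by the matrix factorization techniques [37], the projection matrix $W_i$ in Equation (4) can be factorized into two low-rank matrices:
$$W_i = U_i V_i^{T}, \tag{5}$$
with $U_i \in \mathbb{R}^{I \times d}$ and $V_i \in \mathbb{R}^{J \times d}$. Therefore, Equation (4) can be rewritten as
$$y_i = x_1^{T} U_i V_i^{T} x_2 = \mathbb{1}^{T} \left( U_i^{T} x_1 \circ V_i^{T} x_2 \right), \tag{6}$$
where $d$ is the latent dimensionality of the factorized matrices, $\mathbb{1} \in \mathbb{R}^{d}$ is an all-one vector, and the operation $\circ$ represents the element-wise multiplication of two feature vectors, i.e., the Hadamard product.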
According to Equation (6), stacking the $o$ outputs gives the following expression:
$$y = [y_1, \ldots, y_o]^{T}, \quad y_i = \mathbb{1}^{T} \left( U_i^{T} x_1 \circ V_i^{T} x_2 \right), \tag{7}$$
where $y \in \mathbb{R}^{o}$, and we need to learn two third-order tensors, i.e., $U = [U_1, \ldots, U_o] \in \mathbb{R}^{I \times d \times o}$ and $V = [V_1, \ldots, V_o] \in \mathbb{R}^{J \times d \times o}$, to obtain the output feature $y$. Generally, we reshape the tensors $U$ and $V$ as 2D matrices $\tilde{U} \in \mathbb{R}^{I \times do}$ and $\tilde{V} \in \mathbb{R}^{J \times do}$, respectively. We define the projections of feature vectors $x_1$ and $x_2$ onto these matrices as $\tilde{x}_1 = \tilde{U}^{T} x_1$ and $\tilde{x}_2 = \tilde{V}^{T} x_2$. The fused (final) vector $z$ can be obtained by summing non-overlapping windows of size $h$ over the Hadamard product of the projected vectors. We refer to the following model as the MFB pooling:
$$z = \mathrm{SumPooling}(\tilde{x}_1 \circ \tilde{x}_2, h). \tag{8}$$
The above traditional multi-modal bilinear pooling method directly projects the features onto the low-dimensional matrices and performs multiplication. In this process, two dimension-matching conditions must be satisfied, one in the matrix factorization and one in the multiplication. In Equation (5), the projection matrix $W_i$ and the two low-rank matrices $U_i$ and $V_i$ have to follow the dimension-matching constraints. In practice, the dimensions of multi-modal information, i.e., text, video, and audio, are different in the feature space. In this case, continuing to use the traditional matrix factorization method requires the dimensions to be matched first, which causes data redundancy.
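As a rough illustration of the MFB pooling revisited above, the following NumPy sketch projects two feature vectors, takes their Hadamard product, and sums over non-overlapping windows. It assumes that the window size equals the latent dimensionality (called `d` here), as in the original MFB formulation [27]; the sizes `d` and `o`, the random weights, and the name `mfb_pool` are illustrative assumptions rather than settings reported in this paper.

```python
import numpy as np


def mfb_pool(x1, x2, U, V, d):
    """Factorized bilinear pooling of x1 (I,) and x2 (J,).

    U has shape (I, d*o) and V has shape (J, d*o); columns i*d:(i+1)*d
    hold the rank-d factors U_i and V_i of the i-th projection matrix.
    """
    joint = (U.T @ x1) * (V.T @ x2)          # Hadamard product, length d*o
    return joint.reshape(-1, d).sum(axis=1)  # sum over windows of size d


rng = np.random.default_rng(0)
I, J, d, o = 300, 74, 4, 8                   # d and o are illustrative choices
x1, x2 = rng.standard_normal(I), rng.standard_normal(J)
U = rng.standard_normal((I, d * o))
V = rng.standard_normal((J, d * o))
z = mfb_pool(x1, x2, U, V, d)

# Each output equals the bilinear form x1^T (U_i V_i^T) x2.
i = 3
W_i = U[:, i * d:(i + 1) * d] @ V[:, i * d:(i + 1) * d].T
print(z.shape, np.isclose(z[i], x1 @ W_i @ x2))

# Parameter count: full projection tensor versus the two factor matrices.
full_params = I * J * o           # 177600
factor_params = (I + J) * d * o   # 11968
print(full_params, factor_params)
```

Note that both factor matrices must share the same inner dimension `d`, which is the dimension-matching constraint that the STP-based variant is meant to relax.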

Multi-Modal, Factorized Bilinear Pooling Based on STP
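In order to solve the above problems, we propose a multi-modal, factorized bilinear pooling method based on the STP. We factorize the projection matrix in Equation (5) by Definition 2 as follows:
$$W_i = U_i \ltimes V_i^{T}, \tag{9}$$
where $W_i \in \mathbb{R}^{I \times J}$, $U_i \in \mathbb{R}^{p \times m}$, and $V_i \in \mathbb{R}^{q \times n}$. The variable $t = \mathrm{lcm}(m, n)$ is the least common multiple of $m$ and $n$, $I = p \cdot t/m$, and $J = q \cdot t/n$. According to Equation (9), Equation (4) can be rewritten as
$$y_i = x_1^{T} \left( U_i \ltimes V_i^{T} \right) x_2, \tag{10}$$
$$y = [y_1, \ldots, y_o]^{T}, \tag{11}$$
where $y \in \mathbb{R}^{o}$. We reshape the tensors $U = [U_1, \ldots, U_o] \in \mathbb{R}^{p \times m \times o}$ and $V = [V_1, \ldots, V_o] \in \mathbb{R}^{q \times n \times o}$ as 2D matrices $\tilde{U} \in \mathbb{R}^{p \times mo}$ and $\tilde{V} \in \mathbb{R}^{q \times no}$, respectively. We define the projections of feature vectors $x_1$ and $x_2$ onto these matrices as $\tilde{x}_1 = \tilde{U}^{T} x_1$ and $\tilde{x}_2 = \tilde{V}^{T} x_2$. Similarly, Equation (11) can be rewritten as
$$y = \tilde{x}_1 \ltimes \tilde{x}_2. \tag{12}$$
In addition, we propose a pooling method based on the STP (STP-pooling). The main function of pooling is to reduce the dimensionality, which is achieved through the multiple-dimension relation of the STP.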
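To make the dimension bookkeeping of the STP-based factorization concrete, the sketch below builds one projection matrix as the left STP of a $p \times m$ factor and the transpose of a $q \times n$ factor, and compares the number of stored parameters. The factor sizes are hypothetical, chosen only so that the result has the shape 300 × 74 (the text and audio feature sizes used later in the experiments); `left_stp` is our own helper name, not the authors' code.

```python
import math

import numpy as np


def left_stp(A, B):
    """Left semi-tensor product (A kron I_{t/n}) @ (B kron I_{t/p})."""
    n, p = A.shape[1], B.shape[0]
    t = n * p // math.gcd(n, p)  # least common multiple of n and p
    return np.kron(A, np.eye(t // n)) @ np.kron(B, np.eye(t // p))


rng = np.random.default_rng(0)
p, m, q, n = 100, 2, 37, 3       # hypothetical factor sizes; lcm(m, n) = 6
U_i = rng.standard_normal((p, m))
V_i = rng.standard_normal((q, n))

# U_i has m columns and V_i^T has n rows, so no dimension matching is needed.
W_i = left_stp(U_i, V_i.T)
print(W_i.shape)                 # (p * t / m, q * t / n) = (300, 74)

# Bilinear output for one pair of features of sizes 300 and 74.
x1 = rng.standard_normal(W_i.shape[0])
x2 = rng.standard_normal(W_i.shape[1])
y_i = x1 @ W_i @ x2

stp_params = U_i.size + V_i.size  # 200 + 111 = 311
full_params = W_i.size            # 300 * 74 = 22200
print(stp_params, full_params)
```

In this toy setting the two factors store 311 numbers instead of the 22,200 entries of the full projection matrix, at the cost of restricting the matrix to the Kronecker-structured form.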
Let $y \in \mathbb{R}^{1 \times nh}$ and $w_0 \in \mathbb{R}^{h}$. We can split $y$ into $h$ equal-sized blocks $y_1, y_2, \ldots, y_h \in \mathbb{R}^{1 \times n}$. Then, by Definition 3, the semi-tensor product can be represented as follows:
$$y \ltimes w_0 = \sum_{i=1}^{h} y_i w_{0,i} \in \mathbb{R}^{1 \times n}, \tag{13}$$
where $w_{0,i}$ is the $i$-th entry of $w_0$. We get the final (fused) vector $z$ by computing the STP with a non-overlapping window of size $h$ over the vector $y$:
$$z = \text{STP-pooling}(y \ltimes w_0, h). \tag{14}$$
In summary, the method proposed in this section removes the dimension-matching conditions of matrix multiplication and easily achieves information fusion between modalities with different scales and dimensions.
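One possible reading of STP-pooling is sketched below: a vector of length $nh$ is reduced to length $n$ by the STP with a weight vector of length $h$, i.e., a weighted sum of $h$ contiguous blocks. This interpretation, the function name `stp_pooling`, and the example values are assumptions made for illustration; the paper's exact windowing may differ.

```python
import numpy as np


def stp_pooling(y, w0):
    """Reduce y (length n*h) to length n via the STP with w0 (length h)."""
    y = np.asarray(y, dtype=float).ravel()
    w0 = np.asarray(w0, dtype=float).ravel()
    h = w0.size
    if y.size % h != 0:
        raise ValueError("len(y) must be a multiple of len(w0)")
    return w0 @ y.reshape(h, -1)  # sum_i w0[i] * y_i over the h blocks


y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # n = 3, h = 2
z = stp_pooling(y, [0.5, 2.0])
print(z)       # [ 8.5 11.  13.5]

# With all-one weights the operation reduces to a plain sum of the blocks.
z_sum = stp_pooling(y, np.ones(2))
print(z_sum)   # [5. 7. 9.]
```

Unlike plain sum pooling, the weights in `w0` can be learned, so the reduction itself becomes trainable.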

Parallel, Multi-Modal, Factorized Bilinear Pooling Based on STP (PFBP-STP)
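Based on the above method, we arbitrarily merge the text, video, and audio modalities at different scales to achieve multi-modal fusion that overcomes the dimensional limitation in our task. The three modalities are represented as $x_1 \in \mathbb{R}^{I}$, $x_2 \in \mathbb{R}^{J}$, and $x_3 \in \mathbb{R}^{K}$, respectively, and the fused features are denoted as $z_{12} \in \mathbb{R}^{o}$ and $z_{13} \in \mathbb{R}^{o}$, which represent the fusion results of $x_1$ with $x_2$ and of $x_1$ with $x_3$, respectively. Equation (11) can be rewritten as:
$$y_{12} = \left[ x_2^{T} \left( U_i \ltimes V_i^{T} \right) x_1 \right]_{i=1}^{o}, \tag{15}$$
$$y_{13} = \left[ x_3^{T} \left( \bar{U}_i \ltimes \bar{V}_i^{T} \right) x_1 \right]_{i=1}^{o}. \tag{16}$$
We reshape the tensors $U \in \mathbb{R}^{p \times m \times o}$, $V \in \mathbb{R}^{q \times n \times o}$, $\bar{U} \in \mathbb{R}^{\bar{p} \times m \times o}$, and $\bar{V} \in \mathbb{R}^{\bar{q} \times n \times o}$ as 2D matrices $\tilde{U} \in \mathbb{R}^{p \times mo}$, $\tilde{V} \in \mathbb{R}^{q \times no}$, $\bar{U} \in \mathbb{R}^{\bar{p} \times mo}$, and $\bar{V} \in \mathbb{R}^{\bar{q} \times no}$, respectively.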
Let us define the projections of feature vectors $x_2$ and $x_3$ onto matrices $\tilde{U}$ and $\bar{U}$ as $\tilde{x}_2 = \tilde{U}^{T} x_2$ and $\tilde{x}_3 = \bar{U}^{T} x_3$; meanwhile, we perform the projection of the feature vector $x_1$ onto matrices $\tilde{V} \in \mathbb{R}^{q \times no}$ and $\bar{V} \in \mathbb{R}^{\bar{q} \times no}$ as $\tilde{x}_1 = \tilde{V}^{T} x_1$ and $\bar{x}_1 = \bar{V}^{T} x_1$, respectively. Similarly, we get $y_{12} = \tilde{x}_2 \ltimes \tilde{x}_1$ and $y_{13} = \tilde{x}_3 \ltimes \bar{x}_1$. According to Equations (13) and (14), we get the final (fused) vectors $z_{12}$ and $z_{13}$ as follows:
$$z_{12} = \text{STP-pooling}(y_{12} \ltimes w_0, h), \tag{17}$$
$$z_{13} = \text{STP-pooling}(y_{13} \ltimes w_0, h). \tag{18}$$
The final (fused) vectors $z_{12}$ and $z_{13}$ in Equations (17) and (18) are then fused via soft fusion at the decision-level stage to further improve the results. The weighted combination of the scores of the two fused groups is given by
$$S_z(c) = W_{12} S_{12}(c) + W_{13} S_{13}(c), \tag{19}$$
where $W_{12}$ and $W_{13}$ are the weights of $z_{12}$ and $z_{13}$. We set the same weight $W_{12} = W_{13}$ at the initial time. $S_{12}(c)$ and $S_{13}(c)$ represent the scores of $z_{12}$ and $z_{13}$ for the prediction of class $c$, and $S_z(c)$ stands for the final classification result. Algorithm 1 shows the process of PFBP-STP.
In this paper, we use the parallel, multi-modal, factorized, bilinear pooling method based on the STP to fuse $x_1$ with $x_2$ and with $x_3$ separately. This not only realizes the fusion of information with different scales and dimensions, but also avoids the exponential growth in parameters, and the resulting risk of overfitting, that arises when three modalities are fused simultaneously. The method also combines the scores from the separately fused modalities and generates a new prediction label by applying soft fusion at the decision-level fusion stage, which further improves the results for the emotion-recognition task.
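The decision-level soft fusion described above is a weighted sum of the class scores produced by the two fused branches. A minimal sketch follows; the score values and the equal weights of 0.5 are illustrative assumptions (the text only states that the two weights are equal initially), and `soft_fusion` is our own helper name.

```python
import numpy as np


def soft_fusion(s12, s13, w12=0.5, w13=0.5):
    """Weighted combination of per-class scores from the two fused branches."""
    return w12 * np.asarray(s12, dtype=float) + w13 * np.asarray(s13, dtype=float)


# Hypothetical class scores from the (x1, x2) and (x1, x3) branches.
s12 = np.array([0.7, 0.2, 0.1])
s13 = np.array([0.4, 0.5, 0.1])

s_z = soft_fusion(s12, s13)
predicted_class = int(np.argmax(s_z))
print(s_z, predicted_class)
```

In this example the second branch alone would pick class 1, but the combined score still favors class 0, which is the kind of correction soft fusion is intended to provide.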

Experimental
In this section, we present various experiments to evaluate the characteristics of PFBP-STP and to support the following research claims: (1) Comparison with the state of the art: We conducted experiments with PFBP-STP and state-of-the-art methods for an emotion-recognition task on the IEMOCAP and CMU-MOSI datasets; (2) The advantage of the PFBP-STP: It makes information fusion independent of the dimension-matching conditions in matrix multiplication by replacing matrix products with semi-tensor products; (3) Complexity analysis: We evaluate the speed and the number of learned parameters of the method by comparing them with those of other methods.

Datasets
The proposed method was analyzed on the IEMOCAP [38] and CMU-MOSI [24] multi-modal datasets for emotion recognition.
The IEMOCAP dataset is designed for classifying emotions from displays such as voice and gesture during human interactions. It is an acted, multi-modal, and multi-speaker database. It contains approximately 12 h of recordings in 302 videos. Each segment is annotated with one of nine emotions: happy, angry, sad, excited, surprised, fear, neutral, frustrated, and disappointed. Ten actors performed three selected plays with clear emotional content. In addition to the scripts, subjects were asked to improvise conversations in hypothetical situations designed to elicit specific emotions (happy, angry, sad, frustrated, and neutral states). The detailed motion-capture information, the interaction configurations that elicit real emotions, and the size of the database make this corpus a valuable addition to existing databases for studying and modeling multi-modal and expressive human communication.
The CMU-MOSI dataset is an opinion-level annotated corpus for sentiment and subjectivity analysis of online videos such as YouTube videos. It includes 93 videos with comments. Each video contains multiple opinion clips with sentiment annotations in the range [−3, 3]. The two endpoints represent highly negative and highly positive opinions, respectively. For each clip, an annotator was given eight choices: highly negative (labeled as −3), negative (−2), weakly negative (−1), neutral (0), weakly positive (+1), positive (+2), highly positive (+3), and "uncertain" for ambiguous situations. The corpus provides rigorous labels for sentiment intensity and subjectivity, per-frame visual features, and per-millisecond audio features.
Both datasets contain multi-modal information, and each was divided into a training set, a validation set, and a test set to evaluate the generalization ability of the proposed model. We ensured that no speaker appears in both the training set and the test set. The split of the data into the three sets is shown in Table 1.

Multi-Modal Data Features
The IEMOCAP dataset consists of three modalities, i.e., text, video, and audio. The unimodal features are extracted with GloVe (global vectors for word representation) [39], Facet, and COVAREP [40], respectively.
Text features are extracted with GloVe, an unsupervised learning algorithm that converts each word into a vector representation. For every input in the dataset above, each text embedding extracted by GloVe has 300 dimensions.
Audio features are extracted with COVAREP, a collaborative and freely available library of speech-processing algorithms. COVAREP provides low-level acoustic features, including 12 Mel-frequency cepstral coefficients, pitch tracking, glottal source parameters, peak slope parameters, etc. Each audio feature was extracted with a 5 ms shift on a 25 ms frame, and each audio feature vector has 74 dimensions.
Video features consist of 35 facial action units extracted by Facet from each video frame. Facet is widely used to extract facial features related to basic and advanced emotions. Thus, for each dataset, each video feature vector has 35 dimensions.

Baseline
We chose early fusion LSTM (EF-LSTM) [14], late fusion LSTM (LF-LSTM) [14], the multi-modal transformer (MulT) [14], and the low-rank fusion-based transformer for multi-modal sequences (LMF-MulT) [15] as baselines. The MulT utilizes the low-rank representation of multi-modal sequences in the multi-modal transformer to pay cross-modal attention to modalities or fused signals. The LMF-MulT is built upon the MulT and applies transformers to fused multi-modal signals, aiming to capture all inter-modal signals through low-rank multi-modal fusion (LMF).

Evaluation Metrics
In our experiments, multiple assessment tasks were performed, including regression and classification. The regression task was applied to CMU-MOSI. For classification, we used the accuracy Acc-k (where k represents the number of classes) and the F1-score as the evaluation metrics. For CMU-MOSI specifically, we used the 7-class accuracy (Acc-7), which covers the seven sentiment scores, together with the mean absolute error (MAE) and the correlation (Corr) between the predicted results and the ground-truth labels. The F1-score is the harmonic mean of precision and recall:
$$\text{F1-score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$$
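The metrics named above can be computed as in the following sketch. The helper names and the toy labels are illustrative. In particular, computing the 7-class accuracy by rounding continuous sentiment scores to the nearest integer in [−3, 3] is an assumption about the evaluation protocol, not something stated in the text.

```python
import numpy as np


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mae(y_true, y_pred):
    """Mean absolute error between labels and predictions."""
    return float(np.mean(np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float))))


def corr(y_true, y_pred):
    """Pearson correlation between labels and predictions."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])


def acc7(y_true, y_pred):
    """7-class accuracy; assumes scores are rounded and clipped to [-3, 3]."""
    t = np.clip(np.round(y_true), -3, 3)
    p = np.clip(np.round(y_pred), -3, 3)
    return float(np.mean(t == p))


y_true = [-3.0, 0.0, 2.2, 1.0]   # toy sentiment labels
y_pred = [-2.6, 0.4, 1.4, 3.9]   # toy model outputs

print(f1_score(0.5, 1.0))        # 2 * 0.5 / 1.5
print(mae(y_true, y_pred), corr(y_true, y_pred), acc7(y_true, y_pred))
```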

Training Setup
Our method was implemented in the open-source PyTorch framework. The hyperparameters were selected by grid search based on the performance of the model on the validation set. The sizes of the video, text, and audio features were set to 35, 300, and 74, respectively. The remaining training parameters were configured as in the literature: the dropout rate was 0.2, and L2 weight normalization was used with a norm coefficient of 0.01. The Adam optimizer was employed with a learning rate of 0.0003, and the batch size was 32.

Results and Discussion
We present and discuss the experimental results in this section.

Comparison with the State-of-the-Art
We compared the performance of our model with those of the above baselines. The experimental results on IEMOCAP and CMU-MOSI are presented in Tables 2 and 3, respectively. According to Table 2, the Acc values of the proposed method for the happy, sad, and angry emotions were 85.7, 79.5, and 75.9, which are higher than the values of the four baselines. These observations indicate the necessity and effectiveness of applying the STP in multi-modal fusion. Due to its multiple-dimension relation, the STP is a block-wise operation, unlike other mechanisms that use point-wise operations. It keeps the temporal and spatial information of the video, audio, and text, which allows for a better representation of intra-modality correlations and improves the fusion performance. Table 3 shows that the Corr of the proposed method was 0.683 and its accuracy in the 7-class test was 34.5%, so it outperformed the baselines. Relative to the baselines, the improvement of the proposed method was larger on IEMOCAP than on CMU-MOSI.

Ablation Experiment
We chose bimodal or trimodal audio, video, and text as the input for emotion prediction. The bimodal and trimodal results are presented in Table 4.
For bimodal data, the experimental results in Table 4 demonstrate that the t+v combination outperformed the other two bimodal combinations (a+t, a+v). This is because the audio inevitably contains some noise, which increases the difficulty of recognizing emotion from speech. Compared with the audio, the text tends to be less noisy. Hence, we can learn more emotion-salient representations using text features.
Meanwhile, compared with bimodal inputs, the proposed method achieves better performance on trimodal inputs. Due to the complexity of emotion recognition, we can achieve better recognition performance by integrating multi-modal information.

Evaluation Indicators
Each modality was taken in turn as the central modality and fused with the other two modalities, and the fusion results were compared under the three-layer framework. In this experiment, the three modalities were combined and fused in arbitrary order, which demonstrates that fusion based on the semi-tensor product is independent of the dimension- and scale-matching conditions that matrix multiplication imposes.
We used the IEMOCAP dataset and the accuracy metrics to verify the performance of our method.
From Figure 2, we can see that the accuracy grows with each iteration during the first seven epochs, which indicates that the model converges quickly. In Figure 3, we can see that our model learns efficiently and quickly in training: the loss changes swiftly in the first five iterations and then flattens in the following iterations. According to the accuracy values in Table 2, the happy and angry emotions again had better recognition scores, whereas neutral had a worse recognition score during the training process; see Figures 2 and 3. The main reason is that the model learned less effectively from the samples with neutral features in IEMOCAP.

Computational Complexity
In order to evaluate the computational complexity of our method, we compared the number of parameters and the training speed of our method with those of MulT and LMF-MulT. The results are shown in Tables 5 and 6. In Table 5, we can observe that our model contained about 5.5 × 10^6 parameters, whereas MulT contained about 10.7 × 10^6 parameters, which is almost twice as many. The experimental results show that the proposed method used less running time and fewer trainable parameters than the other two models while achieving better performance. These models were all implemented in the same environment. Based on the results in Table 6, detailed analysis showed that the parameters are fewer and the running time is reduced, yet the performance is improved. This is due to the introduction of matrix decomposition based on the STP, which eliminates the data redundancy caused by the need for dimension matching in matrix factorization and allows the three modalities of different dimensions to be arbitrarily combined and fused.

Conclusions
In this paper, a parallel, multi-modal, factorized, bilinear pooling method based on the semi-tensor product (PFBP-STP) is proposed, which achieves information fusion between modalities of different scales and dimensions. By replacing matrix products with STP, the information fusion becomes independent of the dimension-matching conditions in matrix multiplication.
Experiments have shown that the proposed method can achieve a significant increase in training speed and better classification accuracy simultaneously. The proposed method removes the dimensional consistency limitation of matrix multiplication and expresses the same information in a more compact structure that employs less memory. It is computationally friendly and flexible.

Conflicts of Interest:
The authors declare no conflict of interest.
