A Multi-Modal Fusion Method Based on Higher-Order Orthogonal Iteration Decomposition

Multi-modal fusion can achieve better predictions by combining information from different modalities. To improve prediction accuracy, a method based on Higher-order Orthogonal Iteration Decomposition and Projection (HOIDP) is proposed. In the fusion process, the higher-order orthogonal iteration decomposition algorithm and factor matrix projection are used to remove redundant information duplicated across modalities and to produce fewer parameters with minimal information loss. The performance of the proposed method is verified on three different multi-modal datasets. The numerical results show that, compared with five other methods, the proposed method achieves an improvement of 0.4% to 4% in sentiment analysis, 0.3% to 8% in personality trait recognition, and 0.2% to 25% in emotion recognition.


Introduction
Multi-modal fusion has become an active topic in artificial intelligence. It integrates information from multiple modalities and is therefore expected to yield better predictions than any single modality alone [1]. It has been applied in a broad range of applications, such as multimedia event detection [2,3], sentiment analysis [1,4], cross-modal translation [5-7], and Visual Question Answering (VQA) [8,9].
Multi-modal fusion techniques are typically divided into three approaches: early fusion [10], late fusion [11], and hybrid fusion [12]. The early fusion approach extracts feature representations from each modality and then fuses them at the feature level [10]; it is more suitable for sentiment analysis. In contrast, the late fusion approach first trains a separate model for each modality and then merges their outputs at the decision level [13]; it performs well in emotion recognition. To take advantage of both solutions, the hybrid fusion approach was subsequently proposed [14]. Most of these methods integrate the information in simple and straightforward ways, e.g., by merely concatenating or averaging the multi-modal vectors, and therefore cannot exploit the interrelationships among the modalities [15].
Recently, many researchers have leveraged tensor product representations to model rich dynamic interactions, both intra-modal and inter-modal, directly in order to boost performance [1,15-18]. Zadeh et al. [16] proposed the tensor fusion network (TFN), which calculates the interactions between different modalities by the outer product of their representations. Unfortunately, such representations suffer from exponential growth in feature dimensions, resulting in a costly training process. To tackle this problem, an efficient low-rank decomposition method (LMF) was proposed [17], which yields low-rank tensor factors and much lower computational complexity while preserving the capacity to express the interactions between modalities. However, the method is still prone to parameter explosion once the features become too long. It also ignores the local dynamics of interactions that are crucial to the final prediction [15].
Motivated by these problems, in this paper we apply higher-order orthogonal iteration decomposition and projection to multi-modal fusion. This approach preserves the local dynamics of interactions at reasonable computational and memory costs [19,20].
The main contributions of our paper are as follows: (1) A tensor fusion method for multi-modal prediction is proposed based on higher-order orthogonal iteration decomposition and projection. It removes redundant information duplicated across modalities while producing fewer parameters with minimal information loss. (2) The proposed method offers a good trade-off between the dimensionality reduction ratio and the error rate, and it keeps the new tensor as close as possible to the original tensor under maximal dimensionality reduction. (3) The performance of the proposed method has been verified through evaluation on three commonly available multi-modal datasets.

Relevant Mathematical Notations
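To make the following algorithm description clearer, some tensor-related notations and operations are given first:
• $\mathcal{T}$: a tensor, denoting a higher-order extension of vectors and matrices in this paper.
• $T_{(n)}$: the n-mode unfolded matrix of a tensor.
• $\|\mathcal{T}\|_F$: the Frobenius norm of a tensor.
• $\times_n$: the n-mode product of a tensor with a matrix.
• $\otimes$: the Kronecker product.
Matricization: also known as unfolding or flattening, this is the process of reordering the elements of an N-way array into a matrix. The n-mode matricization of a tensor $\mathcal{T} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is denoted by $T_{(n)} \in \mathbb{R}^{I_n \times (I_1 I_2 \cdots I_{n-1} I_{n+1} \cdots I_N)}$. It arranges the n-mode fibers to be the columns of the resulting matrix [21], as shown in Figure 1.
Tensor Multiplication: The n-mode product of a tensor $\mathcal{T} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ with a matrix $U^{(n)} \in \mathbb{R}^{J_n \times I_n}$ is denoted by $\mathcal{T} \times_n U^{(n)}$ and is of size $I_1 \times \cdots \times I_{n-1} \times J_n \times I_{n+1} \times \cdots \times I_N$. Elementwise,
$$(\mathcal{T} \times_n U^{(n)})_{i_1 \cdots i_{n-1} j_n i_{n+1} \cdots i_N} = \sum_{i_n=1}^{I_n} t_{i_1 i_2 \cdots i_N} \, u_{j_n i_n} \quad (1)$$
Singular Value Decomposition (SVD): A real matrix $A \in \mathbb{R}^{m \times m}$ can be expressed as the product
$$A = U \Sigma V^{T} \quad (2)$$
where $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix.
Tucker's Tensor Decomposition (Tucker decomposition): The Tucker decomposition is a higher-order form of the SVD. It approximates a tensor $\mathcal{T} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ with a core tensor $\mathcal{G} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_N}$ and N factor matrices $U^{(n)} \in \mathbb{R}^{I_n \times J_n}$ (n = 1, 2, . . ., N):
$$\mathcal{T} \approx \mathcal{G} \times_1 U^{(1)} \times_2 U^{(2)} \cdots \times_N U^{(N)}, \quad \left\| \mathcal{T} - \mathcal{G} \times_1 U^{(1)} \times_2 U^{(2)} \cdots \times_N U^{(N)} \right\|_F < \varepsilon \quad (3)$$
where ε denotes an arbitrarily small positive real number.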
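The unfolding and the n-mode product can be illustrated with a minimal NumPy sketch. The helper names `unfold` and `mode_product` are hypothetical and introduced only for illustration; the column ordering of the unfolded matrix is a convention and may differ from the one drawn in Figure 1.

```python
import numpy as np

def unfold(tensor, mode):
    """n-mode matricization: the mode-n fibers become the columns."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """n-mode product T x_n U for a matrix U of shape (J_n, I_n)."""
    moved = np.moveaxis(tensor, mode, 0)                # shape (I_n, ...)
    result = np.tensordot(matrix, moved, axes=(1, 0))   # shape (J_n, ...)
    return np.moveaxis(result, 0, mode)

# Toy example: a 2 x 3 x 4 tensor multiplied along mode 1 (0-based) by a 5 x 3 matrix.
T = np.arange(24, dtype=float).reshape(2, 3, 4)
U = np.ones((5, 3))
Y = mode_product(T, U, mode=1)

print(unfold(T, 1).shape)  # (3, 8)
print(Y.shape)             # (2, 5, 4)
```

With this convention the identity `unfold(Y, n) == U @ unfold(T, n)` holds, which is the matrix form of the n-mode product.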

Methodology
In this section, a multi-modal fusion method based on Higher-order Orthogonal Iteration Decomposition and Projection (HOIDP) is proposed. Similar to many other multi-modal prediction methods, the new method consists of three stages: feature extraction and multi-modal fusion, network model training, and prediction generation. The main contribution of this paper lies in the first stage. In other words, it is an early fusion method.
As shown in Figure 2, three modalities, i.e., the audio, text, and video inputs, are used in the presentation of our algorithm as well as in the following experiments. First, we obtain the three unimodal representations $I_1$, $I_2$, and $I_3$, which are the outputs of the three sub-embedding networks $f_a$, $f_l$, and $f_v$ that take the audio, text, and video unimodal features as inputs, respectively. Second, we combine these unimodal representations into a tensor $\mathcal{T}$ using the Kronecker product and then perform higher-order orthogonal iteration decomposition and projection to obtain the tensor $\mathcal{Z}$. Finally, we feed the feature tensor $\mathcal{Z}$ into a deep neural network to generate the predictions. The detailed algorithm is introduced in the following subsections.

Multi-Modal Fusion Based on Tensor Representation
Tensor representation is an effective approach for multi-modal fusion. We define the N modalities as $T_1, T_2, \ldots, T_N$, which are column vectors of sizes $I_1, I_2, \ldots, I_N$. An N-modal tensor fusion is represented by the Kronecker product as
$$\mathcal{T} = T_1 \otimes T_2 \otimes \cdots \otimes T_N \quad (4)$$
Equation (4) can capture multi-modal interactions effectively. The input tensor $\mathcal{T} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ then goes through a linear layer $f(\cdot)$ to produce a vector representation $h$, as shown in Equation (5).
$$h = f(\mathcal{T}; W, b) = W \cdot \mathcal{T} + b \quad (5)$$
where $f(\cdot)$ is a fully connected deep neural network, $W$ is the weight, and $b$ is the bias. The weight $W$ is conditioned on the feature tensor $\mathcal{T}$. Since the tensor $\mathcal{T}$ is high-dimensional and leads to increased computational complexity, a higher-order orthogonal iteration decomposition is proposed in the following subsection to improve performance and to reduce data redundancy and parameter complexity.
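To make the growth in parameters concrete, the following sketch builds a three-modal fusion tensor and applies a single linear layer to it. It is an illustration under stated assumptions rather than the implementation used in the experiments: the feature sizes (4, 6, 5), the output size 3, and the random values are hypothetical, and the outer product is used because it contains the same entries as the Kronecker product of the vectors reshaped into a 3-way array.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unimodal representations (audio, language, video).
a = rng.normal(size=4)
l = rng.normal(size=6)
v = rng.normal(size=5)

# 3-way fusion tensor: entry (i, j, k) equals a[i] * l[j] * v[k].
T = np.einsum("i,j,k->ijk", a, l, v)

# The nested Kronecker product holds the same entries in flattened form.
T_kron = np.kron(np.kron(a, l), v).reshape(4, 6, 5)

# Linear layer h = W . T + b on the flattened tensor (hypothetical output size 3).
out_dim = 3
W = rng.normal(size=(out_dim, T.size))
b = np.zeros(out_dim)
h = W @ T.reshape(-1) + b

# The weight count is out_dim * I1 * I2 * I3, so it grows multiplicatively
# with the length of every unimodal representation.
print(T.shape, W.size)
```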

Higher-Order Orthogonal Iteration Decomposition
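We use the Tucker decomposition to decompose the N-order tensor $\mathcal{T} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ using (3). The core tensor and the factor matrices can be obtained by solving the following optimization problem:
$$\min_{\mathcal{G},\, U^{(1)}, \ldots, U^{(N)}} \left\| \mathcal{T} - \mathcal{G} \times_1 U^{(1)} \times_2 U^{(2)} \cdots \times_N U^{(N)} \right\|_F \quad (6)$$
We adopt a higher-order orthogonal iteration decomposition algorithm to solve this optimization problem and obtain the core tensor $\mathcal{G}$ and the factor matrices $U^{(n)}$. The core process consists of the following steps:
Step 1: Compute the n-mode unfolded matrix $T_{(n)}$ (n = 1, 2, 3, . . . , N) of the tensor $\mathcal{T}$ and carry out the singular value decomposition of each unfolded matrix to obtain $T_{(n)} = U^{(n)} D^{(n)} V^{(n)T}$. Let the left singular matrix $U^{(n)}$ (n = 1, 2, 3, . . . , N) be the initial factor matrix $U^{(n)}_{k}$ (n = 1, 2, 3, . . . , N; k = 0).
Step 2: Set k = k + 1 and, for n = 1, 2, . . . , N, perform the operations
$$\mathcal{Y}^{(n)} = \mathcal{T} \times_1 U^{(1)T} \cdots \times_{n-1} U^{(n-1)T} \times_{n+1} U^{(n+1)T} \cdots \times_N U^{(N)T},$$
where the most recent estimate of each factor matrix is used, and set $U^{(n)}_{k}$ to the leading left singular vectors of the n-mode unfolding of $\mathcal{Y}^{(n)}$.
Step 3: Calculate the core tensor of the k-th iteration by using the factor matrices:
$$\mathcal{G}_{k} = \mathcal{T} \times_1 U^{(1)T}_{k} \times_2 U^{(2)T}_{k} \cdots \times_N U^{(N)T}_{k}.$$
The core tensor of each iteration is calculated until the convergence condition is satisfied.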
Algorithm 1 shows the process.

Algorithm 1
The higher-order orthogonal iterative decomposition algorithm.
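Input: the N-order tensor $\mathcal{T}$;
Output: the core tensor $\mathcal{G}$ and the factor matrices $U^{(n)}$;
1: Initialize the factor matrices $U^{(n)}_{0}$ (n = 1, 2, . . . , N) from the SVD of $T_{(n)}$ and set k = 0;
2: Set k = k + 1 and update the factor matrices $U^{(n)}_{k}$ (n = 1, 2, . . . , N);
3: Compute the core tensor of the k-th iteration: $\mathcal{G}_{k} = \mathcal{T} \times_1 U^{(1)T}_{k} \times_2 U^{(2)T}_{k} \cdots \times_N U^{(N)T}_{k}$;
4: if the convergence condition is not satisfied then
5: Go to Step 2;
6: else
7: Return $\mathcal{G}$, $U^{(n)}$;
8: end if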
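The iteration can be sketched in NumPy as follows. This is a minimal illustration of the standard higher-order orthogonal iteration, not the authors' code: the function names, the target ranks, and the stopping rule (change in the Frobenius norm of the core tensor below a tolerance `tol`) are assumptions made for the example.

```python
import numpy as np

def unfold(tensor, mode):
    """n-mode matricization (mode-n fibers as columns)."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """n-mode product of a tensor with a matrix of shape (J_n, I_n)."""
    moved = np.moveaxis(tensor, mode, 0)
    return np.moveaxis(np.tensordot(matrix, moved, axes=(1, 0)), 0, mode)

def project(tensor, factors, skip=None):
    """Multiply by the transposed factor matrix in every mode except `skip`."""
    out = tensor
    for n, U in enumerate(factors):
        if n != skip:
            out = mode_product(out, U.T, n)
    return out

def leading_left_vectors(matrix, rank):
    """First `rank` left singular vectors of a matrix."""
    return np.linalg.svd(matrix, full_matrices=False)[0][:, :rank]

def hooi(tensor, ranks, max_iter=50, tol=1e-8):
    # Step 1: initialise each factor from the SVD of the n-mode unfolding.
    factors = [leading_left_vectors(unfold(tensor, n), r)
               for n, r in enumerate(ranks)]
    core = project(tensor, factors)
    prev_norm = np.linalg.norm(core)
    for _ in range(max_iter):
        # Step 2: update one factor at a time, keeping the others fixed.
        for n, r in enumerate(ranks):
            Y = project(tensor, factors, skip=n)
            factors[n] = leading_left_vectors(unfold(Y, n), r)
        # Step 3: core tensor of the current iteration.
        core = project(tensor, factors)
        norm = np.linalg.norm(core)
        if abs(norm - prev_norm) < tol:   # assumed convergence condition
            break
        prev_norm = norm
    return core, factors

def reconstruct(core, factors):
    """Rebuild the approximation G x_1 U1 x_2 U2 ... x_N UN."""
    out = core
    for n, U in enumerate(factors):
        out = mode_product(out, U, n)
    return out

rng = np.random.default_rng(0)
T = rng.normal(size=(6, 5, 4))            # hypothetical input tensor
G, Us = hooi(T, ranks=(3, 3, 2))          # hypothetical target ranks
rel_err = np.linalg.norm(T - reconstruct(G, Us)) / np.linalg.norm(T)
print(G.shape, rel_err)
```

Because every factor matrix has orthonormal columns, the reconstruction is an orthogonal projection of the input, so the relative error stays below 1 and drops to numerical zero when the ranks equal the original mode sizes.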

Factor Matrix Projection
Through the above algorithm, we obtain the core tensor and the factor matrices of the tensor $\mathcal{T}$. The factor matrix $U^{(n)}$ represents the principal components of the tensor in mode n: its column vectors are the principal components in this mode, arranged in descending order of energy, i.e., of feature importance. Therefore, similar to the singular value decomposition, the original tensor $\mathcal{T} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is projected onto the first $J_1, J_2, \ldots, J_N$ columns of the respective factor matrices, as shown in (7).
$$\mathcal{Z} = \mathcal{T} \times_1 U^{(1)T}_{J_1} \times_2 U^{(2)T}_{J_2} \cdots \times_N U^{(N)T}_{J_N} \quad (7)$$
where $U^{(n)}_{J_n}$ denotes the first $J_n$ columns of $U^{(n)}$.
We obtain a new tensor $\mathcal{Z} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_N}$, which has lower dimensions in the new eigenspace than the original tensor $\mathcal{T} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$.
We then replace $\mathcal{T}$ with $\mathcal{Z}$ in Equation (5). Since the weight is conditioned on the feature tensor, we also replace $W$ with $\widetilde{W}$:
$$h = f(\mathcal{Z}; \widetilde{W}, b) = \widetilde{W} \cdot \mathcal{Z} + b$$
In practice, we flatten the tensors $\mathcal{Z}$ and $\widetilde{W}$ to reduce the last operation to a matrix multiplication.
In this paper, we consider the number of modalities to be 3. In Figure 3, the tensor $\mathcal{T} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$ is decomposed into a core tensor $\mathcal{G} \in \mathbb{R}^{R_1 \times R_2 \times R_3}$ and three factor matrices $U^{(1)} \in \mathbb{R}^{I_1 \times R_1}$, $U^{(2)} \in \mathbb{R}^{I_2 \times R_2}$, and $U^{(3)} \in \mathbb{R}^{I_3 \times R_3}$. The three factor matrices are then truncated to their first $J_1$, $J_2$, and $J_3$ columns for the projection. This process can be used for both compression and feature extraction of higher-order data.
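The projection in (7) and the flattened linear layer can be sketched as below. The sizes, the numbers of retained columns in `J`, and the output size are hypothetical, and for brevity the factor matrices are taken from the SVD of each unfolding as a stand-in for the factor matrices returned by the iterative algorithm.

```python
import numpy as np

def unfold(tensor, mode):
    """n-mode matricization (mode-n fibers as columns)."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """n-mode product of a tensor with a matrix of shape (J_n, I_n)."""
    moved = np.moveaxis(tensor, mode, 0)
    return np.moveaxis(np.tensordot(matrix, moved, axes=(1, 0)), 0, mode)

rng = np.random.default_rng(1)
T = rng.normal(size=(8, 6, 5))   # hypothetical I1 x I2 x I3 fusion tensor
J = (4, 3, 2)                    # hypothetical numbers of leading columns kept

# Stand-in factor matrices: left singular vectors of each unfolding,
# whose columns are ordered by decreasing singular value.
factors = [np.linalg.svd(unfold(T, n), full_matrices=False)[0]
           for n in range(T.ndim)]

# Project every mode onto the first J_n columns of its factor matrix.
Z = T
for n, (U, j) in enumerate(zip(factors, J)):
    Z = mode_product(Z, U[:, :j].T, n)

# Flatten Z so that the final layer is a plain matrix-vector product.
z = Z.reshape(-1)
W_tilde = rng.normal(size=(3, z.size))   # hypothetical output size 3
h = W_tilde @ z

print(Z.shape, T.size, Z.size)
```

In this toy setting the layer acts on 24 features instead of 240, which is where the reduction in parameters comes from.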

Experimental Methodology
To verify the improvement offered by the method, we compare it with DF [22], MARN [23], MFN [24], TFN [16], and LMF [17] in sentiment analysis, personality trait recognition, and emotion recognition on three different multi-modal datasets.

Datasets
Experiments were performed on three multi-modal datasets: CMU-MOSI [25], POM [26], and IEMOCAP [27]. Each dataset is composed of three modalities: language, video, and audio. CMU-MOSI is a collection of 93 opinion videos from different movie reviews. Each video consists of multiple opinion segments, and each segment is annotated with a sentiment score in the range [−3, 3], where the two endpoints represent highly negative and highly positive sentiment, respectively. POM consists of 903 review videos from different movies. Each video is annotated with the characteristics of the speaker: self-confident, enthusiastic, pleasant voice, dominant, credible, vivid, professional, entertaining, introverted, trusting, relaxed, extroverted, thorough, nervous, persuasive, and humorous. IEMOCAP contains 151 videos designed for identifying emotions expressed in human interactions through channels such as voice and gesture. It comprises approximately 12 h of audio-visual data recorded from 10 actors in two-person conversations. The actors were asked to perform three selected scripts with clear emotional content. The dataset contains 9 emotion labels, including anger, happiness, sadness, frustration, and the neutral state.
Each of the three datasets has been divided into training, validation, and test sets to evaluate the generalization of the model. The splits ensure that no speaker appears in both the training and the test sets. The data split for the three datasets is shown in Table 1.
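A speaker-independent split of this kind can be sketched as follows. The function name, the split fractions, and the toy sample list are hypothetical; the actual split sizes are those reported in Table 1.

```python
import random

def speaker_independent_split(samples, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split (speaker_id, item) pairs so that no speaker spans two subsets."""
    speakers = sorted({speaker for speaker, _ in samples})
    random.Random(seed).shuffle(speakers)
    n_train = int(len(speakers) * fractions[0])
    n_valid = int(len(speakers) * fractions[1])
    groups = {
        "train": set(speakers[:n_train]),
        "valid": set(speakers[n_train:n_train + n_valid]),
        "test": set(speakers[n_train + n_valid:]),
    }
    # Assign every sample to the subset that owns its speaker.
    return {name: [s for s in samples if s[0] in ids]
            for name, ids in groups.items()}

# Toy data: 50 clips from 10 hypothetical speakers.
samples = [(f"spk{i % 10}", i) for i in range(50)]
splits = speaker_independent_split(samples)
print({name: len(items) for name, items in splits.items()})
```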

Multimodal Data Features
Each dataset is composed of three modalities, i.e., language, video, and audio. We perform word alignment using P2FA [28] to align the modalities. The audio and video features are obtained by averaging the feature values over each word's time interval [29].
The features of each modality are extracted as follows.
• Language: pre-trained GloVe word embeddings [30] are used to embed the sequence of individual words transcribed from the video clips into a sequence of word vectors representing the spoken text.
• Visual: the Facet library is applied to extract visual features for each frame (sampled at 30 Hz), including head pose, 20 facial action units, 68 facial landmarks, gaze tracking, and HOG features [31].
• Audio: the COVAREP acoustic analysis framework [32] is applied to extract a set of low-level audio features.
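The word-level averaging of audio and video features can be illustrated with a short sketch. The function name, the half-open interval convention, and the zero vector used for words without frames are assumptions made for this example and are not taken from the alignment tools above.

```python
import numpy as np

def average_over_words(frame_times, frame_feats, word_intervals):
    """Average frame-level features inside each word's [start, end) interval."""
    frame_times = np.asarray(frame_times, dtype=float)
    frame_feats = np.asarray(frame_feats, dtype=float)
    pooled = np.zeros((len(word_intervals), frame_feats.shape[1]))
    for w, (start, end) in enumerate(word_intervals):
        mask = (frame_times >= start) & (frame_times < end)
        if mask.any():                      # words without frames stay at zero
            pooled[w] = frame_feats[mask].mean(axis=0)
    return pooled

# Toy data: five frames with two features each, and three word intervals.
times = [0.0, 0.1, 0.2, 0.3, 0.4]
feats = [[1.0, 10.0], [3.0, 20.0], [5.0, 30.0], [7.0, 40.0], [9.0, 50.0]]
words = [(0.0, 0.2), (0.2, 0.5), (0.9, 1.0)]
pooled = average_over_words(times, feats, words)
print(pooled)   # rows: [2, 15], [7, 40], [0, 0]
```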

Model Architecture
Three unimodal sub-embedding networks are used to extract representations for each modality [17]. For the visual and audio modalities, a simple 2-layer feed-forward neural network is used as the sub-embedding network. For language, we use a long short-term memory network [33] to extract representations. The model architecture is illustrated in Figure 2.
In this paper, the models are tested using the five-fold cross-validation proposed for CMU-MOSI. All experiments are performed without speaker identity information, and no speaker is repeated between the training and test sets, so that the model is universal and independent of speaker information. The hyper-parameters are chosen by grid search based on the performance of the model on the validation set. We trained our model using the Adam optimizer with a learning rate of 0.0003. The subnetworks $f_a$, $f_l$, and $f_v$ are regularized using dropout on all hidden layers with p = 0.15 and an L2 norm coefficient of 0.01. The training, validation, and test folds are the same for each of the models. The models are implemented in PyTorch.

Evaluation Metrics
Based on the provided labels, multiple evaluation tasks are performed, consisting of multi-category classification and regression. The multi-category classification task is applied to all three multi-modal datasets, and the regression task is applied to POM and CMU-MOSI. For binary and multi-category classification, the F1 score and the average accuracy (ACC) are used to represent model performance. The F1 score is the harmonic mean of precision and recall and can be expressed as
$$F1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
It has a maximum value of 1 and a minimum value of 0. For regression tasks, the mean absolute error (MAE) and the correlation (Corr) between the predicted and true scores are used to express performance. For all of these metrics except MAE, higher values indicate better performance.
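For reference, the four metrics can be computed as in the following sketch. It shows the binary F1 score only and uses hypothetical toy labels; how F1 is averaged over classes in the multi-category case is not specified here and is left out.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 score for binary labels in {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted scores."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def corr(y_true, y_pred):
    """Pearson correlation between true and predicted scores."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

# Toy classification labels: 2 true positives, 1 false positive, 1 false negative.
labels, preds = [1, 0, 1, 1, 0], [1, 1, 1, 0, 0]
f1, acc = f1_binary(labels, preds), accuracy(labels, preds)

# Toy regression scores.
scores, estimates = [1.0, -2.0, 3.0], [0.5, -1.0, 2.0]
err, r = mae(scores, estimates), corr(scores, estimates)
print(f1, acc, err, r)
```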

Experimental Results and Discussion
Based on the research questions introduced in Section 3, we present and discuss the results from the experiments in this section.

Comparison with the State-of-the-Art
In the experiments, we compared our model with five methods. Deep Fusion (DF) [22] concatenates the outputs of a deep neural model for each modality and feeds them into a joint neural network. The Multi-attention Recurrent Network (MARN) [23] uses a neural component called the Multi-attention Block (MAB), which models the interactions between modalities through time and stores them in the Long-short Term Hybrid Memory (LSTHM). The Memory Fusion Network (MFN) [24] was proposed for multi-view sequential learning. The Tensor Fusion Network (TFN) [16] combines the modalities into a tensor by computing the outer product. Low-rank Multi-modal Fusion (LMF) [17] performs tensor factorization with the same low rank for multi-modal fusion.
In Table 2, the MAE, Corr, Acc-2, Acc-7, and F1 are presented. The proposed method shows marked accuracy improvements on CMU-MOSI and POM. It is also marginally better than the LMF method in Happy and Angry recognition.

Computation Accuracy Analysis
The main function of the HOIDP method is dimensionality reduction. In this process, the original tensor is first decomposed to obtain the core tensor and the factor matrices; the core tensor is then combined with the factor matrices updated by HOIDP, which finally forms a projection of the original tensor.
We verified whether the new tensor can replace the original tensor by calculating its error rate, which is measured in terms of norms as
$$\delta = \frac{\| \mathcal{T} - \mathcal{Z} \|_F}{\| \mathcal{T} \|_F}$$
where $\| \mathcal{T} - \mathcal{Z} \|_F$ and $\| \mathcal{T} \|_F$ are Frobenius norms. Since the new tensor is composed of the core tensor and the projection of the updated factor matrices, the dimensionality reduction ratio is defined to measure the similarity between the new and the original tensor as
$$\xi = \frac{N_{nz}(\mathcal{G}) + \sum_{n=1}^{N} N_{nz}(U^{(n)})}{N_{nz}(\mathcal{T})}$$
where $N_{nz}(\cdot)$ is a function that returns the number of non-zero elements. The dimensionality reduction ratio is thus the ratio of the number of non-zero elements in the core tensor and the updated factor matrices to the number of non-zero elements in the original tensor, and it effectively represents the degree to which the dimensions are reduced.
We use (δ, ξ) to reflect the relationship between the error rate and the dimensionality reduction, which is shown in Figure 4. The abscissa is the number of iterations and the ordinate is the ratio value. We set the error rate to 0.3%, 0.7%, 1%, 1.5%, 2%, 3%, 4.1%, 6%, 8.2%, 10%, 11.9% and 14.2% successively. The larger the error rate, the greater the difference between the new and the original tensor, and the lower the similarity between them. It can be seen from Figure 4 that the lower the dimensionality reduction ratio, the higher the error rate. This means that we cannot blindly pursue a low dimension in the process of dimensionality reduction; instead, a balance between the dimensionality reduction ratio and the error rate should be sought.
In the experiments, we found that when the number of iterations of the tensor decomposition is 10, the error rate is 11.9% and the dimensionality reduction ratio is 39.5%. With this setting, the ACC is higher on the CMU-MOSI and POM datasets, as shown in Figure 5, and the prediction results are better. The values of the dimensionality reduction ratio and the error rate directly affect the accuracy of feature extraction from multi-modal data and thus the evaluation metrics.
Therefore, we should ensure that the new tensor is closest to the original tensor in case of maximum dimensionality reduction, and maintain the balance between dimensionality reduction and error according to the different requirements.
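The two quantities can be computed as in the sketch below. It is an illustration under stated assumptions: the tensor sizes and ranks are hypothetical, the factor matrices come from a truncated SVD of each unfolding rather than from the full iterative algorithm, and the error rate is evaluated on the reconstruction of the approximation in the original space.

```python
import numpy as np

def unfold(tensor, mode):
    """n-mode matricization (mode-n fibers as columns)."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """n-mode product of a tensor with a matrix of shape (J_n, I_n)."""
    moved = np.moveaxis(tensor, mode, 0)
    return np.moveaxis(np.tensordot(matrix, moved, axes=(1, 0)), 0, mode)

def error_rate(original, approximation):
    """Relative Frobenius-norm error between a tensor and its approximation."""
    return np.linalg.norm(original - approximation) / np.linalg.norm(original)

def reduction_ratio(core, factors, original):
    """Non-zeros kept (core plus factor matrices) over non-zeros in the original."""
    kept = np.count_nonzero(core) + sum(np.count_nonzero(U) for U in factors)
    return kept / np.count_nonzero(original)

rng = np.random.default_rng(2)
T = rng.normal(size=(8, 6, 5))   # hypothetical original tensor
ranks = (4, 3, 2)                # hypothetical retained columns per mode

factors = [np.linalg.svd(unfold(T, n), full_matrices=False)[0][:, :r]
           for n, r in enumerate(ranks)]

core = T
for n, U in enumerate(factors):
    core = mode_product(core, U.T, n)      # compress every mode

approx = core
for n, U in enumerate(factors):
    approx = mode_product(approx, U, n)    # map back to the original space

err = error_rate(T, approx)
ratio = reduction_ratio(core, factors, T)
print(err, ratio)
```

Lowering the ranks shrinks `ratio` and raises `err`, which is the trade-off discussed above.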
Furthermore, to evaluate the computational complexity of HOIDP, we measured its training and test speeds and compared them with those of TFN [16] and LMF [17], as shown in Table 3. Here we set the error rate to 11.9% and the dimensionality reduction ratio to 39.5%, as this setting achieves a significant increase in performance. The models are executed in the same environment. The reported values are the average number of data point inferences per second (IPS).
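A measurement of this kind can be sketched with the standard library timer. The function name, the number of repeats, and the stand-in `predict` callable are hypothetical; the sketch only illustrates how an average IPS value can be obtained and does not reproduce the measurement setup behind Table 3.

```python
import time

def inferences_per_second(predict, inputs, repeats=5):
    """Average number of data points processed per second over several runs."""
    rates = []
    for _ in range(repeats):
        start = time.perf_counter()
        for x in inputs:
            predict(x)
        elapsed = max(time.perf_counter() - start, 1e-12)  # guard against zero
        rates.append(len(inputs) / elapsed)
    return sum(rates) / len(rates)

# Stand-in model: any callable that maps one data point to a prediction.
ips = inferences_per_second(lambda x: x * x, list(range(1000)))
print(f"{ips:.1f} IPS")
```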

Conclusions
In this paper, a multi-modal fusion method based on higher-order orthogonal iterative decomposition is proposed. The method removes redundant information and leads to fewer parameters with minimal information loss. In addition, the dimensionality reduction ratio and the error rate can be traded off according to the requirements.
Experimental results show that the method improves accuracy, including Happy and Angry recognition. Compared with the other methods, it provides the same benefits as the tensor fusion method while avoiding a large number of parameters. Furthermore, the HOIDP approach is more efficient and achieves a higher dimensionality reduction effect while maintaining a lower error rate.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: