Multimodal Emotional Classiﬁcation Based on Meaningful Learning

: Emotion recognition has become one of the most researched subjects in the scientiﬁc community, especially in the human–computer interface ﬁeld. Decades of scientiﬁc research have been conducted on unimodal emotion analysis, whereas recent contributions concentrate on multimodal emotion recognition. These efforts have achieved great success in terms of accuracy in diverse areas of Deep Learning applications. To achieve better performance for multimodal emotion recognition systems, we exploit Meaningful Neural Network Effectiveness to enable emotion prediction during a conversation. Using the text and the audio modalities, we proposed feature extraction methods based on Deep Learning. Then, the bimodal modality that is created following the fusion of the text and audio features is used. The feature vectors from these three modalities are assigned to feed a Meaningful Neural Network to separately learn each characteristic. Its architecture consists of a set of neurons for each component of the input vector before combining them all together in the last layer. Our model was evaluated on a multimodal and multiparty dataset for emotion recognition in conversation MELD. The proposed approach reached an accuracy of 86.69%, which signiﬁcantly outperforms all current multimodal systems. To sum up, several evaluation techniques applied to our work demonstrate the robustness and superiority of our model over other state-of-the-art MELD models.


Introduction
Emotions can be defined as a conscious mental reaction, usually directed towards a specific object, and accompanied by dynamic physiological changes that occur nonverbally [1], allowing the identification of emotions as a very complicated task.
The process of recognizing emotions is dynamic and focuses on the individual's emotional state; thus, each person has a particular set of feelings that correlate to their actions [2]. Humans generally express their emotions in a variety of ways. The right understanding of these emotions is crucial for successful communication. However, recognizing emotions in daily life is crucial for social interaction, and emotions are a major factor in how people act [3]. Emotions are manifested through voice intonation, gestures as well as body posture, speech, and most often through facial expressions, which account for 55% of nonverbal communication. However, a single modality is unable to quickly assess the emotion of the person [4]. We cannot decide the emotion of a person by examining a particular entity or occurrence in front of our eyes [5]. This is one of the reasons why emotion recognition should be treated as a multimodal problem.
There is an integration of many approaches and strategies to achieve the goal of the study. Most of them use big data techniques [6], semantic principles [7], and deep learning [8].
A variety of various solutions to a range of multimodal sequential challenges have been developed as a result of the rapid development of DL architectures over the past decade.
The literature has demonstrated numerous methods for producing robust multimodal features using DL. from a single model of expression, such as facial expression [9], speech [10], text [11], EEG signal [12], etc. Since approaches based on DL have proven to be effective in learning and generalizing data with high dimensional feature spaces such as images, several emotion datasets have also produced promising results from comparable attempts to capture the intricate feature space of emotional data, such as MELD [13,14], IEMOCAP [15], MOSI [16], etc. Unfortunately, human emotions in real life are often expressed by a complex combination of several expression models.
So, much information is lost by employing unimodal analysis. To address this problem, the use of approaches based on DL for MER has received much research in recent years. Choi et al. [17] who utilized a novel approach to learning about the hidden representations between text and speech data using convolutional attention networks proposed end-to-end MER with auditory and visual modalities. The speech and the visual networks were built by Tzirakiz et al. [18] using a convolutional neural network (CNN) and deep residual network, respectively. To obtain the contextual information, the outputs of the two networks were combined as the input of a two-layer Long Short-Term Memory (LSTM). By feeding these features to a multiple kernel learning classifier, Poria et al. [19] proposed a novel technique for extracting features from visual and textual modalities using deep convolutional neural networks.
Inspired by these methods, we invented this work that focuses on the MER. It proposes a new system based on supervised learning algorithms, using three modalities: the first one concerns the text modality, its feature vector is extracted with CNN and LSTM; the second one depends on the audio modality that we acquired using the OpenSmile tool. While the third modality is obtained through the fusion of the text and audio features into a single vector that will represent the bimodal modality. These modalities vectors will feed the MNN that we have used as a methodological invention in order to have better predictive results. The MER by a computer is considered as a technology in full development. These innovative applications concern several domains such as human-computer interaction. Maat and Pantic's research [20], in which the authors created a system to support effective multimodal human-computer interaction, serves as an illustration of this field of study (AMM-HCI). The interaction is then modified to reflect the user's actions and feelings in order to assist the user in his activity. A further application's domain concerns the help of autistic children. In this study [21], the authors explored multimodal emotion and gaze recognition deficits in children with autism spectrum disorders. Another example in the MER domain concerns video games; Nemati et al. [22] proposed a hybrid data fusion method in latent space for MER. Furthermore, faced with the Corona pandemic, MER applications are developed. Prasad et al. [23] propose a system that analyzes feelings and emotions for effective human-machine interaction during the COVID-19 pandemic, etc.
The rest of this paper is structured as follows: In Section 2, we clarify the background and related works. In Section 3, we defined the materials and methods we used in our paper. Then, in Section 4, we describe each of the proposed methods and their steps. In Section 5, we present the results of the comparison between our approach and the existing work. Finally, a conclusion is given in Section 6.

Related Works
Numerous studies based on DL have been conducted to study MER systems, namely, Priyasad et al. [24], who present an approach based on DL to exploit and merge text and acoustic data for emotion classification. In order to extract acoustic features from raw audio, a SincNet layer, band-pass filtering, and neural network are used. The output of these band-pass filters is then applied to the input of the Deep Convolutional Neural Network (DCNN). For word processing, they use two branches: a DCNN and a bidirectional Re-current Neural Network (RNN), followed by a parallel DCNN where cross-attention is introduced to infer correlations at the N-gram level. Their experimental results on the IEMOCAP dataset achieve a 3.5% improvement in weighted accuracy. Researchers have looked into many features of multimodal autonomous human emotion recognition in the actual world [25]. In particular, Support Vector Machine (SVM) is utilized to categorize each feature, and they subsequently suggest a cutting-edge decision-level fusion network to utilize these feature characteristics. Their network was tested using the EmotiW 2015 AFEW and SFEW datasets, and it shows good results on the testing sets. The audio/visual emotion challenge (AVEC) 2012 [26], which presented baseline models for concatenating the audio and visual features into a single feature vector and used support vector regression to predict the continuous affective values, was one of the major attempts to advance the state of the art in MER. Another study was conducted on the German language for the purpose of extracting the emotions from three modalities: visual, audio, and text by Cevher et al. [27]. For extracting face expressions and audio features, they used an off-the-shelf tool. Word2vec and Bidirectional LSTM (BiLSTM) are used to extract textual features. For the purpose of predicting emotions, Georgiou et al. [28] concatenated features from several modalities at various levels. They showed that their proposed fusion method achieves greater performance gains compared to other fusion approaches in the literature. To identify the emotions, Bahreini et al. [29] suggested combining auditory, textual, and facial information. They used CNN to extract features from speech, and ResNet 50 to extract features from visual frames. These two modalities are captured using two different pipelines, valence and arousal are then extracted using an LSTM-based fusion. Their research demonstrated that merging two distinct modalities into a multimodal approach increased the software's accuracy and produced more reliable results. A method to combine speech textual content and voice tonality for identifying emotions in conversation was proposed by Poria et al. [13] Additionally, they offered their benchmark dataset for multi-party emotional conversations based on the Friend dataset. They used 1D CNN to extract textual utterance characteristics, and they computed audio utterance features using OpenSmile to extract vocal and prosodic features of the speech. Likewise, Slavova et al. [30] used textual and speech features for the purpose of extracting sound from human speech. CNN features are retrieved from a basic plain transcription of the speech, while speech features such as speech spectrum and Melfrequency cepstral coefficients (MFCCs) are extracted from the audio stream. These two teams concentrated on the transcription of voice tone for emotion recognition. In a recent study [31], visual and textual signals were used in speech emotion recognition through a hybrid fusion technique known as a multimodal attention network (MMAN). They propose a brand-new multimodal focus mechanism called cLSTM-MMA, which selectively combines information and promotes attention across three modalities. Other unimodal subnetworks are fused with the cLSTM-MMA during late fusion. The tests show that textual and visual signals are quite helpful in identifying speech emotions. While having a significantly more condensed network topology, the suggested cLSTM-MMA alone is just as successful in terms of precision as other fusion approaches. For self-supervised learning (SSL), Siriwardhana et al. [31] investigated the use of the pretrained "BERT-like" architecture to represent both language and text modalities in order to identify the multimodal language emotions. They show that a basic fusion mechanism (Shallow-Fusion) strengthens sophisticated fusion mechanisms while simplifying the overall structure. In this work [32], the authors proposed a deep hierarchical architecture for modality fusion and applied it to the problem of sentiment analysis from the audio and text modalities. Their proposed method achieves state-of-the-art results in sentiment analysis on the MOSI database. In [17], the authors adopted an attention method to learn the multi-modal representation between speech and textual modalities and used different CNNs to extract the features from embedding word sequences and speech spectrograms. For the Audio-Visual Emotion Challenge (AVEC 2017), Huang et al. [33] proposed an automatic prediction of dimensional emotional state using a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) to train several feature modalities and concatenate various feature vectors. A conversational transformer Big Data Cogn. Comput. 2022, 6, 95 4 of 21 network (CTNet) was suggested by another recent study [34] to describe the context-and speaker-sensitive dependencies in a conversation as well as the inter-and intra-modality interactions for multimodal features. The outcomes confirmed the viability of the suggested transformer fusion technique. In order to investigate the emotion representation of a query from word-and utterance-level views, Hui et al. [35] introduced a multi-view network (MVN). Their experimental findings on two datasets of conversations on public emotions demonstrate that their suggested model outperforms the leading-edge baselines. An innovative Transformers and Attention-based fusion technique was presented by the authors in [36] that combines multimodal self-supervised learning features based on text, audio, and visual input. The model is resilient and outperforms state-of-the-art models on four datasets, according to the evaluation research. Multimodal emotion identification based on audio, video, and text modalities was accomplished by Baijun et al. [37] using a transformer-based cross-modal fusion and the EmbraceNet architecture. On the MELD dataset, their experimental results show up to 65% accuracy.
Despite the improvements achieved by the researchers in the MER, we were able to use a classification network MNN, elaborated by us, which allows different modalities to be classified significantly. In fact, our network learns each vector's component separately and applies a concatenation until the fusion layer, not like the other architectures that merge the modalities without taking into consideration the different components of the resulting vector for each one. So, the accuracy is raised with 21% as the rate.

Convolutional Neural Networks
In this part, we will focus on one of the most powerful algorithms of DL, the Convolutional Neural Network or CNN, which was first introduced in 1998 by Yann LeCun [38], CNNs are a sub-category of neural networks and are currently one of the most efficient image classification models allowing, in particular, the recognition of images by automatically attributing to each image provided as input, a label corresponding to its class.
Generally, the architecture of a CNN consists of convolution layers, pooling layers, plus layers of neurons that are fully connected in the form of a Multilayer Perceptron (MLP) called fully connected layers.
Moreover, CNNs are composed of two main parts as presented in Figure 1, which are: the hidden layer part, also called feature extraction, and the classification part. In the hidden layers, the network performs a series of convolution and pooling operations during which features will be detected.
The convolution layer's job is to examine the images that are provided as input and find the existence of a particular set of features. A set of feature maps is located in the output of this layer.
The main difference between standard image processing algorithms and CNNs is that in the latter, the weights or values of the filters are learned by an optimization algo- In the hidden layers, the network performs a series of convolution and pooling operations during which features will be detected. The convolution layer's job is to examine the images that are provided as input and find the existence of a particular set of features. A set of feature maps is located in the output of this layer. (1) The main difference between standard image processing algorithms and CNNs is that in the latter, the weights or values of the filters are learned by an optimization algorithm during the training phase. This allows convolutional neural networks to learn the specific filters for each task. After applying convolution operations on the input images, the results (feature maps) will go through a non-linear activation function; for example, the ReLu function, which replaces all the negative values received as inputs with zeros. The interest of these activation layers is to make the model non-linear and therefore more complex. The ReLu function can be calculated by the following formula: Then, the results will be transferred to the pooling layer that is used, on the one hand, to reduce the number of parameters by minimizing the size of the characteristic maps and, on the other hand, to introduce translational invariance into the model.
Let A = [a i,j ] ∈ R n * m be a matrix that represents a characteristic map of a specific region, and that will be presented to the pooling layer. The most commonly used pooling methods are the following:

•
Max pooling: replaces the input region with its maximum value.
• Average pooling (weighted average pooling): pools the input region by taking its average or a weighted sum, which can be based on the distance from the region center.
• The Fully Connected (FC) layer, which is at the end of the CNN design and is fully connected to every output neuron, is used for classification. To classify the input image, the FC layer first applies a linear combination and then an activation function after receiving an input vector. In the end, it returns a vector of size d, where d is the number of classes and each component is the likelihood that the input image belongs to a particular class.

Long Short-Term Memory
Long short-term memory (LSTM) is an artificial recurrent neural network architecture, frequently used in natural language processing. The LSTM was initially proposed by S. Hochreiter et al. [39], then improved in the article of F. Gers et al. [40] The main idea of recurrent neural networks is to have the ability of keeping a state over a long period of time. Moreover, this is the goal of LSTM cells, which have an internal memory called cell. This last one allows a state to be maintained as long as necessary, and consists of a numerical value that the network can control according to the situation.
An LSTM's overall operation can be divided into three steps: 1. Identifying pertinent information from the past, taken from the cell state through the forget gate; 2.
Using the input gate to choose from the current input those items that will be important in the long term. The cell state, which serves as long-term memory, will be added to these;

3.
Select the crucial short-term information from the newly created cell state and use the output gate to create the subsequent hidden state.
The LSTM defines a recurrence relation using an additional variable, which is the cell state c: The information transits from one cell to the next through two channels, h and c. At time t, these two channels are updated by the interaction between their previous values h t−1 , c t−1 and the current element of the sequence x t . Figure 2 shows the simplified diagram of an LSTM cell.
1. Identifying pertinent information from the past, taken from the cell state through the forget gate; 2. Using the input gate to choose from the current input those items that will be important in the long term. The cell state, which serves as long-term memory, will be added to these; 3. Select the crucial short-term information from the newly created cell state and use the output gate to create the subsequent hidden state.
The LSTM defines a recurrence relation using an additional variable, which is the cell The information transits from one cell to the next through two channels, h and c .
At time t , these two channels are updated by the interaction between their previous val-  The equations governing the three control gates are therefore the following: they are the application of the weighted sum followed by the application of an activation function.

•
Forget gate: The equations governing the three control gates are therefore the following: they are the application of the weighted sum followed by the application of an activation function.

•
Forget gate: The forget gate determines what information must be remembered and what can be forgotten. Data derived from the current input x t and hidden state h t−1 are absorbed by the sigmoid function. The values that Sigmoid produces range from 0 to 1. It draws a conclusion regarding the necessity of the old output's part (by giving the output closer to 1). The cell will eventually use this value of f t for point-by-point multiplication.
t: timestep, f t : forget gate at t, x t : input, h t−1 : previous hidden state, w f : weight matrix between forget gate and input gate, b t : connection bias at t. • Input gate: Updates to the cell state are made by the input gate using the following operations. To begin with, the second sigmoid function receives the current state, x t , and the previously hidden information, h t−1 . The values are specified to range from 0 (important) to 1 (not important). The same data from the current state and concealed state will then be transferred through the tanh function. The network will be controlled by the tanh operator, which will produce a vector ∼ C t containing every conceivable value between −1 and 1. Point-by-point multiplication can be performed on the output values produced by the activation functions.
Big Data Cogn. Comput. 2022, 6, 95 7 of 21 t: timestep, i t : input gate at t, x t : input, w i : weight matrix of sigmoid operator between input gate and output gate, b i : bias vector at t, ∼ C t : value generated by tanh, w c : weight matrix of tanh operator between cell state information and network output, b c : bias vector at t. The input gate and forget gate have provided the network with sufficient data. Making a decision and storing the data from the new state in the cell state come next. The forget vector f t multiplies the previous cell state C t−1 . Values will be removed from the cell state if the result is 0. The network then executes point-by-point addition on the output value of the input vector i t , which updates the cell state and gives the network a new cell state C t .
t: timestep, i t : input gate at t, ∼ C t : value generated by tanh, f t : forget gate at t, C t−1 : previous timestep.

•
Output gate: The value of the following hidden state is decided by the output gate. Information about prior inputs is contained in this state. First, the third sigmoid function receives the values of the current state and the prior concealed state. The tanh function is then applied to the new cell state that was created from the original cell state. These two results are multiplied one by one. The network determines which information the hidden state should carry based on the final value. For prediction, this hidden state is used.
t: timestep, o t : output gate at t, w 0 : weight matrix of output gate, b 0 : bias vector, h t : LSTM output.

The Meaningful Neural Network
The Meaningful Neural Network (MNN), is a novel neural network model that was first invented by [41], which allows learning features from different architecture/algorithm/ descriptive vectors representing data of different modalities, i.e., sound, image, text, etc., in a meaningful way.
The main idea of MNN is to dedicate a part named (specialized layers) of this network for learning each input vector component. Indeed, the learning of the latter's characteristics is realized independently in a significant manner.
The MNN Architecture contains three types of layers show ( Figure 3): • Specialized layers: Sets of neurons that are trained specifically to extract and learn the representations of the input vector components are present in each of these layers. Depending on how many components the input vector contains, we can often have any number of neuron sets. The weights' calculation and updating during the gradient backpropagation step can be expressed as follows: Forward Propagation: • Directive layer: One directive layer is present in the suggested architecture. While the other side is fully connected, the left side is semi-connected. By taking into consideration the two sets of neurons for the specialized layers, this layer enables the control of error propagation. Forward Propagation: Backward Propagation: • Fusing Layer: This layer is completely connected. It enables the merging of learned representations from earlier levels. Forward Propagation:

Proposed Method
The proposed method consists of three steps presented in the Figure 4 below: the first one is devoted to extracting the features of each single modality (unimodal). The second step is designed to merge the two modalities audio and text (bimodal). While the third one is concerned with the classification of the three inputs (audio, text, and bimodal) concatenated in the same vector using MNN to predict emotions.  v . We will follow the same sequence of [42] to extract the features of each statement independently. For the text modality, their features will be passed to an

Proposed Method
The proposed method consists of three steps presented in the Figure 4 below: the first one is devoted to extracting the features of each single modality (unimodal). The second step is designed to merge the two modalities audio and text (bimodal). While the third one is concerned with the classification of the three inputs (audio, text, and bimodal) concatenated in the same vector using MNN to predict emotions.

Proposed Method
The proposed method consists of three steps presented in the Figure 4 below: the first one is devoted to extracting the features of each single modality (unimodal). The second step is designed to merge the two modalities audio and text (bimodal). While the third one is concerned with the classification of the three inputs (audio, text, and bimodal) concatenated in the same vector using MNN to predict emotions.  v . We will follow the same sequence of [42] to extract the features of each statement independently. For the text modality, their features will be passed to an

Extraction of Unimodal Features
Let D = {v 1 , v 2 , v 3 , . . . , v n }, be the database that contains n videos, each video v i = {u i1 , u i2 , u i3 , . . . , u im } is represented by a set of utterances noted u ij with m the number of utterances in the video v i . We will follow the same sequence of [42] to extract the features of each statement independently. For the text modality, their features will be passed to an LSTM network to allow consecutive utterances in a video to represent relevant information in the feature extraction process.

Extraction of Text Modality
The first step in extracting the text modality is to prepare and transform the text of the utterances into a digital format. First, we represent each utterance as the concatenation of vectors for constituent words. In our work, we used GloVe, which is a publicly available word representation tool based on global frequency statistics [43], with a dimensionality equal to 300 pre-trained on 42 billion vocabulary words. This dictionary provides a 300-dimensional vector for each word. Exploiting this tool, it was thought to use CNN to extract the features of each utterance given its high discrimination capability. In sequence classification, each utterance depends on the other utterances in the same sequence. The statements in a video maintain a sequence. In a video, we assume that there is a high probability of dependence between statements with respect to their sentimental cues. In particular, we argue that when classifying an utterance, the other utterances may provide important contextual information. Thus, a model is needed that takes into account these interdependencies and the effect they may have on the target utterance. To capture this flow of informational triggers between utterances, we use a recurrent neural network (RNN) based on LSTM [44] to extract contextual information based on the results obtained from CNN.
For the CNN, the input consists of the 300-dimensional pre-trained GloVe vectors. The proposed approach consists of three convolution layers: the first one contains 40 filters of size 3 × 3, and the second and the third layer contain, respectively, 100 and 150 filters of the same size (4 × 4). Each convolution layer uses a Stride equal to 2 and an identical Padding. At the end of each convolution layer, we apply the ReLu activation function. The pooling layer comes after each convolution layer. We chose the Max pooling with a 2 × 2 filter. The convoluted features are then concatenated and introduced into a fully connected layer of 1611 dimensions, whose activations form the representation of the statement.
The feature vector of the text generated by the CNN for each utterance will be grouped in a matrix that will be transmitted to the LSTM as input. The LSTM is capable of learning long-term dependencies. Its special structure with input, output, and forgetting gates controls the identification of the long-term sequence pattern. The process of the text vector is presented in the Figure 5. LSTM network to allow consecutive utterances in a video to represent relevant information in the feature extraction process.

Extraction of Text Modality
The first step in extracting the text modality is to prepare and transform the text of the utterances into a digital format. First, we represent each utterance as the concatenation of vectors for constituent words. In our work, we used GloVe, which is a publicly available word representation tool based on global frequency statistics [43], with a dimensionality equal to 300 pre-trained on 42 billion vocabulary words. This dictionary provides a 300dimensional vector for each word. Exploiting this tool, it was thought to use CNN to extract the features of each utterance given its high discrimination capability. In sequence classification, each utterance depends on the other utterances in the same sequence. The statements in a video maintain a sequence. In a video, we assume that there is a high probability of dependence between statements with respect to their sentimental cues. In particular, we argue that when classifying an utterance, the other utterances may provide important contextual information. Thus, a model is needed that takes into account these interdependencies and the effect they may have on the target utterance. To capture this flow of informational triggers between utterances, we use a recurrent neural network (RNN) based on LSTM [44] to extract contextual information based on the results obtained from CNN.
For the CNN, the input consists of the 300-dimensional pre-trained GloVe vectors. The proposed approach consists of three convolution layers: the first one contains 40 filters of size 3 × 3, and the second and the third layer contain, respectively, 100 and 150 filters of the same size (4 × 4). Each convolution layer uses a Stride equal to 2 and an identical Padding. At the end of each convolution layer, we apply the ReLu activation function. The pooling layer comes after each convolution layer. We chose the Max pooling with a 2 × 2 filter. The convoluted features are then concatenated and introduced into a fully connected layer of 1611 dimensions, whose activations form the representation of the statement.
The feature vector of the text generated by the CNN for each utterance will be grouped in a matrix that will be transmitted to the LSTM as input. The LSTM is capable of learning long-term dependencies. Its special structure with input, output, and forgetting gates controls the identification of the long-term sequence pattern. The process of the text vector is presented in the Figure 5.

Extraction of Audio Modality
To extract acoustic features. There are popular audio feature extraction toolkits available for free that are capable of extracting all fundamental features. We used The OpenSMILE Feature Extraction Toolkit [44], which brings together feature extraction

Extraction of Audio Modality
To extract acoustic features. There are popular audio feature extraction toolkits available for free that are capable of extracting all fundamental features. We used The OpenS-MILE Feature Extraction Toolkit [44], which brings together feature extraction algorithms from the speech processing and music information retrieval communities. The speech features captured by this collection include loudness, CHROMA, CENES, Mel cepstral coefficient features, and perceptual linear predictive cepstral features. Additionally, it records crucial elements such as format frequency and fundamental frequency. We computed three speech features-the MFCCS, loudness, and CHROMA features-using this package.
At a sliding window of 100 ms and a frame rate of 30 Hz, the audio features are extracted. Voice normalization is carried out and the voice intensity is thresholded to distinguish between voiced and unvoiced samples in order to compute the features. To combine with the other modalities, the generated features are further downsampled to 49.

The Fusion of Text and Audio Modalities
Fusion is one of the original topics in multimodal machine learning. In this context, we have developed a third modality to study the characteristics of text and audio fusion. We believe that this resulting vector can be used to effectively improve the classification effect.
A feature vector of 49 dimensions from the acoustic network and a feature vector of 1611 dimensions from the textual network are merged into a single vector that will present the modality (bimodal) that will be used later as a third component of the MNN network to classify emotions.

Classification
After having prepared the input vector of our system, we arrive at the fundamental part, which consists of the classification of the emotions. For this, the choice of classifier is very important because it can also significantly impact the accuracy of the recognition.
In our case, we used the MNN [41]. Its architecture is characterized by significant learning, which is very adapted to our system. As far as we know, most of the architectures that use the concatenation of feature vectors do not take into account the different components of the resulting vector. In fact, they consider this vector as a single entity while ignoring its constituent components. On the other hand, the MNN network architecture allows the components of the global feature vector to be learned in a meaningful way. It dedicates to each of its components a set of neurons belonging to a number of hidden layers.
The input vector of our system consists of three components: the text component with dimension 1611, the audio component with dimension 49, and the bimodal component with dimension 898.
At the input of the classification step, each input vector component is made of a number of neurons. We choose 500 neurons for text, 50 neurons for audio, and 300 for bimodal. The first 2 layers, named specialized layers, are accessed by the 3 components. Then, they move to the Directive layer and, lastly, to the fusion layer, before recognizing the final emotions. The passage from one layer to another is performed with a minimal number of neurons compared to the previous layer and by calculating the weights and their updates at each passage, with the backward and forward propagation formulas mentioned in Section 3. Moreover, the audio component keeps the same result at the fusion level even by increasing the number of neurons at the input. Comparing the components, the number of neurons used in text and bimodal at the input of the classification step is higher than the audio component. This justifies the absence of neurons for the audio component in the directive layer.
The parameters of each component within the MNN network are represented in the Table 1 below.

Computational Environment
The Python toolbox was used to implement our proposed model. The experiments were executed on an Asus desktop (Taiwanese multinational computer and phone hardware and electronics company headquartered in Beitou District, Taipei, Taiwan), which has 8 GB RAM, Intel (R) Core (TM) i7-8550U CPU @ 1.80 GHz (8 CPUs),~2.0 GHz, and Windows 10 as the operating system. Moreover, the Google Colab cloud service was used to train the proposed architecture.

Dataset
The Multimodal Emotion Lines Dataset (MELD) [13,14] is an evolved version of the EmotionLines Dataset. MELD has the same dialogue instances as those available in EmotionLines, but additionally includes audio, visual modality, and text. The MELD has over 1400 dialogues and 13,000 utterances from the Friends television series, with the dialogue samples grouped into 1039 for training, 114 for validation, and 280 for testing, and the utterances samples grouped into 9989 for training, 1109 for validation, and 2610 for testing.
The utterances in each dialogue were annotated with any of these seven emotions (Anger, Disgust, Fear, Joy, Neutral, Sadness, and Surprise) were coded as 0, 1, 2, 3, 4, 5, and 6 annotation labels. This annotation list was extended with two additional emotion labels: Neutral and Non-Neutral. Label distributions in the training, validation, and test datasets can be seen in Table 2. Figure 6 shows an example of dialogue extracted from the MELD dataset.  Figure 6. An example of dialogue extract from MELD dataset [13,14].

Evaluation Metrics
We evaluated the performance of our model using the following evaluation metrics: Accuracy, Precision, Recall, and F1-score. The expression of these metrics is given as follows: • Accuracy: the easiest performance metric to understand is accuracy, which is just the proportion of correctly predicted observations to all observations. (TN) A test outcome that accurately demonstrates the absence of a condition or characteristic.
(FN) A test result that falsely suggests the presence of a certain condition or attribute.
(FP) A test result that falsely suggests the absence of a certain condition or attribute. Moreover, we elaborated the macro average and weighted average metrics, we mounted the cost, training accuracy, and training accuracy/validation accuracy curves according to the epochs, and finally we displayed the confusion matrix.

Performance Evaluation
We evaluated the performance of our system through the following steps: the first contains a detailed study of the audio, text, and bimodal (text + audio) modalities; the second presented a test on the multimodal system, and the third, a comparison with existing works.

Performance Study on Audio, Text, and Bimodal Modalities
The training, validation, and test sets for the MELD dataset were already separated, as was previously announced. Since we needed the findings, we constructed our model, adjusted the training hyperparameters based on the training and validation sets, and tested the model on the test set.

Evaluation Metrics
We evaluated the performance of our model using the following evaluation metrics: Accuracy, Precision, Recall, and F1-score. The expression of these metrics is given as follows: • Accuracy: the easiest performance metric to understand is accuracy, which is just the proportion of correctly predicted observations to all observations. (FN) A test result that falsely suggests the presence of a certain condition or attribute.
(FP) A test result that falsely suggests the absence of a certain condition or attribute. Moreover, we elaborated the macro average and weighted average metrics, we mounted the cost, training accuracy, and training accuracy/validation accuracy curves according to the epochs, and finally we displayed the confusion matrix.

Performance Evaluation
We evaluated the performance of our system through the following steps: the first contains a detailed study of the audio, text, and bimodal (text + audio) modalities; the second presented a test on the multimodal system, and the third, a comparison with existing works.

Performance Study on Audio, Text, and Bimodal Modalities
The training, validation, and test sets for the MELD dataset were already separated, as was previously announced. Since we needed the findings, we constructed our model, adjusted the training hyperparameters based on the training and validation sets, and tested the model on the test set.
The tables above display the metrics of the three modalities obtained with the MELD dataset. Each metric for a given modality generates different values for every emotion. This block of values (modality, emotion) is called the test classification results.
From Table 3, the maximum value of the precision for the text modality concerns the anger emotion (90%), and with a value of 98% for the disgust emotion in the two other modalities. Concerning the metric Recall, the text modality reaches a maximum value of 95% for the emotion surprise, 97% for the audio modality with the disgust emotion, and 98% for the same emotion for the bimodal modality. Regarding the third metric, F1score registers 89% as a value for the modality text (emotion surprise), a value of 97% for audio, and 98% for bimodal with emotion disgust. Table 3. Classification report results for the MELD on text, audio, and bimodal modalities in the function of (Precision, Recall, and F1-score) (Unit = %). Concerning the values of Macro average and weighted average metrics, they gave a general view on all the samples of the dataset used.

Emotion
According to Table 4, the bimodal has the highest value of Accuracy with a rate of 86.51%. In second place, the text modality achieves 83.98% as a value, and in last place, the audio modality has a value of 81.79%.  9 show the training cost of the three modalities. We can see that the maximum value is reached around 0 epochs and is reduced by increasing the number of epochs. The cost achieves the value of 0.0115 for the text modality, 0.0634 for the audio, and 0.0046 for the bimodal.
The tables above display the metrics of the three modalities obtained with the MELD dataset. Each metric for a given modality generates different values for every emotion. This block of values (modality, emotion) is called the test classification results.
From Table 3, the maximum value of the precision for the text modality concerns the anger emotion (90%), and with a value of 98% for the disgust emotion in the two other modalities. Concerning the metric Recall, the text modality reaches a maximum value of 95% for the emotion surprise, 97% for the audio modality with the disgust emotion, and 98% for the same emotion for the bimodal modality. Regarding the third metric, F1-score registers 89% as a value for the modality text (emotion surprise), a value of 97% for audio, and 98% for bimodal with emotion disgust. Table 3. Classification report results for the MELD on text, audio, and bimodal modalities in the function of (Precision, Recall, and F1-score) (Unit = %). Concerning the values of Macro average and weighted average metrics, they gave a general view on all the samples of the dataset used.

Emotion
According to Table 4, the bimodal has the highest value of Accuracy with a rate of 86.51%. In second place, the text modality achieves 83.98% as a value, and in last place, the audio modality has a value of 81.79%.  9 show the training cost of the three modalities. We can see that the maximum value is reached around 0 epochs and is reduced by increasing the number of epochs. The cost achieves the value of 0.0115 for the text modality, 0.0634 for the audio, and 0.0046 for the bimodal.   In Figures 10-12, we visualize the training accuracy of the three modalities. We notice that for the latter, a maximum value is reached, then a decrease to resume a maximum value, which stabilizes from a certain number of epochs. For the text modality, the maximum value of training accuracy is 83.76%. For the audio modality, it is 96.92%. Then, it reaches 98.03% as the value for the bimodal.     In Figures 10-12, we visualize the training accuracy of the three modalities. We notice that for the latter, a maximum value is reached, then a decrease to resume a maximum value, which stabilizes from a certain number of epochs. For the text modality, the maximum value of training accuracy is 83.76%. For the audio modality, it is 96.92%. Then, it reaches 98.03% as the value for the bimodal.   In Figures 10-12, we visualize the training accuracy of the three modalities. We notice that for the latter, a maximum value is reached, then a decrease to resume a maximum value, which stabilizes from a certain number of epochs. For the text modality, the maximum value of training accuracy is 83.76%. For the audio modality, it is 96.92%. Then, it reaches 98.03% as the value for the bimodal.    In Figures 10-12, we visualize the training accuracy of the three modalities. We notice that for the latter, a maximum value is reached, then a decrease to resume a maximum value, which stabilizes from a certain number of epochs. For the text modality, the maximum value of training accuracy is 83.76%. For the audio modality, it is 96.92%. Then, it reaches 98.03% as the value for the bimodal.     In Figures 10-12, we visualize the training accuracy of the three modalities. We notice that for the latter, a maximum value is reached, then a decrease to resume a maximum value, which stabilizes from a certain number of epochs. For the text modality, the maximum value of training accuracy is 83.76%. For the audio modality, it is 96.92%. Then, it reaches 98.03% as the value for the bimodal.

Performance Results on Multimodal System
Above, we have studied the recognition of emotions for the text, audio, and bimodal modalities separately. In what follows, we go on with the multimodal modality of the proposed method. These three figures (Figures 13-15) show the variation in training accuracy and validation accuracy according to the number of epochs. We notice that the three curves are almost superimposed and reach close maximum values.

Performance Results on Multimodal System
Above, we have studied the recognition of emotions for the text, audio, and bimodal modalities separately. In what follows, we go on with the multimodal modality of the proposed method. These three figures (Figures 13-15) show the variation in training accuracy and validation accuracy according to the number of epochs. We notice that the three curves are almost superimposed and reach close maximum values.

Performance Results on Multimodal System
Above, we have studied the recognition of emotions for the text, audio, and bimodal modalities separately. In what follows, we go on with the multimodal modality of the proposed method. These three figures (Figures 13-15) show the variation in training accuracy and validation accuracy according to the number of epochs. We notice that the three curves are almost superimposed and reach close maximum values.

Performance Results on Multimodal System
Above, we have studied the recognition of emotions for the text, audio, and bimodal modalities separately. In what follows, we go on with the multimodal modality of the proposed method.

Performance Results on Multimodal System
Above, we have studied the recognition of emotions for the text, audio, and bimodal modalities separately. In what follows, we go on with the multimodal modality of the proposed method. Using the multimodal system on the MELD dataset, we notice that the evaluation metrics reach a maximum value of 98% for the disgust emotion (Table 5). Comparing the single modalities, we can see equality between the bimodal and multimodal values, which is not the case. Our method performs better by recording an advanced decimal value (the values are rounded by the system). Table 5. Classification report results for the MELD on the multimodal system in the function of (Precision, Recall, F1-score) (Unit = %).

Multimodal Precision
Recall F1-Score Accuracy The accuracy reaches 86.69%, where an improvement is detected with the use of a multimodal system.
Regarding the loss indicator ( Figure 16), it registers the value of 0.0039, which is lower from a certain number of epochs compared to other previous systems. For training accuracy (Figure 17), the system reaches its peak with a value of 98.42%. Using the multimodal system on the MELD dataset, we notice that the evaluation metrics reach a maximum value of 98% for the disgust emotion (Table 5). Comparing the single modalities, we can see equality between the bimodal and multimodal values, which is not the case. Our method performs better by recording an advanced decimal value (the values are rounded by the system). Table 5. Classification report results for the MELD on the multimodal system in the function of (Precision, Recall, F1-score) (Unit = %). The accuracy reaches 86.69%, where an improvement is detected with the use of a multimodal system.

Emotion
Regarding the loss indicator ( Figure 16), it registers the value of 0.0039, which is lower from a certain number of epochs compared to other previous systems. For training accuracy (Figure 17), the system reaches its peak with a value of 98.42%.   Using the multimodal system on the MELD dataset, we notice that the evaluation metrics reach a maximum value of 98% for the disgust emotion (Table 5). Comparing the single modalities, we can see equality between the bimodal and multimodal values, which is not the case. Our method performs better by recording an advanced decimal value (the values are rounded by the system). Table 5. Classification report results for the MELD on the multimodal system in the function of (Precision, Recall, F1-score) (Unit = %). The accuracy reaches 86.69%, where an improvement is detected with the use of a multimodal system.

Emotion
Regarding the loss indicator ( Figure 16), it registers the value of 0.0039, which is lower from a certain number of epochs compared to other previous systems. For training accuracy (Figure 17), the system reaches its peak with a value of 98.42%.  In Figure 18, the training accuracy and validation accuracy curves are almost superimposed and this regenerates the absence of the overfitting problem.  In Figure 18, the training accuracy and validation accuracy curves are almost superimposed and this regenerates the absence of the overfitting problem. Figure 18. Training accuracy/validation accuracy versus epochs for the multimodal system.
In the multimodal confusion matrix (Figure 19), the maximum value of the true label is 6744 for the disgust class. This represents the high value in this matrix. Our multimodal fusion model outperformed the unimodal findings, according to all evaluation metrics.

Comparison with State-of-the-Art Methods
In order to evaluate the performance of our approach, we have compared it with other state-of-the-art methods. We have considered relevant approaches that give a good Accuracy regarding the MELD dataset, and we have drawn a comparison with the methods proposed by [34][35][36][37]. The obtained results are presented in Table 6. Table 6. A brief comparison of our proposed approach with other related works in the function of Accuracy (unit = %).

Approaches
Year Accuracy Zheng et al. [34] 2021 62.0 Hui et al. [35] 2022 63.69 Siriwardhana et al. [36] 2020 64.3 Baijun et al. [37] 2021 65 Our proposed approach 2022 86.69 In the multimodal confusion matrix (Figure 19), the maximum value of the true label is 6744 for the disgust class. This represents the high value in this matrix.  Figure 17. Training accuracy versus epochs for the multimodal system.
In Figure 18, the training accuracy and validation accuracy curves are almost superimposed and this regenerates the absence of the overfitting problem. Figure 18. Training accuracy/validation accuracy versus epochs for the multimodal system.
In the multimodal confusion matrix (Figure 19), the maximum value of the true label is 6744 for the disgust class. This represents the high value in this matrix. Our multimodal fusion model outperformed the unimodal findings, according to all evaluation metrics.

Comparison with State-of-the-Art Methods
In order to evaluate the performance of our approach, we have compared it with other state-of-the-art methods. We have considered relevant approaches that give a good Accuracy regarding the MELD dataset, and we have drawn a comparison with the methods proposed by [34][35][36][37]. The obtained results are presented in Table 6. Table 6. A brief comparison of our proposed approach with other related works in the function of Accuracy (unit = %).

Comparison with State-of-the-Art Methods
In order to evaluate the performance of our approach, we have compared it with other state-of-the-art methods. We have considered relevant approaches that give a good Accuracy regarding the MELD dataset, and we have drawn a comparison with the methods proposed by [34][35][36][37]. The obtained results are presented in Table 6. Table 6. A brief comparison of our proposed approach with other related works in the function of Accuracy (unit = %).

Conclusions and Futures Work
Through this work, we have proposed a multimodal emotion recognition system for conversations. This system consists of a model with two different modalities (text, audio) that are respectively extracted with (CNN, LSTM) and OpenSmile. A third modality is built from the fusion of the audio and text features into a single vector. All three modalities feed a meaningful neural network and result in allowing good feature learning to accurately identify emotional states. Moreover, due to the meaningful neural network structure, the proposed architecture can be robustly extended to a larger set of input modalities. The obtained results demonstrate the performance of our proposed approach. Regarding our future works, we plan to deeply study temporal performance in order to construct a real-time multimodal emotion recognition system focusing on the visual modality as well as evaluating other multimodal datasets.