Multimodal Emotion Recognition Using the Symmetric SELM-LUPI Paradigm

Lingzhi Yang 1,2, Xiaojuan Ban 1,*, Michele Mukeshimana 3 and Zhe Chen 4
1 School of Computer and Communication Engineering, Beijing Key Laboratory of Knowledge Engineering for Materials Science, University of Science and Technology Beijing, Beijing 100083, China; yanglingzhi@citicsteel.com
2 Citic Pacific Special Steel Holdings Qingdao Special Iron and Steel Co., Ltd., Qingdao 266000, China
3 Faculty of Engineering Sciences, University of Burundi, P.O. Box 1550, Bujumbura, Burundi; Mukeshimana@ustb.edu.cn
4 Qingdao Hisense Group Co., Ltd., Qingdao 266000, China; chenzhe@hisense.com
* Correspondence: banxj@ustb.edu.cn; Tel.: +86-158-0658-0782


Introduction
The development of interactive technologies has enabled humans to communicate with computing devices in a more natural way. An important part of natural interaction is emotion, which plays a key role in how people interact with one another. Emotion recognition is a challenging research task, but progress in it improves human-computer interaction in applications such as automatic counseling systems and automatic answering machines. If a system can identify a user's emotions effectively, it can genuinely help patients, the elderly, and the disabled.
Human emotional expression is multi-channel, and humans infer the meaning of an expression from different cues, such as facial expressions, vocal features, speech content, and the body language of posture and gestures. There are also physiological responses, such as heartbeat rhythm, blood pressure, and brain activity, although measuring them requires physical examination equipment.
At present, computers can record and process a user's input information, including voice, video, bio-electronic signals, and text. However, the recognition ability of computers remains well below that of people. One of the main problems is how to infer people's true intentions from their multiple forms of expression, so recognizing emotions through multiple channels is especially important.

Related Work
Multimodal emotion recognition refers to recognizing emotions by combining two or more modalities of information. Humans express emotions in a variety of ways and combine multiple sources of information to recognize the emotions of others; computers likewise try to combine information sources in multiple ways to approach the human level of emotion recognition.
Poria S., Chaturvedi I., Cambria E. et al. [1] proposed a temporal Convolutional Neural Network to extract features from visual and text modalities. Tzirakis P., Trigeorgis G., Nicolaou M.A. et al. [2] proposed an emotion recognition system using auditory and visual modalities; they used a Convolutional Neural Network to extract features from speech and a 50-layer deep residual network for the visual modality. Latha G.C.P. and Priya M.M. [3] also used a Convolutional Neural Network model with multiple signal processing features. Huang Y., Yang J., Liao P. et al. [4] proposed two multimodal fusion methods between brain and peripheral signals for emotion recognition, with electroencephalogram and facial expression as the input signals. Torres-Valencia C. et al. [5] proposed SVM-based feature selection methods for emotion recognition from multimodal data. Chan W.L., Song K.Y., Jeong J. et al. [6] proposed convolutional attention networks for multimodal emotion recognition from speech and text data.
The Learning Using Privileged Information (LUPI) paradigm has mostly been applied with the Support Vector Machine plus (SVM+) algorithm. Feyereisl et al. [7] worked on the importance and incorporation of privileged information in cluster analysis, and their method improved clustering performance. Ji et al. [8] proposed multi-task multi-class learning using privileged information on support vector machines and obtained improved results for multi-task multi-class problems. Liu et al. [9] empirically demonstrated an improvement of ν-Support Vector Classification and Regression by using privileged information to solve practical problems in their experiments. Recently, Wang et al. [10] recognized an audience's emotion from EEG signals with the help of the stimulus videos, and tagged the videos' emotions with the aid of Electroencephalogram (EEG) signals; their implicit fusion performed comparably to, or even better than, methods based on explicit fusion. In all the aforementioned experiments, the most used algorithm is SVM+ (Algorithm 1), and the complexity of SVM parameterization considerably increased the training time. As an alternative, the exploitation of privileged information has been extended to the Extreme Learning Machine (ELM) method, which is faster and requires less parameterization.
The research above improves classification accuracy by changing the extracted features and trying different information fusion models. How to establish a real-time, stable automatic recognition system, one that automatically detects, models, and generates natural interactions, remains an important open problem in multimodal emotion recognition design. The focus of this paper is feature extraction and data fusion, with the goals of improving classification accuracy and shortening recognition time.
In summary, multimodality is a typical feature of emotional expression, and automatic multimodal emotion recognition systems mainly rely on information fusion technology. Decision-level fusion accounts for many published results because it is easier to implement than the fusion of raw data and features, but data- and feature-level fusion is more accurate. Many achievements have been made in the field of multimodal emotion recognition, yet natural, immediate, and accurate emotional interaction is still an elusive goal. Advances in automatic feature extraction techniques support the study of machine learning algorithms such as support vector machines and extreme learning machines, and these results help move the field toward natural, instant emotional interaction.

The New Method of Symmetric S-ELM-LUPI (Symmetric Sparse Extreme Learning Machine Learning Using Privileged Information) for Multimodal Emotion Recognition
When learning using privileged information, the training set is composed of triplets: standard variables X, privileged variables X*, and the corresponding labels Y. The testing set contains only the standard variables X and the labels. During the learning process of ELM, the vector X of standard information is mapped into the hidden-layer feature space by h(x), and the vector X* of privileged information is mapped into the hidden-layer correcting space by h*(x*). The two kernel functions h(x) and h*(x*) can be different or the same. Figure 1 illustrates the single hidden layer feedforward neural network (SLFN) representation including the LUPI paradigm.

Figure 1 represents the flow of the processes in the ELM using privileged information, with a standard space (X input space) of dimension m and a privileged information space (X* input space) of dimension mp. There are L hidden nodes in the hidden layer and C classes. Figure 1 is a simplified representation: the computation treats the two input spaces as independent spaces to learn in parallel and defines two mapping functions h(x) and h*(x*), which can be the same or different; both are mapped into the same decision space.
The introduction of the LUPI model in S-ELM originates in the optimization-based ELM method. In ELM, the slack variables are unknown to the learner; if an oracle can give more information, they can be estimated by a correcting function defined on that additional information. The correcting function, which estimates the slack value, is computed in the correcting space, and the ELM optimization problem is rewritten accordingly, where β* is the correcting weight connecting the hidden nodes to the output node in the correcting space and γ is introduced for regularization. To minimize this functional subject to its constraints, a Lagrangian function is formed with multipliers α_i > 0 and µ_i > 0, which are non-negative. Solving the optimization problem requires finding the saddle point of the Lagrangian (the minimum with respect to β and β*, and the maximum with respect to α_i and µ_i), i = 1, ..., N.
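The displayed equations in this passage can be written out explicitly. The following is a reconstruction sketch consistent with the standard ELM-LUPI formulation and with the notation of the surrounding text, not the paper's exact typesetting:

```latex
% Correcting function estimating the slack values
\xi_i = h^{*}(x_i^{*})\,\beta^{*}

% ELM optimization problem with privileged information
\min_{\beta,\,\beta^{*}} \;\; \frac{1}{2}\lVert\beta\rVert^{2}
  + \frac{\gamma}{2}\lVert\beta^{*}\rVert^{2}
  + C\sum_{i=1}^{N} h^{*}(x_i^{*})\,\beta^{*}
\quad \text{s.t.} \quad
h(x_i)\,\beta = t_i - h^{*}(x_i^{*})\,\beta^{*},
\qquad h^{*}(x_i^{*})\,\beta^{*} \ge 0,\; i = 1,\dots,N

% Lagrangian with non-negative multipliers alpha_i and mu_i
L(\beta,\beta^{*},\alpha,\mu) = \frac{1}{2}\lVert\beta\rVert^{2}
  + \frac{\gamma}{2}\lVert\beta^{*}\rVert^{2}
  + C\sum_{i=1}^{N} h^{*}(x_i^{*})\,\beta^{*}
  - \sum_{i=1}^{N}\alpha_i\bigl[h(x_i)\,\beta - t_i + h^{*}(x_i^{*})\,\beta^{*}\bigr]
  - \sum_{i=1}^{N}\mu_i\, h^{*}(x_i^{*})\,\beta^{*}
```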
The KKT optimality conditions of the Lagrangian lead to a dual optimization problem in which K and K* are two kernels in two different spaces, namely the decision space and the correcting space; from its solution the decision function and the corresponding correcting function are computed. In this computation, the two kernels define similarity between objects in the decision and correcting spaces. The decision function value depends directly on the kernel defined in the decision space, but it still receives the contribution of the additional knowledge through the coefficients α, which depend on the similarity measures in both spaces.
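Setting the derivatives of the Lagrangian described above to zero gives the conditions below. Again, this is a reconstruction sketch consistent with the surrounding derivation: the stationarity conditions express β and β* through the multipliers, and the kernels then yield the decision and correcting functions.

```latex
% KKT stationarity conditions (reconstruction sketch)
\frac{\partial L}{\partial \beta} = 0 \;\Rightarrow\;
  \beta = \sum_{i=1}^{N} \alpha_i\, h(x_i)^{\mathsf{T}},
\qquad
\frac{\partial L}{\partial \beta^{*}} = 0 \;\Rightarrow\;
  \beta^{*} = \frac{1}{\gamma}\sum_{i=1}^{N} (\alpha_i + \mu_i - C)\, h^{*}(x_i^{*})^{\mathsf{T}}

% Decision function, with K defined in the decision space
f(x) = h(x)\,\beta = \sum_{i=1}^{N} \alpha_i\, K(x_i, x),
\qquad K(x_i, x) = h(x_i)\, h(x)^{\mathsf{T}}

% Corresponding correcting function in the correcting space
\xi(x^{*}) = h^{*}(x^{*})\,\beta^{*}
  = \frac{1}{\gamma}\sum_{i=1}^{N} (\alpha_i + \mu_i - C)\, K^{*}(x_i^{*}, x^{*})
```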
The proposed algorithm for Symmetric Sparse Extreme Learning Machine Learning Using Privileged Information is summarized as follows:
Input: Training set X, privileged information set X*, hidden node number L, and activation functions g and g*.
Output: The prediction of the approximated function f(x).
1. Start.
2. Randomly generate the input weights W for the standard training set.
3. Randomly generate the input weights W* for the privileged information set.
4. Calculate the hidden node output matrices H and H*.
5. Compute the output weight β by solving the dual expression of the optimization according to Equation (5) and constraints (6).
6. Compute the decision function f(x), the predictive function.
7. End.
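The steps above can be sketched in code. The listing below is an illustrative, simplified least-squares variant of ELM with privileged information, not the dual optimization of Equation (5) and constraints (6) used in the paper: the slack is modelled by a correcting output H*β*, both weight vectors are obtained in one regularized solve, and only the standard branch is used at test time. All function and variable names are ours.

```python
import numpy as np

def selm_lupi_train(X, Xp, T, L=100, gamma=1.0, C=1.0, seed=0):
    """Simplified least-squares sketch of ELM trained with privileged
    information (LUPI).  X: standard features, Xp: privileged features,
    T: targets.  Returns the standard input weights W and output weight
    beta; the privileged branch is used only during training."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], L))    # step 2: standard input weights
    Wp = rng.standard_normal((Xp.shape[1], L))  # step 3: privileged input weights
    H = np.tanh(X @ W)                          # step 4: hidden output, decision space
    Hp = np.tanh(Xp @ Wp)                       # step 4: hidden output, correcting space
    # Step 5 (approximated): solve min ||H b + H* b* - T||^2 plus the
    # regularization terms (1/C)||b||^2 and gamma ||b*||^2 in one system.
    A = np.hstack([H, Hp])
    reg = np.diag(np.r_[np.full(L, 1.0 / C), np.full(L, gamma)])
    B = np.linalg.solve(A.T @ A + reg, A.T @ T)
    beta = B[:L]                                # standard-branch output weight
    return W, beta

def selm_lupi_predict(X, W, beta):
    # Step 6: at test time only the standard modality is available.
    return np.tanh(X @ W) @ beta
```

Because the privileged columns are discarded after training, the test-time cost depends only on the standard feature dimension, which mirrors the LUPI setting described in the text.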

Data Preparation and Feature Extraction
The data set created by Martin et al. [8] is used as the experimental data in this paper. It can be downloaded for free and contains two modalities of data records: speech and facial expressions.
There were 42 participants of 14 different nationalities. Each participant received five sentences and expressed them aloud for six different emotions (namely happiness, fear, disgust, surprise, sadness, and anger). 81% of the participants were male and the rest were female; all of them spoke English. The samples were selected randomly. This data set belongs to the induced type of emotional expression.
There were 1166 video sequences, of which 264 were of female subjects and the rest of male subjects. This database has the advantages of being close to reality and of being easy and free to acquire. Before the experiments, the data set went through a series of pre-processing procedures.

Data Processing
The eNTERFACE'05 audiovisual data set was created by Martin et al. The original file is in zip format. The unzipped folder contains 44 folders corresponding to 44 recorded subjects. Each subject's folder contains six folders corresponding to the six types of emotions: anger, disgust, fear, happiness, sadness, and surprise.
In this study, each file contains audio speech and facial expressions; the main research modalities are sound and vision. Before feature extraction, the two modalities are extracted and separated, yielding two new entities: one contains the visual images, and the other contains the sound signal.
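As one hedged example of how this modality separation might be scripted (the paper does not name its tooling, so the use of ffmpeg, the paths, and the output naming here are assumptions), the commands for producing an audio-only and a video-only copy of one record can be built as follows:

```python
from pathlib import Path

def split_commands(video_path):
    """Build ffmpeg argument lists that separate one audiovisual record
    into an audio-only file and a video-only file.  Illustrative only:
    paths and naming conventions are not from the paper."""
    src = Path(video_path)
    audio_out = src.with_suffix(".wav")
    video_out = src.with_name(src.stem + "_silent" + src.suffix)
    # -vn drops the video stream; -an drops the audio stream.
    audio_cmd = ["ffmpeg", "-i", str(src), "-vn", str(audio_out)]
    video_cmd = ["ffmpeg", "-i", str(src), "-an", str(video_out)]
    return audio_cmd, video_cmd
```

The commands would then be run once per file, e.g. with `subprocess.run`, before the feature extraction stage.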


Feature Extraction
Nj represents the number of frames contained in record j, and a_i represents the feature value in frame i; the per-record feature values are aggregated over the frames. Because the obtained values have different scales, they are normalized so that the processed values lie in the range 0 to 1. All extracted features undergo this standardization phase, and the normalization is done separately, file by file. The final number of features obtained in each category is given in Table 1.
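A minimal sketch of the per-file min-max normalization described above (the function name and the column-wise handling are our assumptions):

```python
import numpy as np

def normalize_file(features):
    """Min-max scale one file's feature matrix to [0, 1], column-wise.
    Normalization is done file by file, as in the text; constant
    columns are mapped to 0 to avoid division by zero."""
    f = np.asarray(features, dtype=float)
    lo, hi = f.min(axis=0), f.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard constant columns
    return (f - lo) / span
```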

Experimental Design
The standard information and the privileged information are symmetric: one modal feature set is used as the standard information, and the other modal feature set is used as the privileged information. The audio features are treated as one entity, and the multiple facial expression related features are regarded as one sample space. As a result, we obtain the fusion combinations shown in Figure 3.
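The symmetric pairing can be sketched as a simple enumeration. The feature-set names follow the paper, while the helper itself is illustrative:

```python
def symmetric_pairs(audio, visual_sets):
    """Enumerate the symmetric standard/privileged combinations:
    each modality serves once as the standard set and once as the
    privileged set for the other."""
    pairs = []
    for v in visual_sets:
        pairs.append((audio, v))  # audio standard, visual privileged
        pairs.append((v, audio))  # visual standard, audio privileged
    return pairs
```

With the audio set and the three visual sets (EOH, LBP, LDN), this yields six standard/privileged combinations.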

Categories and feature numbers (Table 1): Audio, 53.

In the first group, the audio information is represented at the top as the standard information set, and the facial expressions are indicated at the bottom as the privileged information; in the second group, the opposite holds.
This experiment has three goals. The first goal is to test the effect of multimodal emotion recognition in a realistic setting: each modality can be treated as privileged information for the other. The facial expression feature data are divided into three groups, while the audio data are not further separated by feature because of their small number of features.
The second goal is to compare the proposed method with others and assess its applicability. The method is compared with neural-network-based machine learning methods such as the extreme learning machine (ELM), sparse extreme learning machine (S-ELM), extreme learning machine learning using privileged information (ELM-LUPI), and support vector machine (SVM). For a fair comparison, the selected parameters, such as the number of hidden nodes, must be highly similar and comparable.
The third goal is to improve recognition accuracy and execution efficiency: the smaller the recognition time, the better the chance the proposed algorithm has of serving actual applications.

Analysis of Results
According to the three objectives of the experiment, the analysis of the experimental results is divided into the following three parts.


Applicability Analysis
The purpose of the first type of experiment was to assess the applicability of the proposed method to multimodal emotion recognition. Audio and facial expressions are used as the standard and privileged information sets, respectively, and the corresponding results are shown in Table 2. The 'Dataset' column indicates which data set is used as standard or privileged information; the main data sets are the audio-based set and the visually relevant sets (EOH, LBP, and LDN). The best performance in the experimental results corresponds to the minimum average execution time and the maximum average recognition rate.
There are six different emotional states (anger, disgust, fear, happiness, sadness, and surprise). The values in the table correspond to a hidden node number of 1000.
The recognition accuracy and the mean test time per emotion type are shown in Figure 4.

It was observed that the method is suitable for automatic learning in multimodal emotion recognition, with a recognition accuracy consistently above 80%. Because of learning using privileged information (LUPI), the proposed method always improves the recognition rate on unknown samples; Figure 4 depicts this improvement.
In Figure 5, the recognition rate on the training set is expressed as an accuracy rate. In Figure 4, the testing accuracy is better than the training accuracy: in essence, the privileged information provides the training process with extra knowledge that helps it identify new, unknown samples better.

Analysis of Results Compared with Other Methods


The purpose of this set of experiments was to assess the ability of the proposed method relative to other methods. The results are shown in Tables 3-6.
These four tables present the best performance of the ELM-LUPI, S-ELM, ELM, and SVM methods in recognition accuracy and execution time. The first column contains the two data sets used as the standard set and the privileged information set: the 'Audio' set represents the audio features, and the 'EOH/LBP/LDN' sets represent the visual features.
The data show that the proposed method is better than the other methods in both stability and recognition rate. Figure 6 compares how the recognition accuracy of the different methods changes with the number of hidden nodes; the proposed method shows better generalization and stability than the others.

Analysis of Improved Results
The purpose of the third type of experiment is to evaluate the performance improvements of the proposed method. Symmetric S-ELM-LUPI inherits the main advantages of the methods it builds on: first, the fast computation of the Extreme Learning Machine; second, the memory savings of the sparse extreme learning machine; and finally, an increased recognition rate due to the use of LUPI.

For the analysis of improvement in recognition accuracy, the proposed method is compared with the results of single-modality experiments.
Figure 6 compares the recognition rate of the proposed method with the corresponding single-modality recognition rate.
Figure 7 shows how the recognition accuracy changes with the number of hidden nodes: the audio features are combined with the EOH features in Figure 7a, with the LBP features in Figure 7b, and with the LDN features in Figure 7c. The visually relevant features (EOH, LBP, and LDN) are used as the standard information; the x-axis represents the number of hidden nodes and the y-axis the recognition accuracy.
When comparing the accuracy of the Symmetric S-ELM-LUPI method with the other four methods, the facial visual information is used as the standard information set and the voice as the privileged information set.
As a result, the proposed method is found to be superior to the single-modality methods on the multimodal emotion recognition problem, and the execution times also differ; Figure 8 shows the corresponding results.
It is clearly seen that this method is faster than the other methods. The dimensionality of the test set is reduced, and the core computation and rapid generalization capabilities of ELM are exploited, so the proposed method can address real-time problems.
The results shown in Figure 8 illustrate that the execution time depends on the size of the test set. Traditional methods use the same dimensionality in the training and test sets; the proposed method effectively reduces this time because the privileged information dimension is small and is used only in training.
Concerning the study of each modality's individual contribution, the recognition is compared from a coupling point of view; the corresponding results are presented in Figure 9, which compares the contribution of the different modalities to each other using the proposed method. The title of each graph gives the standard information source set, i.e., EOH in 9a, LBP in 9b, LDN in 9c, and Audio in 9d. 'Testing Accuracy EOH(LBP/LDN/AUDIO)-AUDIO(EOH/LBP/LDN)' means that the first set is the standard information source and the second is the additional (privileged) information source. In the observed results, using different modalities (audio and visual) gives better results than using multiple feature types from a single modality. Using multiple features helps collect more information on the same data from different points of view, but combining multiple modalities procures more useful information for distinguishing between emotional states. Therefore, the proposed method is better applied to features from different modalities than to multiple features from one same modality.

Conclusions
This paper presents a new multimodal emotion recognition method based on the sparse extreme learning machine and symmetric learning using privileged information. The Symmetric S-ELM-LUPI paradigm has passed the tests performed on the data set. The symmetry referred to in this paper is the symmetry of the method: one modality is regarded as the standard information source while the other is the privileged information source, and each modality can in turn serve as privileged information for the other. It is not symmetry at the simple data level, such as the symmetry between the two sides of the face. The method has been shown to be applicable to multimodal emotion data sets, with a correct recognition rate above 80% at very fast execution speeds.
The experimental results show that the method is well suited to multimodal emotion recognition because of its stability. Multimodal emotion recognition is a real-life problem that requires a very stable prediction method, and the approach in this paper provides a new way of thinking about accurate emotion recognition in real life.

Algorithm 1:
Symmetric S-ELM-LUPI algorithm. Input: Training set X, privileged information set X*, hidden node number L, activation functions g and g*, gamma γ, and kappa κ. Output: The prediction of the approximated function f(x).

Figure 2.
Figure 2. Files split-up and feature extraction operation.


Figure 3.
Figure 3. Representation of the combinations.


Figure 4.
Figure 4. (a) Recognition accuracy and testing time of EOH-AUDIO; (b) recognition accuracy and testing time of LBP-AUDIO; (c) recognition accuracy and testing time of LDN-AUDIO. Test. Acc. EOH (LBP, LDN, or Audio) indicates the accuracy when EOH (LBP, LDN, or Audio) is used as the standard information set; Test. Time EOH (LBP, LDN, or Audio) represents the test time when EOH (LBP, LDN, or Audio) is used as the standard information set.

Figure 6.
Figure 6. The stability of the training accuracy on EOH-Audio.


Figure 7.
Figure 7. (a) Recognition improvement for adding the EOH feature; (b) recognition improvement for adding the LBP feature; (c) recognition improvement for adding the LDN feature.


Figure 8.
Figure 8. (a) Comparison of the EOH-Audio testing time of different methods; (b) comparison of the LBP-Audio execution time of different methods; (c) comparison of the LDN-Audio execution time of different methods.

Figure 9.
Figure 9. (a) Modality contribution comparison of EOH; (b) modality contribution comparison of LBP; (c) modality contribution comparison of LDN; (d) modality contribution comparison of Audio.


Table 1.
List of the final number of features obtained for each category.


Table 3.
Comparison with other methods in 'Training Accuracy'.

Table 4.
Comparison with other methods in 'Testing Accuracy'.

Table 5.
Comparison with other methods in 'Training Time'.

Table 6.
Comparison with other methods in 'Recognition Time'.