A Multimodal Facial Emotion Recognition Framework through the Fusion of Speech with Visible and Infrared Images

Abstract: The exigency of emotion recognition is pushing the envelope for meticulous strategies of discerning actual emotions through the use of superior multimodal techniques. This work presents a multimodal automatic emotion recognition (AER) framework capable of differentiating between expressed emotions with high accuracy. The contribution involves implementing an ensemble-based approach for the AER through the fusion of visible images and infrared (IR) images with speech. The framework is implemented in two layers, where the first layer detects emotions using single modalities while the second layer combines the modalities and classifies emotions. Convolutional Neural Networks (CNN) have been used for feature extraction and classification. A hybrid fusion approach, comprising early (feature-level) and late (decision-level) fusion, was applied to combine the features and the decisions at different stages. The output of the CNN trained with voice samples of the RAVDESS database was combined with the image classifier's output using decision-level fusion to obtain the final decision. An accuracy of 86.36% and similar recall (0.86), precision (0.88), and f-measure (0.87) scores were obtained. A comparison with contemporary work endorsed the competitiveness of the framework, with the rationale for exclusivity being the attainment of this accuracy in wild backgrounds and light-invariant conditions.


Introduction
The emotional state of a person is a gauge of the several manifestations held in one's mind. It is an index for perceiving the intentions and sentiment of a subject, and it assists in self-regulating tasks such as monitoring, automatic feedback, therapies, marketing, automotive safety, and assessments. This field of human-computer interaction (HCI), referred to as Affective Computing, assists in shifting the cognitive load from humans to machines. HCI aims to manifest a human-human-like interaction between humans and machines. Interaction is more innate when the machine is aware of the user's emotional state. It becomes imperative when the machine's decision table looks for responses depending on the user's mood. For example, an autonomous car might take a quiet, longer path when the user's emotions incline toward anger, as a congested road is more likely to aggravate that emotion. An impeccable judgment requires a perfect discernment of emotions. Faces, speech, physiological signals, gestures, and even eye gaze carry manifestations of a person's emotional state. Several works have leveraged these modalities to predict emotions with varied accuracy. A diverse set of mechanisms and techniques has been discovered and applied at junctures to achieve higher accuracy. Paul Ekman, a pioneer in setting the trend of identifying human emotions using Action Units (AU) [1][2][3], identified the six basic emotions that portray the affective states of a person as happiness, anger, surprise, sadness, fear, and disgust [1]. Facial expression was the principal modality for emotion detection, and a diverse stack of modalities later unfolded as corollary modalities.
While uni-modal emotion detection was still blossoming, HCI researchers embarked on the multimodal territory in the quest for a more accurate and natural way of interaction with machines. The combination of modalities showed a promising perspective and was capable of extracting concealed feelings from perceptible sources. Although assorted combinations of modalities were explored, a survey of various techniques and databases described in the following sections implied the need for a robust AER framework that works under natural conditions. Although experiments performed in labs boast high accuracy, many have not been proved to work in natural or in-the-wild conditions. A few works conducted in-the-wild experiments [4][5][6]. However, they were limited by an incomplete understanding of the role of light in the detection of emotions. A gloomy environment may affect facial emotion recognition and sometimes even forbid accurate detection. Additionally, facial expressions might not display the correct psychological state of a person if the two do not concur with each other. For example, an angry person might not always exhibit the corresponding facial expressions, and the emotion could easily be confused with neutral or sad. These issues still needed to be addressed, and a framework capable of discerning emotions even in in-the-wild conditions was necessary. The work presented in this paper confronts those limitations with a novel modality: infrared images.
This paper presents an automatic emotion recognition (AER) framework using the fusion of visible images and infrared images with speech. The work follows an ensemble-based approach by exploiting the best services offered by different techniques. We adhered to certain baselines for the development of the framework. One baseline was to avoid the overwhelming use of sensors, to eliminate any unwanted stress or anxiety induced by monitoring. Further, facial expression was used as the principal modality, while infrared images were used to counter the limitations posed in previous approaches. In addition, speech was used as a supporting modality to refine the classification further. Feature-level and decision-level fusions were applied at different stages to devise a framework for light-invariant AER in in-the-wild conditions. The rest of the paper is organized as shown in Figure 1.

Related Work
Affective computing has improved with time, and the exploitation of multitudinous modalities has underpinned this progress. Identification of emotions with multiple modalities has shown a tidy enhancement over AER systems with unaccompanied modalities. The fusion of different input streams has also been fostered to strengthen the recognition process with the advent of multimodal HCI. This section presents a discussion of related work based on these aspects.

Use of Speech, Visible, and Infrared Images
Facial expressions were one of the first modalities used to indicate the emotional state of a person. The Facial Action Coding System (FACS) was first coined by Paul Ekman [1,3]. Facial expressions were followed closely by speech for emotion detection, and several other relevant input types were contrived for high-precision affect recognition. A common combination of modalities that forms a very natural blend is speech and facial expressions, as this is the closest to human-human interaction [4]. Gestures have been used together with facial and speech data to decipher the affective state of a subject [36][37][38][39][40][41][42].
Infrared images have not been used frequently for AER. Wang et al. used IR images for emotion recognition using a Deep Boltzmann Machine (DBM) and reported an accuracy of 62.9% with the Natural Visible and Infrared Facial Expression (NVIE) DB [43,44]. The authors reported an accuracy of 68.2% after using additional unlabeled data from other databases to improve feature learning performance. Elbarawy et al. made use of the Imaging, Robotics, and Intelligent Systems (IRIS) DB [45] for AER using the Discrete Cosine Transform (DCT) with Local Entropy (LE) and Local Standard Deviation (LSD) features [46]. In another work, K-Nearest Neighbor (kNN) and Support Vector Machines (SVM) were combined to generate a recognition model of facial expressions in thermal images. The highest accuracy of 90% was obtained using the LE features with kNN. Another instance of the use of thermal images for AER involved the use of a DBM to classify emotions with an accuracy of 51.3% [47]. In addition, Basu et al. used thermal facial images and employed moment invariants, histogram statistics, and a multi-class SVM to achieve an accuracy of 87.50% [48]. A smart thermal system to detect emotions using IR images was also developed, with an accuracy of 89.89% [49]. Yoshitomi et al. [50] and Kitazoe et al. [51] performed AER using visible and thermal facial images together with speech to report an accuracy of 85%. The fusion in these works was elementary and based on the maximum of the outputs of the three modalities.
Databases form the propellant of pattern-recognition-based AER. No level of sophistication in machine learning (ML) algorithms can undermine the merit of a comprehensive database. A diverse set of databases has been developed, ranging from multimodal, bi-modal, and uni-modal databases (used in various combinations with other databases) to assist with multimodal AER. Table 1 summarizes the available databases with their attributes.

Emotion Identification
Emotion recognition involves finding patterns in the data and classifying that data into emotional states. The methods of emotion identification are predominantly cleaved into two portions: the methods of recognition and the methods of parameterization [85].
Pattern identification in the modalities serves as a primary source of AER by recognition. In [36], the authors used a Naive Bayes classifier for developing an ensemble-based emotion recognition system and used it to solve the problem of missing data at the fusion stage. Xu et al. used Hidden Markov Models (HMM) and Artificial Neural Networks (ANN) for AER utilizing facial expressions and voice [8]. Bahreini et al. used the WEKA tool to apply hybrid fusion methods for voice feature extraction, classification, and collating the results of 79 available classifiers [7]. The Bayesian classifier has also been used for emotion recognition through face, voice, and body gestures [37,39]. Deep Belief Networks (DBN) have been very effective in deciphering emotions and have been put to use in multiple AER systems. Ranganathan et al. used DBNs and Convolutional Deep Belief Networks (CDBN) for unsupervised classification of 23 emotion classes [38]. DBNs and CNNs have also been used with K-Means and relational auto-encoders for detecting emotions in videos [4]. At the same time, Kim et al. [14] applied four different DBN models using facial and audio data for four emotion classes (angry, happy, neutral, and sad). Nguyen et al. [22] used 3D-CNNs cascaded with DBNs for modeling spatio-temporal information and classifying five emotion categories using facial expressions and speech. Zhang et al. [24] utilized DBNs in a hybrid deep model approach involving CNN and 3D-CNN. CNNs have been pressed into service in different forms (3D-CNN, Deep CNNs (DCNN), Multi-function CNN (MCNN)) and combinations to achieve high AER accuracy [11,12,17,22,32,40,82,[86][87][88][89][90][91][92][93].
Another genre of emotion recognition methods uses facial structure, such as the shape of the mouth, eyes, nose, etc. Overshadowed by the widespread use of pattern-recognition methods, this genre has seen a substantial decline in its application for AER in the recent past. Geometric-based methods involve the standardization of the image data, followed by the detection of the region containing the face and a decision-function-based classification of the emotions by investigating and analyzing the facial components. The Facial Action Coding System (FACS) was devised to exploit the contractions of facial muscles, which were termed Action Units (AU) [1,3]. In [94], an AER model was deployed using an extended Kohonen self-organizing map (KSOM). The model utilized a feature vector containing feature points of facial regions such as the lips, eyes, etc. Another geometric-based method, where candid nodes were placed to track the facial regions of interest, was presented in [95]. The work presented an approach using facial AUs and a Candide model. Several other older works have utilized FAUs for AER [96][97][98][99].

Data Fusion
An integration of different modalities to form a coherent combination of input is termed fusion [100]. Feature level and decision level form the two most widely used forms of fusion [101], while the hybrid approach involves a mix of both. Here we briefly discuss works that used each of these types of fusion.

Feature Level
Feature level fusion, shown in Figure 2a, is also known as early fusion due to its application before classification. Data from facial expressions, voice, and body gestures were combined at the feature level to achieve an accuracy of 78.3% [37,39]. Similarly, multimodal physiological signals were combined to achieve accuracies of 81.45% for SVM, 74.37% for MLP, 57.75% for kNN, and 75.94% for Meta-multiclass (MMC) classifiers [102]. An ensemble approach was implemented in [11], where the feature output of a CNN was fused with the feature output of a ResNet and fed to a Long Short Term Memory (LSTM) network. In one of the studies, facial features were combined with EEG features using different feature-level techniques such as Multiple Feature Concatenation (MFC), Canonical Correlation Analysis (CCA), Kernel CCA (KCCA), and MKL [103]. In other studies, modalities were combined using this method, e.g., pulse activity and ECG data [104]; face and speech [18,19,23,26,33,90]; linguistic and acoustic cues [105]; audio and text [93,[106][107][108]; audio, video, and text [109,110]; facial texture, facial landmark action, and audio [32]; audio, video, and physiological signals [111]; and speech from two different languages [73].

Decision Level
Decision level fusion or late fusion is usually applied after individual decisions have been taken, as shown in Figure 2b. Such a framework was tested for AER to solve the problem of missing data at the fusion stage [36]. Another work used an ANN for fusion of facial expressions and speech [8]. Using late fusion, Kessous et al. evaluated bi-modal combinations of face, speech, and body gestures [37], while Alonso et al. [9] combined facial data and speech for human-robot interaction. Several other works exploited this fusion method to combine different combinations of input modalities, e.g., face, speech, and body gestures [39][40][41], aligned faces and non-aligned faces [5], face, audio, and physiological signals [88], physiological signals [89], face and speech [13,[15][16][17]22,26,27], face and EEG [103], finger pulse activity and ECG data [104], face, speech, and language [112], direct person-independent perspective and relative person-dependent perspectives [113], and facial texture, facial landmark action, and audio [32].

Hybrid Fusion
An integration of feature and decision level fusion in different combinations is termed a hybrid approach, as shown in Figure 2c. Bahreini et al. used a hybrid fusion approach for real-time AER using images and voice [7]. In another study, a hierarchical classification fusion framework was developed that utilized a layer of feature level fusion feeding into decision level fusion [6]. Yin et al. proposed a Multiple-fusion-layer based Ensemble classifier of Stacked Autoencoders (MESAE) for AER using physiological signals such as GSR, EMG, EOG, EEG, and Blood Volume Pressure (BVP) [114]. Another two-stage fusion network, where the first stage employed two DCNNs and the second stage integrated the output of these two DCNNs using a fusion network, achieved an accuracy of 74.32%, compared to 66.17% (audio) and 60.79% (face) [90]. An ensemble of CNN methods was proposed by [91], where the outputs of the CNNs were fused with a probability-based fusion. Tripathi et al. used various deep-learning-based architectures (LSTM, CNN, fully connected MLP) to first obtain the best individual detection (Speech: 55.65%, Text: 64.78%, Motion Capture: 51.11%) and then combined the outputs using an ensemble-based architecture (Speech + Text: 68.40%, Speech + Text + Motion Capture: 71.04%) [115].

Proposed Framework
In this section, we discuss the datasets, their pre-processing, and the specifics of the proposed methodology.

VIRI Dataset
An exiguous number of IR emotional databases is available for use. However, most of them did not serve the requirements of our framework. Consequently, we developed a facial image database of visible and IR images, called the VIRI database. This DB confronts the limitations of the existing IR DBs and presents spontaneous facial expressions in both visible and IR format in uncontrolled wild backgrounds. The DB was created at The University of Toledo and comprises consented pictures of on-campus students. Five expressions were captured (happy, sad, angry, surprised, and neutral) from 110 subjects (70 males and 40 females), resulting in 550 images in a radiometric JPEG format. This format contains both visible and thermal data captured by the FLIR ONE Pro thermal camera. After extracting visible and Multi-Spectral Dynamic Imaging (MSX) thermal images, a total of 1100 images (550 visible and 550 IR images) was obtained. It is a varied DB in terms of the subjects' ages (17-35 years) and ethnicities (White, African-American, and Asian). Images were taken in three formats (visible, infrared, and MSX), and the VIRI DB contains all three.
Data Pre-processing and Image Augmentation: The resolution of each original extracted image was 1440 × 1080 pixels, and some pre-processing was carried out to remove noise and discrepancies from the images. The subjects were not ideally centered, since the images were captured in the wild. A batch process of cropping and overexposure was executed on the images to bring the subjects to the center and reduce the darkness in the pictures. Each image was reduced to 227 × 227 pixels to bring it to a size suitable for CNN training. Figure 3 shows a sample image set from the VIRI DB for all emotion classes. The DB is available for use upon request at the following URL: https://www.yazdan.us/research/repository. For training the CNNs for AER with IR and visible images, the number of available images was not adequate to achieve a respectable accuracy. To proliferate the images and bring more heterogeneity, a series of image augmentation techniques was applied to each image randomly. Approximately 2000 images of each emotion class were obtained after applying a series of rotation, zoom, skew, distortion, shear, and reflection operations. The augmented data were then used to train the CNNs for AER. Table 2 also presents a brief comparison with related popular datasets, showing the advantage of images captured in-the-wild.

RAVDESS Dataset
Speech samples were already available in several databases. We selected The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), a recent audio-visual emotional DB in North American English [118]. Developed at the SMART Lab of Ryerson University, Canada, the DB consists of validated emotional speech and songs. A total of 24 professional actors uttered lexically-matched recitations in a North American accent. The audio portion covers seven emotional states (calm, happy, sad, angry, fearful, surprise, and disgust), and the songs comprise five emotions (calm, happy, sad, angry, and fearful).
The corpus contains 2452 audio recordings at two emotional intensity levels (normal, strong). Two hundred forty-seven individuals provided rating assessments of the DB, and a set of 72 people provided test-retest data. The DB is available for download under a Creative Commons license. It should be noted that the images from this dataset were not used, owing to the lack of thermal/infrared images and to their capture in a controlled environment, whereas we required images captured in-the-wild.
Data Pre-processing: The audio data consists of recordings with a standard unique filename that identifies the traits of each file. We selected 1886 samples that belonged to the five emotion classes our work focused on. However, we faced a few issues during training and found that pre-processing was unavoidable. The length of the files was not uniform and required a batch process of trimming. This was also necessary to eradicate any issues that might arise during the creation of audio spectrograms and the CNN training. Furthermore, during training, some of the files were identified as corrupt and were causing issues in accessing audio vectors. Those files were identified and removed from the training corpus. Finally, the number of samples for each emotion lay in the range of 370-380, totaling 1876 (out of 2452). This balance among the samples for each emotion was also a prerequisite to ensure proper training.
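As an illustration, the trimming and spectrogram-generation steps described above can be sketched in plain numpy (the exact spectrogram parameters were not disclosed in this work; the frame length, hop size, and amplitude threshold below are assumptions):

```python
import numpy as np

def trim_silence(signal, threshold=1e-3):
    """Trim leading/trailing samples below an amplitude threshold,
    mirroring the batch trimming described above."""
    idx = np.where(np.abs(signal) > threshold)[0]
    return signal[idx[0]:idx[-1] + 1] if idx.size else signal

def spectrogram(signal, frame_len=512, hop=128):
    """Log-magnitude spectrogram via a short-time Fourier transform.
    Returns an array of shape (n_frames, frame_len // 2 + 1) that can
    be rendered as a heat-map image for CNN training."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))  # magnitude per frequency bin
    return 20 * np.log10(mag + 1e-10)          # dB scale, avoiding log(0)
```

In practice, each trimmed recording would be converted to such an image-like array and saved as the spectrogram fed to the speech CNN.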

Two-Stage AER Framework
In the proposed method, ANNs were employed for deep learning and ensemble-based classification of the emotions. The model consists of two layers of detection. In the first layer, two CNNs were trained using visible and infrared images individually. Transfer learning was incorporated to extract the features of the images. A feature-level fusion was applied at this stage, and the fused feature vector was then fed to an SVM for classification. In the same layer, a third CNN was deployed to learn the emotions from speech by using audio spectrograms to train the network. The outputs of the SVM and the third CNN (for speech) were fed to a decision-level fusion in the second layer. The final classification was obtained as the output of the second layer. The proposed model is depicted in Figure 4.

Stage Ia: AER with Visible and IR Images
Transfer learning is the process of adapting a pretrained deep learning network by altering some of its layers and fine-tuning it to learn a new function. This process is forgiving, as it tolerates the absence of a large amount of training data and is capable of training a network with even small datasets. Compared to training from scratch, the process is generally expeditious and produces superior networks for the same amount of training data. A typical transfer learning process with a CNN for image classification involves the following steps:
• Choose a pre-trained network.
• Replace the final layers to adapt to the new dataset.
• Tweak the values of the hyperparameters to achieve higher accuracy, and train.
For this work, a pre-trained CNN, AlexNet, was used to perform transfer learning over the VIRI DB. AlexNet has been trained on more than a million images (the ImageNet database [119]) and can classify them into 1000 categories. It is a 25-layer deep CNN that comprises a diverse set of layers for specific operations. For image classification, a mandatory image input layer provides an input image to the CNN after data normalization. The original 1440 × 1080 pixel images in the database were down-sampled to 227 × 227 to match the image size accepted by AlexNet's input layer. A convolutional layer applies sliding convolutional filters to the input from the image input layer and produces a feature map as the output. There are multiple Rectified Linear Unit (ReLU) layers in the CNN. AlexNet was modified for the classification of emotions by replacing the last three layers with new layers appointed to learn from the new database. Five new layers (two fully connected, one activation, one softmax, and one classification layer) were added as the final layers of the network, which resulted in a 27-layer CNN. The features generated at the last fully connected layer (the third-last layer of the CNN) were extracted and used for the feature level fusion.
We also employed pooling layers, whose primary function is to down-sample the input; this reduction helps decrease the computations performed in subsequent layers and reduces over-fitting. Several dropout, fully connected, and channel normalization layers were also used. The final layers include a softmax layer, which applies the softmax function as the output-unit activation after the last fully connected layer, and a classification layer that gives the class probability of the input.
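The two operations just described can be illustrated with a short numpy sketch (illustrative only, not AlexNet's actual pooling configuration):

```python
import numpy as np

def max_pool_2d(x, size=2, stride=2):
    """2x2 max pooling: down-samples a feature map, reducing the
    computation in subsequent layers and helping against over-fitting."""
    h, w = x.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    return np.array([[x[i*stride:i*stride+size, j*stride:j*stride+size].max()
                      for j in range(out_w)] for i in range(out_h)])

def softmax(z):
    """Output-unit activation after the last fully connected layer:
    turns raw scores into class probabilities that sum to 1."""
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()
```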

A feature level fusion technique based on Canonical Correlation Analysis (CCA) [120,121] has been applied to combine the features extracted from the two CNNs in the first layer of the framework. CCA is a multivariate statistical analysis method for the mutual association between two arbitrary vectors. The fused feature vector is more discriminative than either of the individual input vectors. We used the summation method to combine the transformed feature vectors. The fused features from the two CNNs were fed to an SVM classifier for the intermediate classification stage of the AER. Fundamentally, an SVM maps the data as points in space in such a manner that the categories they belong to reside as far away from each other as possible. Hence, when it encounters new data, it maps the data into the same space and predicts the category by finding which side of the boundary the data were mapped to. The SVM classified the fused vector to produce the image-based (visible and IR) classification for the AER.

Stage Ib: AER with Speech
Speech-based emotion recognition was the second major component of the AER framework developed in this work. As shown in Figure 4, emotion recognition through speech was carried out by training a CNN with the spectrograms of the speech data. A spectrogram may be defined as a plot of the spectrum of frequencies of a signal over time. Usually represented as a heat map, spectrograms are a visual representation of the speech signal in which the variation in brightness or color depicts the intensities. The representation of speech waveforms by their spectrograms is shown in Figure 5. A separate 24-layer CNN was put into service for the speech emotion recognition portion. The speech and songs from the RAVDESS DB were used to train this CNN to detect the emotional content with background noise and silences retained, to effectively account for their inevitable presence during testing and validation. The CNN performed the classification of the speech signals, and this formed the decision to be fed into the final decision level fusion alongside the image classification decision.

Stage II: Decision Level Fusion
As the culmination of this work, the classifications from speech and images were combined using a decision level fusion at the second stage of the framework. A late fusion technique termed weighted decision templates (WDT) [122] was applied to integrate the decisions emanating from the SVM (for images) and the CNN (for speech), as shown in Figure 4. The decision template (DT) fusion algorithm computes the DTs per category by taking the average of the decisions produced by every classifier for the training samples belonging to each class. The WDT is an improvement over the conventional DT, as it assigns weights to each classifier based on its performance and output. A more reliable classifier (higher accuracy) is weighted more and contributes more significantly to the decision. We used the weighted-sum fusion rule to evaluate the final probability of a fused result belonging to a specific emotion.
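The template computation and weighted matching can be sketched as follows; the distance measure and weighting scheme here are illustrative simplifications rather than the exact WDT formulation of [122]:

```python
import numpy as np

def decision_templates(probs, labels, n_classes):
    """Per-class decision templates: the mean decision profile of each
    classifier over the training samples of that class.
    probs has shape (n_samples, n_classifiers, n_classes)."""
    return np.stack([probs[labels == c].mean(axis=0)
                     for c in range(n_classes)])

def wdt_predict(sample_probs, templates, weights):
    """Match a sample's decision profile (n_classifiers, n_classes)
    against each class template, weighting each classifier by its
    reliability, and return the best-matching class index."""
    d = (sample_probs[None] - templates) ** 2   # squared differences
    scores = -(d * weights[None, :, None]).sum(axis=(1, 2))
    return int(np.argmax(scores))
```

Here a more accurate classifier receives a larger weight and therefore dominates the similarity score, mirroring the reliability weighting described above.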

Results
The framework was trained over the VIRI and RAVDESS datasets. The results were combined using a feature level fusion and a decision level fusion at different stages to achieve adequate accuracy. This section provides an overview of the results at every stage of the framework and compares the accuracy of individual and fused input modalities. The framework's accuracy improved with each added modality, and the trend is illustrated in the results obtained. The results are presented in the form of confusion matrices and the well-known measures derived from them, i.e., accuracy, precision, recall, and F1-measure.
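For reference, the derivation of these measures from a confusion matrix can be sketched as below (assuming rows hold the true classes, columns the predictions, macro-averaging over classes, and that every class appears at least once in both rows and columns):

```python
import numpy as np

def metrics_from_confusion(C):
    """Accuracy and macro-averaged precision, recall, and F1 from a
    confusion matrix C (rows = true class, columns = predicted class)."""
    tp = np.diag(C).astype(float)
    precision = tp / C.sum(axis=0)   # per predicted class
    recall = tp / C.sum(axis=1)      # per true class
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": tp.sum() / C.sum(),
            "precision": precision.mean(),
            "recall": recall.mean(),
            "f1": f1.mean()}
```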

Visible Images Only
The first layer of the framework was meant for the identification of emotions from individual modalities. The CNN for visible images was able to identify emotions with an overall accuracy of 71.19%. The confusion matrix for the AER with visible images, shown in Figure 6i, indicates that the accuracy for individual emotions ranges from 32.92% (sad) to 92.32% (angry). The treemap for the proportional distribution of the detected emotional categories is presented in Figure 7i. An emotion was correctly classified with little confusion when the area of the detected emotion is large compared to the other emotional categories in the treemap. It can be observed that pronounced emotions such as angry and surprise were detected without much uncertainty. Emotions that were more likely to be confused with other emotions show a lower proportion in terms of area in the treemap; e.g., the sad emotion was heavily confused with the angry, neutral, and surprise emotions when only visible images were used. The recall for the AER by visible images was 0.71, with the precision and f-measure being 0.78 and 0.75, respectively.

IR Images Only
The overall accuracy achieved with the second CNN using IR images was 77.34%. The confusion matrix for the AER with infrared images is shown in Figure 6ii, which shows that the accuracy for individual emotions varied between 63.15% and 94.99%. Clearly, this range is better than what was achieved with visible images; however, the worst-performing emotion for this type of data was still the sad expression. Although the sad emotion was mostly confused with the neutral emotion, this was expected due to the similarities in the facial expressions of these emotions. The proportional distribution of the detected emotions is depicted in the treemap shown in Figure 7ii. The IR images were able to identify the happy emotion with the least confusion. The recall for the AER using the IR images was 0.77, while the precision and f-measure were both 0.78.

Fusion of IR and Visible Images
A feature level fusion of the visible and infrared images produced an improvement over the individual image types. This was observed in the accuracy of the SVM classifier, which was recorded to be 82.26%. The confusion matrix for the fused images is depicted in Figure 6iii. The proportional distribution of the detected emotions after fusion is shown in Figure 7iii. Except for the surprise emotion, all other emotions exhibited improved accuracy compared to the use of the individual types of images. The improved recall, precision, and f-measure for the fused images were 0.82, 0.85, and 0.83, respectively.

Speech Only
Speech-based emotion recognition was the final unit of the first layer of the framework. It was a supporting modality for the facial expressions, but a solitary mention of the metrics for AER by speech alone is also presented. The CNN for speech signals was able to classify emotions in speech with an accuracy of 73.28%. The confusion matrix for speech is shown in Figure 8ii. A proportional distribution treemap for the AER by speech for all emotional classes is shown in Figure 9ii. The area distribution in the treemap for the speech signals indicates a clear detection of the angry and surprise emotions, while the remaining three faced confusion during classification. The recall was found to be 0.73, while both the precision and f-measure were found to be 0.72.

Fusion of IR Images, Visible Images, and Speech
The second layer of the AER framework, which fused the results obtained with images and speech, was able to achieve a noteworthy improvement over the individual or fused (visible and IR) results. The confusion matrix of the emotional classification for the proposed AER framework is shown in Figure 8iii. The treemap in Figure 9iii depicts the accuracy of the AER framework while displaying the ambiguity the framework faces while classifying the emotions. The largest areas belonging to the correctly classified emotions convey the superiority of the ensemble-based framework over the individual modalities. An accuracy of 86.36% was achieved at this step. A recall of 0.86, a precision of 0.88, and an F-measure of 0.87 were achieved. It is noteworthy that the per-emotion accuracy lay between 79.41% and 100.0%.

Discussion
The metrics for the individual modalities and the subsequent fusion of those modalities are presented in Table 3. Certain emotions are harder to comprehend and pose difficulty in recognition; e.g., differentiating between emotions such as sad and neutral is challenging, owing to the similarities in their facial expressions. Minute changes in the facial regions when exhibiting these emotions can also be a rationale for such ambiguity. This is evident from the low classification accuracy with both image types, i.e., visible and infrared. The sad emotion was muddled with the angry and neutral emotions because facial regions such as the shape of the mouth do not alter significantly. Overall, the detected proportions for these emotions in the treemaps (Figure 7i,ii) show a smaller area compared to the other emotions. However, the fused results for these emotions still outperform those of the individual modalities.
In contrast, loud emotions such as surprise, angry, and happy posed less ambiguity in their detection, as is again evident from the treemaps of visible and IR images in Figure 7i,ii. A contrasting observation is the smaller area occupied by the happy emotion in the visible-image treemap. A wide-eyed expression generally accounts for both the angry and surprised emotions; this is evident from the treemap of visible images (Figure 7i), where, to some extent, these two expressions were confused with each other. Another trait of a loud emotional expression is the opening of the mouth, whether in amazement or while laughing in joy, which can be seen as a source of confusion when distinguishing happy from surprise.
The temperature of facial regions turned out to be a rescuer, as expected: the intricacies of detecting subtle emotions were handled well by IR images, and the fusion of the visible and IR images provided better detection accuracy for these emotions. The increase in the area of the anticipated emotion across all emotional categories in the treemap (Figure 7iii) elucidates this. All five expressions, happy, angry, sad, neutral, and surprise, were detected with justifiable accuracy by the mere fusion of IR and visible images. The one exception to this trend is the sad emotion, which was still confused with the affective states of neutral and angry for the reasons mentioned above.
Speech was harnessed to further tackle this ambiguity by resolving the confusion between facially similar expressions. The tone and pitch of speech govern the shape of the audio spectrograms, and a CNN was able to identify emotions from the trends in these spectrograms. Speech alone also did a decent job of classification, with an accuracy of 73.28%, quite similar to that achieved by visible or IR images alone but less than their fusion. When expressing an angry or surprised emotion, the pitch of speech is generally high, and these emotions are less ambiguous to identify. This is evident from the treemap in Figure 9ii, where the surprise and angry emotions cover the majority of the area for their perceived emotions. The affective states of sad and neutral were not classified as sharply because of their similarity to other emotional classes when exhibited. The happy emotion also showed slight confusion with angry because of the evident similarity of being loud while exhibiting these emotions.
Finally, the fusion of speech with the images provided a high-accuracy framework that distinguished emotions with higher credibility: wherever two emotions were confused in one modality, the other modality suitably removed that confusion. The treemap in Figure 9iii shows a clear classification for all the emotions, with surprise classified with the most surety and sad being the least accurately classified. The fusion of images and speech achieved an accuracy of 86.36%. Table 4 presents a comparison of the presented method with recent multimodal emotion detection works.
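The spectrogram representation fed to the speech CNN can be illustrated with a plain short-time Fourier transform. This is a generic sketch on a synthetic tone, not the paper's exact preprocessing; the window size, hop length, and sampling rate are assumptions:

```python
import numpy as np

def log_spectrogram(signal, n_fft=256, hop=128):
    """Naive log-magnitude spectrogram via a Hann-windowed STFT.
    Returns a (n_fft//2 + 1) x n_frames array usable as a CNN input."""
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    spec = np.abs(np.fft.rfft(np.stack(frames), axis=1)).T
    return np.log1p(spec)

# Synthetic 1-second "voice" sample at 8 kHz: a 440 Hz tone.
sr = 8000
t = np.arange(sr) / sr
sig = np.sin(2 * np.pi * 440 * t)
img = log_spectrogram(sig)
```

Pitch shows up as the row (frequency bin) carrying the most energy, which is how tonal differences between, say, angry and sad utterances become visible patterns for the CNN to learn.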

Conclusions and Future Work
This work presents a novel multimodal approach for facial emotion detection through the fusion of visible and IR images with speech. A framework based on an ensemble approach to emotion detection using pattern detection methods is presented. A novel database comprising visible and corresponding IR images was created to tackle inevitable light-invariant conditions. The fusion of visible and infrared images classified the depicted emotions with an accuracy of 82.26%. Speech samples from the RAVDESS multimodal dataset resulted in a detection accuracy of 73.28%. The decisions from the images and speech were fused using decision templates (a decision-level fusion technique) to achieve an overall accuracy of 86.36%. A comparison with recent work in multimodal emotion detection endorses the framework's competitiveness: it detects emotions with accuracy comparable to most contemporary work, and the rationale for its exclusivity is attaining this accuracy in wild backgrounds and light-invariant conditions. The framework could be further improved in various ways, for example, by using images and voice samples from the same subjects, or by incorporating other modalities such as physiological signals to detect emotional intensity and subtle expressions.