Improving Facial Emotion Recognition Using Residual Autoencoder Coupled Affinity-Based Overlapping Reduction

Abstract: Emotion recognition using facial images has been a challenging task in computer vision. Recent advancements in deep learning have helped in achieving better results. Studies have pointed out that multiple facial expressions may be present in facial images of a particular type of emotion. Thus, facial images of one emotion category may be similar to images of other categories, leading to overlapping of classes in feature space. The problem of class overlapping has been studied primarily in the context of imbalanced classes. A few studies have considered imbalanced facial emotion recognition. However, to the authors' best knowledge, no study has examined the effects of overlapped classes on emotion recognition. Motivated by this, in the current study, an affinity-based overlap reduction technique (AFORET) is proposed to deal with the overlapped class problem in facial emotion recognition. Firstly, a residual variational autoencoder (RVA) model is used to transform the facial images into a latent vector form. Next, the proposed AFORET method is applied to these overlapped latent vectors to reduce the overlapping between classes. The proposed method has been validated by training and testing various well-known classifiers and comparing their performance in terms of a well-known set of performance indicators. In addition, the proposed AFORET method is compared with existing overlap reduction techniques, such as the OSM, ν-SVM, and NBU methods. Experimental results show that the proposed AFORET algorithm, when used with the RVA model, substantially boosts classifier performance in predicting human emotion from facial images.


Introduction
Human emotion identification is a growing area in the field of cognitive computing that incorporates facial expressions [1], speech [2], and text [3]. Understanding human feelings is key to the next era of digital evolution. Recent developments have realized the field's potential in areas such as mental health [4], intelligent vehicles [5], and music [6]. Recognizing emotions from facial expressions is a trivial task for the human brain, but it involves a much higher level of complexity when carried out by machines. The reason for this intricacy is the non-verbal nature of the communication enacted through facial cues. Emotion prediction from other data sources, such as text, is a comparatively easier task because of the word-level expressions that can be easily annotated through hashtags or word dictionaries [7][8][9].
Emotion recognition through facial images has been comprehensively studied in the last decade. Studies conducted in recent years mostly focus on the application of deep neural models, largely because of the variance in real-world datasets. In [10], the use of two residual layers (each composed of four convolutional layers, two short-connections, and one skip-connection) with traditional Convolutional Neural Networks (CNNs) achieved an average accuracy of 94.23%. Lin et al. [11] proposed a model utilizing multiple CNNs and an improved fuzzy integral to find the optimal solution among the ensemble of CNNs. Facial emotion recognition has also been utilized in medical applications, particularly in psychiatric domains such as autism and schizophrenia. Sivasangari et al. [12] illustrated an IoT-based approach to understanding patients suffering from Autism Spectrum Disorder (ASD) by integrating facial emotions. Their framework is built to monitor patients and is equipped to propagate information to the patient's well-wishers. The emotion identification module, developed using a Support Vector Machine, is designed to help the caretaker understand the emotional status of the subject. Jiang et al. [13] proposed an approach to identify subjects with ASD by utilizing facial emotions detected using an ensemble model of decision trees. Their approach was found to be 86% accurate in the classification of subjects. One study by Lee et al. [4] performed emotion recognition on 452 subjects (351 patients with schizophrenia and 101 healthy adults). Facial Emotion Recognition Deficit (FERD) is a common deficit found in patients with schizophrenia. In [14], the authors highlighted the drawbacks of FERD screeners and proposed an ML-FERD screener to provide a concrete discrimination between schizophrenia patients and healthy adults.
The ML-FERD framework was built using an Artificial Neural Network (ANN) and trained using 168 images. The approach demonstrated a high True Positive Rate (TPR) and True Negative Rate (TNR). Recent studies have also focused on emotion inspection from videos. Hu et al. [15] concentrated their study on extracting facial components from a video sequence. The authors developed a model that modifies the Motion History Image (MHI) by understanding local facial aspects from a facial sequence. One interesting approach proposed by Gautam and Thangavel [16] trains a CNN with 3000 facial images using iterative optimization, and the model was tested on video of an American prison. The primary interest of the authors was to develop an automated prison surveillance system, and the proposed approach recorded an average accuracy of 93.5% over the video tests. Haddad et al. [17] tried to preserve the temporal aspect of video sequences by using a 3D-CNN architecture and optimized it using a Tree-structured Parzen Estimator. Another approach, called contrastive adversarial learning [18], was recently proposed by Kim and Song to perform person-independent learning by capturing emotional change through adversarial learning. Their approach produced reliable results on video sequence data. Autoencoder networks in emotion recognition have also received attention in recent years [19]. In 2018, two studies [20,21] addressed the problem of computational complexity in deep networks and proposed a Deep Sparse Autoencoder Network (DSAN) to reconstruct the images, integrating it with a softmax classifier capable of sorting seven emotional categories determined from faces. Convolutional autoencoders were found to be useful in continuous emotion recognition from images [22]. One approach using Generative Adversarial Stacked Convolutional Autoencoders was illustrated by Ruiz-Garcia et al. [23] in the context of emotion recognition.
The pose- and illumination-invariant model was found to achieve 99.6% accuracy on a larger image dataset. Sparse autoencoders have also been explored with fuzzy deep neural architectures by Chen et al. [24]. The authors obtained reliable results on three popular datasets by applying a 3-D face model using Candide3. In another recent work by Lakshmi and Ponnusamy [25], the authors used a Support Vector Machine (SVM) with a Deep Stacked Autoencoder (DSAE) to predict emotions from facial expressions. The pre-processing approach proposed by the authors is based on spatial and texture information extracted using a Histogram of Oriented Gradients (HOG) and a Local Binary Pattern (LBP) feature descriptor. Multimodal applications in emotion recognition have also been explored with autoencoders. In [26], the authors developed a novel autoencoder-based framework to integrate visual and audio signals and classified emotions using a two-layered Long Short-Term Memory network. Label distribution learning has been explored in [27,28] for chronological age estimation from human facial images.

Motivation
The class overlapping problem is well known in the research community; however, very few research works have addressed it. The majority of research focuses on the effects of class overlapping in the presence of imbalanced classes. Apart from these, a few domain-specific works have been reported. The class overlapping problem in the context of face recognition was studied in [29]. The proposed method used Fisher's Linear Discriminant to combat majority-biased face recognition; for the case of overlapping classes, a new distance-based technique was proposed. The study also pointed out the challenges various classifiers, such as ANNs, face in learning overlapped classes. Fuzzy rules have been used to address the same problem [30], where both imbalanced and overlapped classes are learned; the fuzzy membership values of data points are used to partition the data points into several fuzzy sets. Batista et al. [31] found that classifiers may find it difficult to learn imbalanced classes in the presence of overlapped classes, especially the minority classes. Similar studies [32,33] have also pointed out this issue, testing the performance of classifiers by varying the degree of overlapping. Another study [34] reported the effect of overlapped classes where the overlapping region is majorly occupied by minority samples; it found that the presence of overlap makes class-biased learning difficult. Later, Garcia et al. [35] studied the problem in detail and recorded the effects of overlapping classes in the presence of class imbalance. It was reported that the imbalance ratio might not be the primary cause behind the dramatic degradation of the classifier, whereas overlapped classes play a vital role. This established the fact that class overlapping is more important to classifier performance than class imbalance. Lee et al.
[36] proposed an overlap-sensitive margin classifier by leveraging fuzzy support vector machines and k-nearest neighbor classifiers. The degree of overlap for each data point is calculated using the KNN classifier and used in a modified objective function to train the fuzzy SVM, splitting the data space into two regions, known as the soft overlap and hard overlap regions. Devi et al. [37] adopted a similar approach, where a ν-SVM was used as a one-class classifier to identify novel data instances in a dataset. However, the explicit detection of data points in the overlapping region is not reported. Neighborhood-based strategies have also been employed to undersample data points in the overlapping region and subsequently remove them to improve classifier performance [38].

Contribution
In the context of emotion recognition, the effect of class overlapping has not been previously addressed. The challenge of overlapped classes arises because studies have revealed [39] that the presence of multiple facial expressions is common in humans. Hence, facial images categorized in a particular class may have close similarity to images of other categories, leading to severe overlapping of classes. To address this problem, in the current study, a residual variational autoencoder (RVA) is used to represent a facial image in latent space. After training the RVA model, only the encoder part transforms the images of all classes to a latent vector form. To overcome the overlapped classes, an affinity-based overlap reduction technique (AFORET) is proposed in the current article. The proposed method reduces the overlapping of classes in latent space. The modified dataset is then used to train a wide range of well-known classifiers, whose performances are tested using well-known performance indicators. A thorough comparative analysis has been conducted to understand how the degree of overlap affects the classifiers' performance. The efficacy of the proposed algorithm has been compared with that of the OSM [36], ν-SVM [37], and Neighborhood Undersampling (NBU) techniques, which have also attempted to address the overlapping problem in general. Overall, the contributions of the current study are as follows:

1. To address the overlapped classes in emotion recognition, an affinity-based class overlapping reduction technique has been proposed.
2. An affinity-based metric is used to identify the data points in overlapping regions. Unlike previous methods [37,38], affinity values of data points provide a better understanding of whether a data point belongs to an overlapping region or not.
3. As is evident from the work described in [36], the removal of data points from the initial dataset is essential to improve classifier performance; hence, a similar approach is adopted in the current study. However, it may be noted that the removal of too many data points from the original dataset may cause the classifier to improperly learn the underlying decision boundary. Thus, extensive analyses have been carried out to clearly understand how much data removal is optimal in the case of facial emotion recognition.
The rest of the article is arranged as follows: Section 2 introduces the residual variational autoencoder model, which is followed by the affinity-based overlap reduction technique in Section 3. Next, in Section 4, these two methods are combined to address the class overlapping problem in facial emotion recognition. Section 5 begins with a discussion of the experimental setup, and then the classifiers and overlap reduction techniques are compared in terms of experimental performance. Finally, conclusions are drawn in Section 6.

Residual Variational Autoencoder
Among various generative models, autoencoders are designed to transform inputs into a low-dimensional latent vector representation and transform them back to their original form. Such networks are trained in unsupervised mode in order to extract the most useful features of the input using unlabeled data [40]. A typical autoencoder consists of two components, viz., an encoder and a decoder. The encoder usually takes an input and gradually reduces its shape through a series of convolutional layers. The output of the encoder is a latent vector, which can be passed to the decoder to reconstruct the original input. Consider a training dataset D = {y_1, y_2, ..., y_N}, where y_i and N represent the input vector of the ith sample and the number of instances, respectively. The encoding layer can be represented as:

z_i = s_e(W_e y_i + b_e),  (1)

where s_e(·), W_e, and b_e represent the activation function, the weight matrix, and the bias vector of the encoding layer, respectively. In the same manner, the decoding layer can be defined as:

ŷ_i = s_d(W_d z_i + b_d),  (2)

where s_d(·), W_d, and b_d denote the activation function, the weight matrix, and the bias vector of the decoding layer, respectively. Hence, the output of the autoencoder for the ith instance can be defined as:

ŷ_i = s_d(W_d s_e(W_e y_i + b_e) + b_d).  (3)

Variational Autoencoders (VAE) have proved to be a major improvement in feature representation capability [41]. VAEs are generative models based on variational Bayes inference [42]; they combine deep neural networks with a regularization of the encoding during training so that the latent space has good properties, enabling instance generation using a probabilistic distribution. The VAE has found many applications in the domains of image synthesis [43], video synthesis [44], and unsupervised learning [45].
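The basic encoder-decoder mapping above can be sketched in a few lines of pure Python; the weights, shapes, and choice of sigmoid activation here are toy values for illustration only, not the paper's architecture:

```python
import math

def dense(x, W, b, act):
    """One fully connected layer: act(W x + b), with W given as a list of rows."""
    return [act(sum(w * v for w, v in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Toy shapes: 3-D input -> 2-D latent -> 3-D reconstruction (weights illustrative).
W_e = [[0.5, -0.2, 0.1], [0.0, 0.3, -0.4]]
b_e = [0.0, 0.1]
W_d = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_d = [0.0, 0.0, 0.0]

y = [0.2, 0.7, -0.1]
z = dense(y, W_e, b_e, sigmoid)        # encoder: z = s_e(W_e y + b_e)
y_hat = dense(z, W_d, b_d, sigmoid)    # decoder: y_hat = s_d(W_d z + b_d)
assert len(z) == 2 and len(y_hat) == 3  # the composition gives the autoencoder output
```

In a real autoencoder the weights are learned by minimizing the reconstruction error between y and ŷ; the sketch only shows the forward pass.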
As described in [46], numerous data points with similar characteristics to the input can be created by sampling different points from the latent space and decoding them for use in downstream tasks. However, a constraint is imposed on learning the latent space to store the latent attribute as a probability distribution in order to generate new high-quality data points.
In the VAE model, the input is generated from a latent variable z as follows:

x = f(z; θ),  (4)

where f is a posterior probability function that uses a deep neural network to perform a non-linear transformation. The exact computation of the posterior p_θ(z|x) in this model is not mathematically feasible. Instead, a distribution q_φ(z|x) [41] is used to approximate the true posterior probability. This inference network q_φ(z|x) is parameterized as a multivariate normal distribution, as shown below:

q_φ(z|x) = N(z; μ_φ(x), σ²_φ(x)I),  (5)

where μ_φ(x) and σ²_φ(x) represent the mean and variance vectors, respectively. In the case of deep networks, increasing depth may lead to the degradation problem [47]: as the depth of the network increases, performance saturates at an unsatisfactory level. Furthermore, in the case of autoencoders, proper reconstruction of the input may not be achieved; thereby, the essential features cannot be captured in the latent vectors. This problem is solved by introducing skip connections (Figure 1). Such residual blocks enable the autoencoder to learn a layer-wise identity relation without incurring the cost of learning any extra parameters. Moreover, the applications of autoencoders have been successfully studied in facial image restoration and emotion recognition. Motivated by this, in the current study, we employ a residual variational autoencoder model to extract the most important features in a latent space. The proposed RVA model architecture is depicted in Figure 2.
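The Gaussian inference network q_φ(z|x) and its sampling step can be illustrated with a minimal pure-Python sketch (a hypothetical 4-dimensional latent vector; this is not the paper's implementation):

```python
import math
import random

def sample_latent(mu, log_var, rng=random):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1).

    mu and log_var are lists representing the encoder outputs
    mu_phi(x) and log(sigma^2_phi(x)) for one input image. Sampling this
    way keeps the stochastic node differentiable w.r.t. mu and log_var.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_divergence(mu, log_var):
    """KL(q_phi(z|x) || N(0, I)) for a diagonal Gaussian posterior,
    the regularization term in the standard VAE objective."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))

# Example: hypothetical encoder outputs for one image.
mu = [0.1, -0.3, 0.0, 0.7]
log_var = [-1.0, -1.0, -2.0, -0.5]
z = sample_latent(mu, log_var)
assert len(z) == 4
assert kl_divergence([0.0] * 4, [0.0] * 4) == 0.0  # posterior equals the prior
```

The KL term is what pushes the latent space toward the well-behaved distribution that makes decoding sampled points meaningful.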

Affinity-Based Overlapping Detection
In the current article, the detection of an overlapping region between different classes has been achieved by using the notion of affinity. Let us assume a labeled dataset D = {(p_1, y_1), (p_2, y_2), ..., (p_m, y_m)}, where the ith data point p_i denotes a point in R^n, and y_i is the label associated with it. We assume that the data points belong to k (k ≥ 2) classes. Hence, for any ith data point, the class label y_i ∈ [1, k]. Data points belonging to a particular class are considered a labeled cluster, and the entire cluster is represented by a cluster representative, calculated by taking the mean of the data points in the cluster using Equation (6):

c_j = (1/n_j) Σ_{p_i ∈ cluster j} p_i,  (6)

where n_j denotes the number of data points in the jth cluster. In the initial dataset, the membership of the data points is crisp. However, such crisp label information does not reveal how close a data point is to its cluster representative. Therefore, we define an affinity score associated with every data point for all class representatives. The affinity score is designed to reflect the confidence of membership of a data point. The affinity score of a data point is calculated using Equation (7):

a_ij = e^(−d_ij² / 2σ²),  (7)

where a_ij represents the affinity between the ith data point and the jth cluster, and d_ij denotes the distance between the same. The scaling parameter σ determines the scale within which a data point is considered close to a class representative. The affinity score between a data point and a class representative is high when they are close and gradually becomes small as they move farther apart. Now, we define a metric β, which is used to decide whether a data point is in an overlapping region or not; it is defined in Equation (8). To elaborate further, a binary classification problem is considered in Figure 3, where C_1 and C_2 represent the class representatives of the two classes, viz., '1' and '2', respectively. The elliptical boundaries denote the class data distributions.
Data points p_1 and p_2 both belong to class '1'; however, p_1 is outside the overlapping region, and p_2 is inside it. The affinity of these data points with respect to both class representatives can be calculated using Equation (7), and subsequently, the β value is calculated using Equation (8). The affinity between p_1 and C_1 is denoted by a_11, and the other affinity values are indicated on the lines joining them in Figure 3.
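As a rough illustration of Equations (6) and (7), the following pure-Python sketch computes class representatives and affinities for a toy 2-D dataset; all names and numbers here are illustrative assumptions, not taken from the paper:

```python
import math

def class_representatives(points, labels, k):
    """Class representative c_j (Equation (6)): the mean of the points in class j."""
    sums = [[0.0, 0.0] for _ in range(k)]
    counts = [0] * k
    for (x, y), lab in zip(points, labels):
        sums[lab][0] += x
        sums[lab][1] += y
        counts[lab] += 1
    return [(sx / c, sy / c) for (sx, sy), c in zip(sums, counts)]

def affinity(p, rep, sigma=1.0):
    """Affinity a_ij (Equation (7)): exp(-d_ij^2 / (2 sigma^2)), where d_ij is
    the Euclidean distance between point i and class representative j."""
    d2 = (p[0] - rep[0]) ** 2 + (p[1] - rep[1]) ** 2
    return math.exp(-d2 / (2.0 * sigma ** 2))

# Toy 2-D data: class 0 near the origin, class 1 near (4, 0), plus one
# class-0 point drifting toward class 1 (an overlap candidate).
points = [(0.0, 0.0), (0.5, 0.0), (4.0, 0.0), (3.5, 0.0), (2.0, 0.0)]
labels = [0, 0, 1, 1, 0]
reps = class_representatives(points, labels, 2)

near = affinity(points[0], reps[0])   # point close to its own representative
mid = affinity(points[4], reps[0])    # drifting point: weaker affinity
assert near > mid                      # affinity decays with distance
```

The drifting point also has a non-negligible affinity to the class-1 representative, which is exactly the situation the β metric of Equation (8) is designed to flag.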

Proposed Method
In Section 3, the preliminary concept of affinity-based overlapping region detection was discussed. In this section, we introduce the overlap reduction method and the overall proposed scheme for emotion recognition. The proposed method is explained in Figure 4. The initial facial emotion dataset is first used to train the Residual Variational Autoencoder (RVA). After training the RVA, only the encoder is used to convert all images to a latent vector form. These latent vectors corresponding to various emotion categories are overlapped. Hence, the affinity score is calculated for all latent vectors using Equation (7), and the corresponding β values are calculated using Equation (8). The β value increases as a data point comes closer to the overlapping region. Therefore, data points having a β value greater than a predefined threshold β_t are removed from the dataset. After that, a set of well-known classifiers is trained with both the overlapped and the overlap-reduced datasets. The performances in both cases are calculated based on the test-phase confusion matrix. The rationale behind using the β value to determine the overlapped region can be conceptualized using Figure 5, which plots the posterior densities of the two classes of a binary classification problem. Data instances of this problem have a single feature only, which is plotted along the horizontal axis. The density of class '1' is plotted in blue, and that of class '2' in red. The posterior densities reveal that all patterns within the range [−1, 3.5] will incur some error in the decision-making process. Furthermore, at the point at which both densities intersect, the data points having a feature value of 1.8 have an equal probability of being in class '1' and class '2'.
In addition, a region around that point in feature space will have data points whose membership to a particular class is uncertain, as the posterior densities indicate that they have an almost equal chance of being a member of either class. Along with the densities, a black dashed plot is depicted in Figure 5. This line plots the β values corresponding to every data point and reveals that the β value increases as the uncertainty about the membership of data points increases. At the intersection of the densities, the corresponding data point achieves the highest β value. Hence, by using a threshold on β, data points with low confidence about their membership can be discarded from the dataset, thereby reducing the overlapping region of the dataset. Figure 6a depicts a similar dataset which has two categories of data instances, plotted using different colored markers. It can be observed that there is a substantial amount of overlapping between the classes. After applying the proposed affinity-based overlap reduction method, the modified dataset is shown in Figure 6b. The class representatives of both classes are marked using red markers. The threshold β_t was set to 0 to obtain this dataset, which has almost no overlap between the classes. The contribution of the affinity score in this process can be further elaborated by the affinity plots depicted in Figure 7a,b. These figures depict the affinity values of individual data points of the dataset (Figure 6a) with respect to class representatives '1' and '2', respectively. Figure 7a reveals that the affinity of a data point with respect to class '1' increases as it comes closer to the representative of class '1'. A similar trend can be observed in the case of class '2' (Figure 7b). Algorithm 1 explains the proposed RVA-model-supported AFORET method. Steps 1 to 9 describe the RVA model training.
In line 10, the trained encoder is used to obtain the overlapped latent vectors. Lines 11 to 14 calculate the affinity of the data points for all classes. Lines 15 to 21 calculate the β values for all data points. Finally, in lines 22 to 26, data points having β values greater than the threshold β_t are removed, and the remaining data points form the final latent vector set.

        end for
 7:     Update the parameters φ, θ using Stochastic Gradient Descent
 9:  until the parameters φ, θ converge
10:  L = P_φ(I)                        ▷ P_φ is the trained encoder; the set of latent vectors is L = {l_1, ..., l_n}
11:  for i ← 1 to n do
12:      a_ij = e^(−d_ij² / 2σ²)       ▷ d_ij is the Euclidean distance between the ith data
13:                                      instance and the jth class representative
14:  end for
15:  for i ← 1 to n do
16:      s_i ← 0
17:      for j ← 1 to k do
18:
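Steps 10 onward of Algorithm 1 can be sketched in pure Python. Since the paper's exact β formula (Equation (8)) is not reproduced in this extract, the sketch substitutes a stand-in β, namely the ratio of the second-largest to the largest affinity, which shares the stated properties (near 1 deep inside the overlap, near 0 where one class clearly dominates); every name and number here is an illustrative assumption:

```python
import math

def affinities(point, reps, sigma=1.0):
    """a_ij = exp(-d_ij^2 / (2 sigma^2)) for every class representative (Equation (7))."""
    out = []
    for r in reps:
        d2 = sum((a - b) ** 2 for a, b in zip(point, r))
        out.append(math.exp(-d2 / (2.0 * sigma ** 2)))
    return out

def beta_stand_in(a):
    """Stand-in for Equation (8): ratio of the second-largest to the largest
    affinity. High where two classes are equally plausible; low where one
    class clearly dominates."""
    top = sorted(a, reverse=True)
    return top[1] / top[0]

def remove_overlap(latents, labels, reps, beta_t, sigma=1.0):
    """Keep only the points whose beta value is at most the threshold beta_t."""
    return [(p, y) for p, y in zip(latents, labels)
            if beta_stand_in(affinities(p, reps, sigma)) <= beta_t]

# Toy latent vectors: two well-separated classes plus one ambiguous point.
reps = [(0.0, 0.0), (5.0, 0.0)]
latents = [(0.2, 0.1), (5.1, -0.2), (2.5, 0.0)]
labels = [0, 1, 0]
kept = remove_overlap(latents, labels, reps, beta_t=0.5)
assert len(kept) == 2  # the midway point is discarded
```

With this construction, lowering β_t removes more of the ambiguous region, mirroring the 5%/10%/15% data-loss settings explored in the experiments.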

Experimental Setup
The proposed affinity-based overlap reduction technique (AFORET), coupled with the initial-stage RVA model, has been tested using the popular AffectNet facial expression dataset [48]. Out of the original 11 categories of facial emotion images in AffectNet, 7 categories of emotions, viz., 'Neutral', 'Happy', 'Sad', 'Surprise', 'Fear', 'Disgust', and 'Anger', have been considered in the current study. As evident from previous studies [49,50], the presence of overlapped classes in the dataset significantly reduces classifier performance in predicting facial emotions. Thus, in the current study, the dataset is first used to train the proposed RVA model. Later, the encoder of the trained RVA model is used to convert the input images to a latent form. The shape of the latent vectors is decided by a separate experiment.
To reduce the overlapped region of the latent vectors, the affinity-based overlap region reduction technique has been applied. The threshold β_t for the study has been decided by conducting an extensive analysis. The performances of the classifiers have been checked for β_t values such that the total amounts of data loss are 5%, 10%, and 15%, in terms of performance indicators such as 'Accuracy', 'Sensitivity', 'Specificity', 'Balanced Accuracy', 'G-mean', 'Area Under Curve' (AUC), and 'Matthews Correlation Coefficient' (MCC). Firstly, the original latent vectors with overlapping have been used to train and test the classifiers. After that, the modified datasets obtained by applying the overlap reduction technique have been used to test the classifiers at data losses of 5%, 10%, and 15%. For all experiments, 10-fold cross-validation has been used.
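The indicators used in these experiments have standard definitions in terms of the binary (one-vs-rest) confusion matrix; a minimal sketch follows (AUC is omitted, as it requires classifier scores rather than the confusion matrix alone, and the counts shown are illustrative, not taken from the paper's results):

```python
import math

def binary_metrics(tp, fn, fp, tn):
    """Standard indicators computed from a binary confusion matrix."""
    sensitivity = tp / (tp + fn)                      # true positive rate
    specificity = tn / (tn + fp)                      # true negative rate
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    balanced_accuracy = 0.5 * (sensitivity + specificity)
    g_mean = math.sqrt(sensitivity * specificity)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "g_mean": g_mean,
        "mcc": mcc,
    }

# Illustrative counts only.
m = binary_metrics(tp=90, fn=10, fp=5, tn=95)
assert m["sensitivity"] == 0.9
assert m["specificity"] == 0.95
```

Balanced accuracy, the G-mean, and the MCC are less sensitive to class imbalance than plain accuracy, which is why they are informative here alongside accuracy.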
In addition, AFORET has been compared with three well-known overlap region reduction techniques, viz., OSM [36], ν-SVM [37], and Neighborhood Undersampling (NBU) [38]. After converting the images to latent vector form using RVA, the aforementioned algorithms have been applied to reduce the overlapping region present in the latent dataset. The modified datasets corresponding to individual algorithms have been employed to train and test the best-performing classifier to compare their performance in terms of all the performance metrics.

Analysis Using Classifiers
The classifiers used in the current study have been trained using the latent vectors obtained from the RVA model. In this section, the performances of the classifiers are compared by training them using modified datasets with varying degrees of data loss. Instead of comparing in terms of varying β_t values, a more interpretable alternative of comparing in terms of the amount of data loss has been adopted to better understand how the performance changes. For this purpose, the AFORET algorithm has been applied to the initial latent vectors, and datasets with 5%, 10%, and 15% data loss have been obtained. For each modified dataset, all the classifiers have been trained. Table 1 depicts the performance of the classifiers in terms of accuracy. For the original dataset with the overlapped region untouched, the performance of the classifiers has been found to be poor. The best performance is achieved by XGBOOST, with an accuracy of 0.61. After modifying the dataset by applying AFORET and removing 5% of the data, it is used to train the classifiers. The performances of all classifiers improve, with an average of 0.94. On the other hand, the dataset with 10% data loss does not improve the average performance beyond 0.94, although KNN reports a further improvement, with an accuracy of 0.98. Next, with 15% data loss, the average performance improves further to 0.95, with an improvement in almost all classifiers. Table 2 reports the performance of all classifiers in terms of sensitivity. It has been observed that the performances of almost all classifiers are unsatisfactory for the original overlapped dataset (column "Overlapped"). The best performance is achieved by XGBOOST, with a sensitivity score of 0.57. On average, the classifiers achieve a sensitivity of 0.45 for the overlapped dataset. Next, the modified dataset with 5% data loss is used to train the classifiers.
Significant improvement can be observed in the performance in terms of sensitivity. The best performance is achieved by the NB classifier, with a sensitivity of 0.95, whereas the average performance improves to 0.92. For 10% data loss, the performance improves further, with an average sensitivity of 0.94. Finally, after removing 15% of the data, the average performance improves to 0.95, slightly better than at 10% data loss, although a few classifiers' performance decreases. Table 3 reports the performance of the classifiers in terms of specificity. The classifiers' performance has been unsatisfactory for the original latent vectors with overlapped regions, where the average specificity has been recorded as 0.45. However, after applying AFORET, the performance of the classifiers gradually improves, from an average of 0.94 at 5% data loss to 0.96 at 15% data loss. The performance of all individual classifiers reflects a similar trend of improvement. Among all classifiers, XGBOOST performs best, with a specificity of 0.99. In Table 4, the performances of the classifiers are compared in terms of balanced accuracy. As observed earlier, the performance of the classifiers on the overlapped dataset is unsatisfactory; on average, they achieve a balanced accuracy of 0.45. However, once AFORET is applied and the initial dataset is modified, the performance improves. At 5% data loss, the average performance is 0.93; further data loss of 10% and 15% improves it to 0.95 and 0.96, respectively. Table 5 tabulates the performance in terms of the G-mean. This performance metric reflects the combined effect of sensitivity and specificity; hence, a similar trend of performance has been recorded. XGBOOST remains the best performer on the overlapped dataset; however, KNN is found to be the best after applying AFORET.
This indicates that the latent space embedding produced by the proposed RVA model is efficient enough that local information could be sufficient to distinguish between different emotions.
Tables 6 and 7 report the performance of the classifiers in terms of the AUC and MCC scores. These two performance metrics reveal that the performances of the classifiers at 10% data loss are only slightly improved by moving to 15% data loss. Thus, in order to minimize the amount of data loss while achieving the best classification performance, 10% data loss is sufficient and optimal. It can be further noted from Tables 1-7 that the performances of the classifiers on the original overlapped dataset are significantly lower than their performances when the dataset is processed with AFORET. This reveals that, in the original latent vector form of the dataset, all classes are highly overlapped with each other. After reducing the overlapping region by even 5%, the performance of the classifiers improves significantly. Table 8 reports the accuracy scores of the individual classes for all classifiers. The classifiers are trained with the overlapped dataset, and the test-phase performance in terms of accuracy is measured. The same experiment is then repeated with the overlap-reduced dataset. Previous experiments have already revealed that a 10% data reduction is sufficient to alleviate the overlapped class problem; hence, AFORET with 10% data loss has been considered for this class-wise comparison. Table 8 reveals that, for all seven emotion categories considered in the current study, AFORET significantly improves classifier performance in detecting individual categories.

Comparative Study of Overlap Reduction Methods
In Section 5.2, various classifiers have been compared in terms of several performance indicators to assess the efficacy of the proposed AFORET method in mitigating overlapped classes. In the current section, the proposed AFORET is compared with three well-known overlap removal techniques, viz., OSM [36], ν-SVM [37], and neighborhood undersampling (NBU) [38]. It has been observed in Section 5.2 that the performance of the majority of classifiers is close to one another. Hence, in this section, the overlapped latent vectors are processed with each overlap reduction/removal technique, and the modified dataset is then used to train and test all the previously used classifiers. The performance of the classifiers in terms of all performance indicators has been recorded. In order to compare the algorithms fairly, the data loss in all methods has been restricted to 10% of the original set. Table 9 reports the performance of the overlap removal algorithms in terms of the various performance metrics for all classifiers. In terms of accuracy, almost all classifiers achieve their best results with the proposed AFORET method, although the LR, KNN, and MLP classifiers perform equally well with NBU. Next, a sensitivity-based comparison reveals that AFORET remains the best for all classifiers except KNN; in addition, ν-SVM performs equally well for NB, RF, and KNN. However, the average performance of the classifiers remains best for AFORET only. The performance analysis for specificity reveals a similar trend. In the case of balanced accuracy, OSM and NBU perform equally well across all classifiers, whereas the performance of ν-SVM and the proposed method is close for a few classifiers. However, the average performance of AFORET, at 0.95, is significantly better than that of ν-SVM.
The G-mean, AUC, and MCC reveal a similar trend of performance. In terms of all performance metrics, the average performance of OSM is almost the same as that of NBU, whereas for a few classifiers ν-SVM performs on par with the proposed AFORET. However, averaged over all classifiers, the performance of the proposed AFORET remains better than that of all the remaining methods. This extensive comparative analysis with existing overlap removal techniques establishes that the proposed AFORET-based method for reducing the overlap between classes significantly improves the performance of classifiers in detecting human emotions based on the RVA model.
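To make the nature of the compared baselines concrete, the following sketch shows a generic neighbourhood-based undersampling step in the spirit of NBU [38]: a latent vector is discarded when most of its k nearest neighbours carry a different label, i.e. when it sits inside an overlapped region. The function name, the agreement threshold, and the brute-force distance computation are all illustrative assumptions, not the published NBU or AFORET procedure.

```python
import numpy as np

def neighborhood_undersample(X, y, k=5, min_agreement=0.5):
    """Drop samples whose k nearest neighbours mostly belong to other
    classes, i.e. samples lying in an overlapped region of the latent
    space. Generic sketch only; parameters are illustrative assumptions."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    keep = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)   # Euclidean distances
        d[i] = np.inf                          # exclude the point itself
        nn = np.argsort(d)[:k]                 # k nearest neighbours
        agreement = np.mean(y[nn] == y[i])     # same-class fraction
        if agreement >= min_agreement:
            keep.append(i)
    return X[keep], y[keep]

# Two well-separated clusters plus one class-0 point intruding into the
# class-1 cluster (hypothetical latent vectors):
X = [[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1],
     [5, 5], [5.1, 5], [5, 5.1], [5.1, 5.1],
     [5.05, 5.05]]
y = [0, 0, 0, 0, 1, 1, 1, 1, 0]
Xr, yr = neighborhood_undersample(X, y, k=3)
# Only the intruding point is removed; both clean clusters survive.
```

The trade-off this illustrates is the same one measured in Table 9: removing overlapped points cleans the decision boundary, but every removal is data loss, which is why the comparison fixes the loss at 10% for all four methods.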

Conclusions
The current article has proposed a novel overlap reduction technique to improve classification performance in emotion recognition using facial images. The class overlapping problem in facial emotion detection has been addressed by an affinity-based overlap reduction technique (AFORET), which shrinks the overlapping region so that the performance of classifiers in emotion recognition can be improved. AFORET has been tested for various degrees of data loss, from 5% up to 15%. The original facial image dataset is first transformed into a latent vector form to capture the features most important for the classification task; these latent vectors are then modified using AFORET to reduce the overlapping region. After the overlapping region is reduced, a set of well-known classifiers has been trained and tested to establish the efficacy of the proposed model. Experimental results have revealed that a 10% data loss under AFORET sufficiently reduces the overlap regions and improves classifier performance, and that any data loss beyond 10% does not improve classifier performance further. In addition, a comparative analysis with existing overlap removal techniques, viz., OSM, ν-SVM, and NBU, has been conducted; it revealed that the proposed AFORET outperforms all the other methods in addressing the class overlapping problem in facial emotion recognition. Overall, the proposed RVA model combined with AFORET has been able to significantly improve classification performance.