Feature Selection for Facial Emotion Recognition Using Cosine Similarity-Based Harmony Search Algorithm

Abstract: Nowadays, researchers aim to enhance man-to-machine interactions by making advancements in several domains. Facial emotion recognition (FER) is one such domain in which researchers have made significant progress. Features for FER can be extracted using several popular methods. However, there may be some redundant/irrelevant features in the feature sets. In order to remove those redundant/irrelevant features that do not have any significant impact on the classification process, we propose a feature selection (FS) technique called the supervised filter harmony search algorithm (SFHSA), based on cosine similarity and minimal-redundancy maximal-relevance (mRMR). Cosine similarity aims to remove similar features from the feature vectors, whereas mRMR is used to determine the feasibility of candidate feature subsets using Pearson's correlation coefficient (PCC), which favors features that have lower correlation values with other features as well as higher correlation values with the facial expression classes. The algorithm was evaluated on two benchmark FER datasets, namely the Radboud faces database (RaFD) and the Japanese female facial expression (JAFFE) database. Five different state-of-the-art feature descriptors, including uniform local binary pattern (uLBP), horizontal-vertical neighborhood local binary pattern (hvnLBP), Gabor filters, histogram of oriented gradients (HOG) and pyramidal HOG (PHOG), were considered for FS. The obtained results signify that our technique effectively optimized the feature vectors and made notable improvements in overall classification accuracy.


Introduction
In recent times, due to the rapid growth of technology, human-computer interaction has begun to gain attention in the research domain. Expression of emotions through facial expressions is an important aspect of human communication that serves social interaction. In the work described by Shan et al. in [1], researchers have validated seven universal facial expressions, namely disgust, anger, happiness, sadness, neutral, fear and surprise. According to the facial action coding system [2], various positions of the muscles on the face are responsible for facial expressions, as explained by Ekman and Rosenberg [3]. The total number of action units in the facial action coding system is found to be 46. These action units are associated with a defined set of muscle movements [4]. The main use of facial emotion recognition (FER) [5] lies in determining the emotional state of a person, which further helps in identifying mental states and mental disorders. Besides, FER also contributes to video indexing and data-driven animation [6]. An FER system is capable of making human-robot communication much more meaningful and effective. There are various challenges involved in FER. Predominantly, the camera angle is one of the key issues in the data collection part. A wrong camera angle can hamper recognition accuracy, as it can obscure the small muscle movements. Besides, there are many types of noise that also affect the minute muscle movements in the eyes, nose or other parts of the face.
For a robust FER system, the primary need is to design a competent feature vector. As the features are extracted from facial images showing different human emotions, in many cases irrelevant or redundant features are generated. This ultimately stretches the dimension of the feature set and brings down the overall prediction accuracy. Feature selection (FS), an initiative to optimize feature dimensions undertaken by various researchers, aims at removing redundant/irrelevant features that do not make any significant contribution to the overall prediction process [7,8]. FS is a useful way to substantially reduce the size of the original feature vectors used to predict the target facial emotions expressed by humans. This not only reduces the computation time required to make the prediction, but also improves the final accuracy by removing redundant and irrelevant features. Redundant and noisy features lead to misclassification and higher memory requirements, which, in turn, significantly increase model building time [9]. FS requires choosing the best feature subset from 2^N feature subset combinations, where N is the size of the original feature set [10]. Some of the techniques of FS may prove to be very costly. As a result, in many cases, random search and heuristic search techniques are employed. In [11], an algorithm for unsupervised FS using a feature similarity measure, called the maximum information compression index, was elucidated. Another FS technique was proposed in [12], which uses sequential search methods characterized by dynamically changing the number of features, i.e., "floating" methods. The works described in [13][14][15][16] are some other established applications of FS in solving problems like handwriting recognition, handwritten script classification [17] and handwritten numeral classification. Another instance of FS, called the histogram-based fuzzy ensemble technique, was applied to UCI datasets for evaluation in [18].
FS algorithms are segregated into three categories [19]. The filter method applies a statistical or probabilistic approach to assign a score to each feature and, based on this score, features are either kept or removed from the original feature set. The wrapper method is often used in conjunction with a machine learning algorithm (which works as a classifier), where the learning algorithm plays the part of the feature validation process and selects an optimal feature subset which enhances the classification accuracy. The hybrid method tends to perform computationally better than the wrapper method. Both wrapper and hybrid methods make selections based on a classifier that may or may not work well with any other classifier. That is because the optimal feature subset is built when the classifier is built, and the selection depends on the hypotheses which the classifier makes. In our work, we have focused mainly on the supervised filter-based FS approach [20], where the feasibility check for a feature subset is based on a popular statistical measure called Pearson's correlation coefficient (PCC) [21] and on minimal-redundancy maximal-relevance (mRMR) [22], which takes into account the relation of features with other features as well as with the classes. Our adopted technique applies the harmony search algorithm (HSA) in a supervised filter setting, termed the supervised filter harmony search algorithm (SFHSA), in order to perform FS. Keeping the fundamentals of the algorithm intact, we have modified the pitch adjustment procedure, where we have incorporated the concept of cosine similarity for adjusting the values of the variables. The details of the implementation are elucidated in Section 4. Some applications of HSA to FS were discussed in [23][24][25][26][27][28][29].
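As an illustration of how such a filter criterion can be computed, the sketch below scores a candidate feature subset by an mRMR-style difference of PCC-based relevance and redundancy. This is a minimal illustration under our own naming and equal weighting, not the paper's exact objective:

```python
import numpy as np

def pcc(a, b):
    # Pearson's correlation coefficient between two 1-D arrays
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def mrmr_score(X, y, subset):
    # Relevance: mean |PCC(feature, labels)| over the subset.
    # Redundancy: mean |PCC| between feature pairs inside the subset.
    rel = np.mean([abs(pcc(X[:, j], y)) for j in subset])
    if len(subset) < 2:
        return rel
    red = np.mean([abs(pcc(X[:, j], X[:, k]))
                   for i, j in enumerate(subset) for k in subset[i + 1:]])
    return rel - red  # higher is better

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 100).astype(float)
X = np.c_[y + 0.1 * rng.standard_normal(100),   # relevant
          y + 0.1 * rng.standard_normal(100),   # relevant but redundant
          rng.standard_normal(100)]             # irrelevant
print(mrmr_score(X, y, [0]) > mrmr_score(X, y, [2]))  # relevant beats irrelevant
```

A wrapper method would instead train a classifier on each candidate subset, which is far more expensive; the filter score above needs only correlation computations.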
The proposed HSA-based FS was applied on five different state-of-the-art feature descriptors that include uniform local binary pattern (uLBP), horizontal-vertical neighborhood local binary pattern (hvnLBP), Gabor filters, histogram of oriented gradients (HOG) and pyramidal HOG (PHOG). The method was tested on two popular FER datasets, namely the Japanese female facial expression (JAFFE) database [30] and the Radboud faces database (RaFD) [31].
The flow of the paper is arranged as follows: Section 2 describes the motivation behind the present work along with some state-of-the-art methods related to FER. Section 3 presents a detailed description of the datasets used and briefly explains the five feature descriptors. The proposed optimal feature subset selection method, called SFHSA, is detailed in Section 4. The performance evaluation of SFHSA is reported in Section 5. Section 6 finally concludes the paper. The supporting codes are uploaded at the GitHub link: https://github.com/Soumyajit-Saha/Cosine-Similarity-based-Harmony-Search-Algorithm.

Motivation and Related Work
High-dimensional datasets often impose computational overheads, especially in cases where achieving a near-perfect classification accuracy score is the primary concern. Hence, researchers focus mainly on reducing the dimensions of feature sets by filtering out irrelevant features that do not have a significant impact on the classification process. The overheads imposed by high-dimensional feature sets mainly include time and space overheads [9,10,23]. Nowadays, especially in the fields of bioinformatics, medicine and genetics [24], where features or genes are denoted in the form of microarray data of humongous dimension, FS is a necessity for countering the underlying challenges of space and execution time. Many efficient algorithms have been established for this purpose. With an objective to serve the same purpose in the FER domain, we have established an FS algorithm to increase the efficiency of the machine learning algorithm while simultaneously reducing the dimension of the feature set. We developed a novel HSA-based FS technique, SFHSA, applied it to FER datasets and demonstrated a clear comparison with other established FS techniques.
In FER, the features are obtained from datasets having a large dimension. However, there are many redundant features present within the feature sets that bring down the classification accuracy. To eradicate those redundant and irrelevant features as well as make the predictions more accurate, we have taken the initiative to apply our proposed FS technique to FER datasets. The features extracted from facial expression images include uLBP, hvnLBP, Gabor filters, HOG and PHOG. There are few FS works found in the literature on FER. In [25], a wrapper-based approach to FS using a genetic algorithm (GA) was applied to features for FER obtained using log-Gabor filters. A linear programming technique was used for FS in [26], where the features for FER were extracted using Gabor filters. An FS method based on random forest classifiers was incorporated in the FER system in [27], where visual appearance-based features were extracted from a Gabor filter bank. For the FS, the mRMR method based on the mutual information (MI) quotient was used. Li et al. in [28] used a combination of fixed filters and trainable nonlinear 2-D filters based on the biological mechanism of shunting inhibition. Finally, FS was performed using MI and class separation scores. The work described in [29] proposes automatic FER where features were generated using methods like Gabor filters, log-Gabor filters, the LBP operator, higher-order local autocorrelation and higher-order local autocorrelation-like features. A self-learning attribute reduction algorithm based on rough sets and domain-oriented data-driven data mining was proposed for FS in [32]. The authors in [33] describe an efficient FS technique, the late-hill-climbing-based memetic algorithm (LHCMA), on feature sets for FER, which indeed outperforms many previously established FS algorithms.
A facial expression recognition system with hvnLBP-based feature extraction and a micro genetic algorithm (GA) along with particle swarm optimization (PSO)-based feature optimization was proposed in [34] by Mistry et al. An FER method based on a wrapper-based FS technique called the multi-objective differential evolution algorithm was proposed in [35]. The evaluation was done using support vector machine (SVM) classifiers. We have seen that HSA has provided satisfactory results in the case of FS for holistic Bangla word recognition [36], digit classification [37], email classification [38], epileptic seizure detection [39] and protein sequence classification [40]. However, to the best of our knowledge, to date, the HSA-based FS technique has not been applied to FER systems. This has served as a motivation for us to develop SFHSA for FER systems.

Dataset and Feature Description
This section is divided into two subsections: dataset description and feature extraction. The first subsection deals with the dataset and preprocessing information, whereas the next subsection puts forward a brief description of the feature descriptors used here.

Dataset Description
Two popular benchmark FER datasets were included in the present work, namely JAFFE and RaFD. Details of those databases and their preprocessing algorithms are demonstrated in the following sections.

JAFFE
The JAFFE dataset [30] includes facial expressions of 10 different Japanese females. It consists of 7 basic facial expressions: surprise, angry, happy, disgusted, fearful, sad and neutral. In total, there were 213 images, which results in an unequal number of samples per class. Hence, data augmentation [41] was performed to handle this issue. The augmentation was achieved by adding Gaussian white noise of constant variance and mean to 11 sample images. In the end, the dataset contained a total of 224 (= 32 × 7) images, i.e., 32 images per class. Sample images from the JAFFE dataset are shown in Figure 1.
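The noise-based augmentation described above can be sketched as follows; the noise parameters and image size here are illustrative, not the exact values used in the paper:

```python
import numpy as np

def augment_with_noise(img, sigma=10.0, mean=0.0, seed=0):
    # Add Gaussian white noise of constant mean/variance, clip to valid range
    rng = np.random.default_rng(seed)
    noisy = img.astype(float) + rng.normal(mean, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

img = np.full((48, 48), 128, dtype=np.uint8)  # stand-in grayscale face image
aug = augment_with_noise(img)
print(aug.shape, aug.dtype)
```

Each noisy copy keeps the class label of its source image, which is how the 213 images were topped up to 32 samples per class.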

RaFD
The RaFD facial emotion dataset [31] was constructed from 67 models (consisting of Caucasian males and females, Moroccan Dutch males and Caucasian boys and girls). It consists of 8 different expression classes: disgust, fear, happy, neutral, contempt, surprise, sadness and anger. While capturing each facial expression, five different camera angles and three different gaze directions (frontal, left and right) were used. In this dataset, each class contains 201 images. Figure 2 shows sample images taken from the RaFD dataset.

Facial emotion images generally suffer from various extraneous noises, as they are captured under different environmental conditions. These affect the feature extraction process, so a suitable preprocessing technique was required to overcome this problem. Hence, the Viola-Jones algorithm [42] was applied to focus attention on the important region within the whole image. The important region, or region of interest, mainly comprises the areas containing facial expressions, such as the lips, eyebrows, nose, eyes, etc. The Viola-Jones algorithm returns the coordinates of the bounding box that contains the region of interest. To make the method more robust and comparable with real-world scenarios, the technique was assessed with various image dimensions. In the present work, facial images of three different dimensions, 32 × 32, 48 × 48 and 64 × 64, were considered for the evaluation. After preprocessing, the facial images were resized to the corresponding resolutions and then used for feature extraction. Finally, the extracted features were fed to the proposed feature selection algorithm. The use of three image sizes takes into account the variation in image quality in practical emotion recognition applications.
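The crop-and-resize stage of this preprocessing can be sketched as below. The bounding box is assumed to come from a Viola-Jones detector (e.g., OpenCV's cascade classifier); here a hypothetical box is used, and the resize is a simple nearest-neighbour one so the sketch stays dependency-free:

```python
import numpy as np

def crop_and_resize(img, box, size):
    # box = (x, y, w, h), as a Viola-Jones face detector would return it
    x, y, w, h = box
    face = img[y:y + h, x:x + w]
    # nearest-neighbour resize to size x size
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return face[np.ix_(rows, cols)]

img = (np.arange(100 * 100) % 256).astype(np.uint8).reshape(100, 100)
for size in (32, 48, 64):          # the three resolutions used in the paper
    out = crop_and_resize(img, (10, 20, 60, 70), size)
    print(out.shape)
```

In practice a bilinear or bicubic resize would be preferred; the nearest-neighbour variant is only meant to show the data flow from detector box to fixed-size input.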

Feature Description
In this section, all the feature extraction methods are explained briefly. In the present work, we considered HOG, PHOG, uLBP, hvnLBP and Gabor filter-based feature extraction methodologies.

Histogram of Oriented Gradients
HOG [43] is a texture-based feature descriptor that adopts the histogram of gradients as a statistical measure. The primary idea behind the concept is that any local shape and object can be described using the gradient intensity distribution or edge direction. As HOG is invariant to geometric transformation, it is widely used in the pattern recognition domain. In the computation part, first of all, the entire facial image was divided into cells. After that, the gradients were calculated according to Equation (1). In this work, one-dimensional gradients were taken. A matrix M = [1 0 −1] was used to calculate the gradient in the X direction, denoted by Grad_X. Subsequently, M^T was adopted for the gradient in the Y direction, denoted by Grad_Y. The matrix M was used as a mask which was passed over the entire image, the variable here being the pixel intensity. The gradient was computed by taking the intensity difference of the neighboring pixels in a particular direction: for Grad_X, the intensity difference along horizontal neighboring pixels was considered, and similarly, for Grad_Y, the intensity difference along vertical neighboring pixels was considered. The final gradient direction Grad_Dir = tan^(−1)(Grad_Y/Grad_X) was calculated accordingly. Next, the entire gradient direction domain (lying between 0° and 360°) was divided into 8 histogram bins. For every cell, the bin count of each histogram bin was obtained: the bin count is incremented if the value of the gradient direction falls in the range of that particular bin. This way, we obtain the histogram of the gradients for each cell. Finally, the histograms obtained from all cells were concatenated to get the final feature vector. For the extraction of the HOG features, three image dimensions were considered, and the HOG descriptor was applied to each facial emotion image to obtain the final feature vectors.
The feature size of HOG depends on the image size and the cell dimension. The feature dimensions for 32 × 32, 48 × 48 and 64 × 64 images are 324, 900 and 1764, respectively.
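The per-cell gradient binning described above can be sketched as follows, using the [1 0 −1] masks and 8 direction bins over [0°, 360°); unlike full HOG, this sketch counts directions without magnitude weighting:

```python
import numpy as np

def hog_cell_histogram(cell, bins=8):
    # 1-D [1, 0, -1] gradients in x and y (borders left as zero)
    gx = np.zeros_like(cell, dtype=float)
    gy = np.zeros_like(cell, dtype=float)
    gx[:, 1:-1] = cell[:, 2:].astype(float) - cell[:, :-2]
    gy[1:-1, :] = cell[2:, :].astype(float) - cell[:-2, :]
    direction = np.degrees(np.arctan2(gy, gx)) % 360  # direction in [0, 360)
    hist, _ = np.histogram(direction, bins=bins, range=(0, 360))
    return hist

cell = np.tile(np.arange(8, dtype=float), (8, 1))  # horizontal intensity ramp
print(hog_cell_histogram(cell))
```

For a pure horizontal ramp, every gradient points along the X axis, so all counts land in the first bin; concatenating such histograms over all cells yields the final HOG vector.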

Pyramidal HOG
Researchers have mainly relied on PHOG feature descriptors [44,45] for object recognition. The spatial pyramid representation of HOG is computed here. It mainly captures the local shape and retains the spatial information by segregating the image into various levels, applying the HOG descriptor at every level. In this work, the Canny edge detector [46] was applied to obtain the contour of each region; the edge detector was mainly used to capture the local shapes. An arrangement of finer spatial grids was formed by doubling the number of divisions in each axis direction for every region. For each resolution level L = l, the grid consists of 2^l cells along each dimension. For example, at L = 2, the number of grid cells along the X axis is 4, so the total number of grid cells becomes 4 × 4 = 16. Further, a Sobel mask [47] of window size 3 × 3 was applied for extracting the orientation of the gradients along the edge contours. At this stage, gradient binning was performed similarly to the HOG descriptor. The gradients corresponding to the same cell were quantized and combined into N histogram bins; the bin count is incremented if the gradient direction value lies in the range of that particular bin. Mainly, the orientation binning was performed using either the [0° to 360°] range or the [0° to 180°] range, where the contrast sign was neglected [36]. These bins were sorted and concatenated into a single sequence corresponding to the same level, yielding one histogram per level. Finally, all the histograms corresponding to the levels were merged to get the final feature vector. In the present work, we have used the PHOG descriptor to capture facial features from the images.
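Since each level l contributes a 2^l × 2^l grid of cells with N bins per cell, the PHOG feature length follows directly by summation; a small sketch (the bin and level counts are illustrative):

```python
def phog_dimension(levels, bins):
    # Level l has a 2**l x 2**l grid; each cell contributes `bins` histogram bins
    return sum(bins * (2 ** l) ** 2 for l in range(levels + 1))

for l in range(4):
    print(l, (2 ** l) ** 2)   # cells per level: 1, 4, 16, 64
print(phog_dimension(2, 8))   # 8 * (1 + 4 + 16) = 168
```

This quadratic growth per level is why PHOG vectors become long quickly, and why FS over them is worthwhile.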

Gabor Filter
In this work, we have also used the Gabor filter, a well-known frequency-based feature descriptor [48,49]. It is a linear filter that is mainly applied for texture analysis. The Gabor filter analyzes the presence of a specific frequency content in a specific direction within a localized region around the point or region of interest in the image. It is invariant to rotation, translation and scale. Besides, it is also robust towards photometric distortions, which mainly occur as illumination changes and image noise [37]. In the spatial domain, the two-dimensional Gabor filter is a Gaussian kernel function modulated by a sinusoidal plane wave [50]. The equations of the kernel function for calculating Gabor filter-based features in the spatial domain are given in Equations (2)-(4).
Here, the standard deviation of the Gaussian envelope is expressed as σ. γ is the spatial aspect ratio, which specifies the ellipticity of the support of the Gabor function, and i denotes the imaginary unit. The phase offset is specified as ω, f stands for the sinusoid frequency and θ symbolizes the orientation of the normal to the parallel stripes of the Gabor function. Eight different orientations and five distinct scales were taken in the Gabor model, resulting in 40 diverse Gabor filters.
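Constructing such a 40-filter bank can be sketched as below. The kernel follows the common Gaussian-envelope-times-complex-sinusoid form; the parameter names and the particular σ and f values are ours, not those of Equations (2)-(4):

```python
import numpy as np

def gabor_kernel(size, sigma, theta, f, gamma=0.5, psi=0.0):
    # Complex 2-D Gabor kernel: Gaussian envelope times a complex sinusoid
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)    # rotate coordinates by theta
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
    return envelope * np.exp(1j * (2 * np.pi * f * xr + psi))

# 5 scales x 8 orientations = 40 filters, as in the text
bank = [gabor_kernel(31, sigma=2.0 ** s, theta=o * np.pi / 8, f=0.1)
        for s in range(5) for o in range(8)]
print(len(bank), bank[0].shape)
```

Each facial image is then convolved with every kernel in the bank, and the magnitude responses form the Gabor feature maps.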
Multiple spatial resolutions and orientations were measured from the set of 2-D Gabor filter banks. Then, these were used to convolve each facial image sample. Let us consider a sample facial image img(x, y), whose corresponding Gabor filter kernel is ∆_u,v(x, y). The characterization of the output image Out_u,v(x, y) is given in Equation (5) [39].

Uniform Local Binary Pattern
LBP, which was first introduced by Ojala et al. [51], is a useful texture-based feature. It mainly captures edge properties by taking the intensity differences of a center pixel with its surrounding pixels. In this work, we considered a window of size 3 × 3 around a center pixel, so a total of eight neighboring pixels were considered. The difference between the center pixel and each of the surrounding pixels was calculated; if the difference was greater than zero, we assigned 1, else 0. In this way, each center pixel was represented by an 8-bit LBP code. The calculation of the LBP code is shown in Figure 3. In this figure, the top left corner is taken as the 7th bit and the bits are considered in a clockwise fashion until reaching the 0th bit. The resultant 8-bit binary number is then converted to its equivalent decimal number. This process is formulated in Equation (6), where (x_cen, y_cen) is the center pixel and i_k is one of the surrounding pixels.
A transition in a binary string is defined as a change from 0 to 1 or from 1 to 0. Strings that comprise at most two transitions are known as uniform strings, and the others are called non-uniform strings. The main purpose of using uniform patterns is to eliminate redundant features while capturing the information properly. In this work, the uniformity property was applied to the binary strings obtained by LBP, so most of the redundant (non-uniform) binary strings were eliminated. The obtained uniform strings were converted to their respective decimal values, and the histogram of those values was taken as the feature vector, as formulated in Equation (7), where L denotes the number of bins.
In this work, initially all the images were divided into 16 (= 4 × 4) blocks irrespective of the image dimension. The main purpose of the blocking was to preserve the local information of the image. Divisions were made uniformly over the entire image, and uLBP was then applied in each sub-block. The feature vectors obtained from the sub-blocks were concatenated to form the final feature vector. Here, the feature dimension of uLBP was 59 (= 8 × (8 − 1) + 3). As the image was divided into 16 sub-blocks, the final feature dimension becomes 944.
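The LBP coding and the uniformity test described above can be sketched as follows; the thresholding here takes neighbour minus centre (the common LBP convention), which is one reading of the difference rule stated in the text:

```python
import numpy as np

def lbp_code(window):
    # 3x3 window -> 8-bit LBP code, neighbours read clockwise from top-left
    c = window[1, 1]
    neighbours = [window[0, 0], window[0, 1], window[0, 2], window[1, 2],
                  window[2, 2], window[2, 1], window[2, 0], window[1, 0]]
    bits = [1 if n - c > 0 else 0 for n in neighbours]
    return int("".join(map(str, bits)), 2)

def is_uniform(code):
    # Uniform = at most two 0/1 transitions in the circular 8-bit string
    s = format(code, "08b")
    transitions = sum(s[i] != s[(i + 1) % 8] for i in range(8))
    return transitions <= 2

w = np.array([[9, 9, 9],
              [1, 5, 9],
              [1, 1, 1]])
print(lbp_code(w), is_uniform(lbp_code(w)))
```

Codes like 11110000 (one dark/bright edge) are uniform, while alternating patterns such as 10101010 are not; only the 58 uniform codes keep their own histogram bins, the rest share one, giving the 59-dimensional descriptor per block.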

Horizontal-Vertical Neighborhood Local Binary Pattern
hvnLBP is a very useful texture feature which was first proposed by Mistry et al. [34]. It acquires better contrast information among the neighborhood pixels, such as edges and corners. As with uLBP, a 3 × 3 window containing eight neighboring pixels was considered for hvnLBP. The surrounding pixels are denoted as L = {l_0, l_1, l_2, l_3, l_4, l_5, l_6, l_7}. In the case of hvnLBP, the comparison is done among the neighboring pixels themselves. By comparing those surrounding values, we get an 8-bit binary string, which is further converted to its equivalent decimal value. The process is formulated in Equations (8) and (9). An example of calculating hvnLBP is also shown in Figure 4.
In Equation (9), the term l_b can also be absent; in that case, only two pixel intensities participate in the comparison. As the binary string consists of 8 bits, the total number of possible values is 256. After obtaining the equivalent decimal values, the histogram was computed as described in Equation (7).
For generating discriminative facial representation, hvnLBP was combined with the 2-D Gabor filter. In this work, a total of 16 magnitude images of various wavelengths and orientations from Gabor filter were taken at the beginning. Finally, hvnLBP was applied on those 16 magnitude images. The feature vector obtained from each magnitude images were concatenated to get hold of the final feature vector. In a typical hvnLBP, there were total 256 features. As there were 16 magnitude images, the final feature dimension becomes 4096.

Proposed Work
In this paper, we propose a FS technique based on HSA, which we have named the supervised filter harmony search algorithm (SFHSA). SFHSA uses cosine similarity and mRMR combined with PCC. HSA, proposed by Lee and Geem in [53], has proved to be an efficient technique [52][53][54][55] for providing optimal solutions to real-life problems in terms of feasible computation time and memory usage. It simulates the procedure adopted by musicians to find the finest tune by selecting a particular combination of frequencies produced by various musical instruments. It optimizes an objective function by selecting an appropriate combination of solutions from an existing set of solutions through random search. In [56], an improved version of HSA was formulated which uses fine-tuning parameters of mathematical techniques to enhance the performance of HSA. Owing to the efficiency of HSA in finding the global optimum, it has been exploited in various works related to FS. The traditional HSA adjusts the pitch using a parameter named bandwidth (BW). In this phase of the algorithm, we have modified the pitch adjustment: instead of selecting adjacent features as in the conventional approach, features are adjusted by replacing their values with those of their cosine-similar features, subject to the satisfaction of certain conditions. The goodness of a subset was determined using mRMR.
The proposed algorithm is a supervised filter method. As mentioned before, it is used to optimize the features extracted for FER; the objective of SFHSA is to reduce the feature dimension while maintaining or increasing the accuracy score. Prior to the execution of the algorithm, each feature vector was divided into training and testing sets in the ratio of 2:1. The training set was then used to find the optimal feature subset. The algorithm is provided in Algorithm 1. The parameters that we have used include:
Harmony Memory Size (HMS)
Harmony Memory Considering Rate (HMCR)
Pitch Adjustment Rate (PAR)
Number of Iterations
The HMS determines the size of the harmony memory (HM), i.e., the number of feature subsets present in the HM. Throughout the process, HMCR was applied to decide whether a feature is to be selected from a feature subset in the HM or not. Its value lies in the range [0, 1], and based on experiments we initialized it to 0.8. The parameter PAR was used (upon satisfaction of a condition) to randomly select a feature whose cosine similarity with the pre-selected feature is within the range (−ε, ε), where ε is a randomly generated value in [−1, 1]. The value of PAR lies in the range [0, 1], and we fixed it to 0.5 on an experimental basis so that the probability of PAR being satisfied equals the probability of it not being satisfied. At the commencement of SFHSA, all the parameter values were initialized. The value of HMS was set to 15 so as to keep 15 feature subsets in the HM and thereby obtain more diverse feature subsets for better results, and the maximum number of iterations was set to 20. The initialization phase was followed by the random selection of feature subsets from the training set. In this phase, we randomly created m feature subsets, where m = HMS. The feature subsets were considered to have a dimension ranging from 80% to 90% of the dimension n of the actual feature set, which is to be reduced in the subsequent phases. These m feature subsets were used to populate the HM initially.
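The initialization phase described above can be sketched as follows. This is an illustrative snippet, not the authors' code; the helper name `init_harmony_memory` is ours, while the parameter values (HMS = 15, HMCR = 0.8, PAR = 0.5, 20 iterations, subsets holding 80-90% of the features) follow the paper.

```python
import random

# SFHSA control parameters, as reported in the paper.
HMS, HMCR, PAR, MAX_ITER = 15, 0.8, 0.5, 20

def init_harmony_memory(n_features, hms=HMS, rng=random):
    """Populate the HM with `hms` random feature subsets, each holding
    between 80% and 90% of the original n_features feature indices."""
    memory = []
    for _ in range(hms):
        size = rng.randint(int(0.8 * n_features), int(0.9 * n_features))
        memory.append(sorted(rng.sample(range(n_features), size)))
    return memory

# e.g., a uLBP feature vector of dimension 256
hm = init_harmony_memory(256)
```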
To find the cosine similarity of the features, each of the feature sets was normalized. The normalization was performed column-wise for each feature present in the feature set. The normalized value N(x) was calculated using Equation (10), where x is an attribute, curr(x) denotes the value of the attribute for the current instance, and min(x) and max(x) denote the minimum and maximum values of the attribute, respectively, over all instances.
Thereafter, the features whose maximum and minimum values were equal across all instances were filtered out, as they were not considered to make any significant contribution in the classification stage.
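The column-wise min-max normalization of Equation (10), combined with the constant-feature filtering step, can be sketched as follows (a minimal illustration; the function name is ours):

```python
def normalize_features(matrix):
    """Column-wise min-max normalization (Equation (10)).
    Columns whose minimum equals their maximum are constant across all
    instances and are filtered out, as they carry no information."""
    n_cols = len(matrix[0])
    kept, normalized_cols = [], []
    for j in range(n_cols):
        col = [row[j] for row in matrix]
        lo, hi = min(col), max(col)
        if lo == hi:  # constant feature: drop it
            continue
        kept.append(j)
        normalized_cols.append([(v - lo) / (hi - lo) for v in col])
    # transpose back to instance-major layout
    rows = [[c[i] for c in normalized_cols] for i in range(len(matrix))]
    return rows, kept
```

For example, a 3-instance set whose second feature is constant keeps only the first and third columns, each rescaled to [0, 1].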
The normalized values of the features were utilized to find the cosine similarity [57] between every pair of features in the feature set. Equation (11) measures the similarity between features p and q, where p and q represent any two features from a feature set, p_i and q_i denote the values of features p and q, respectively, for the ith instance, and n denotes the total number of instances. The cosine similarity values were stored in the form of a matrix for faster computation.
For instance, let f_1, f_2, . . ., f_n be the features in a feature set of dimension n; then the cosine similarity matrix is represented as given in Equation (12), where cosθ_a,b is the cosine similarity between the ath and bth features and a, b ≤ n.
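Equations (11) and (12) amount to a straightforward pairwise computation; the sketch below (helper names are ours) precomputes the symmetric matrix once, which is what makes the later look-ups fast:

```python
from math import sqrt

def cosine_similarity(p, q):
    """cos(theta) between two feature columns (Equation (11))."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = sqrt(sum(a * a for a in p)) * sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def cosine_matrix(columns):
    """Symmetric n x n similarity matrix of Equation (12),
    filled once so each pair is computed only a single time."""
    n = len(columns)
    m = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a, n):
            m[a][b] = m[b][a] = cosine_similarity(columns[a], columns[b])
    return m
```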
For each feature subset, a new feature subset was created using HMCR and PAR. Either a feature is selected from the feature subset in the HM (based on the cosine similarity value) or a random feature is selected from the existing feature set. For example, let there be a feature subset S = {f_2, f_8, f_6, f_5, f_3} in the HM, comprising features from the existing feature set, say F = {f_1, f_2, f_3, f_4, f_5, f_6, f_7, f_8, f_9, f_10}. If the HMCR condition is satisfied in an iteration, we select a feature, say f_5. If PAR is also satisfied, then f_5 is replaced by a cosine-similar feature, say f_2, and f_5 is removed. If neither condition is satisfied, a completely random feature, say f_7, is selected from the global feature set. Suppose f_2 has been used to replace f_5, and for another feature, say f_3, f_2 is again found to be its cosine-similar feature. In this case, both f_5 and f_3 are replaced by the single feature f_2, thus reducing the feature dimension. This selection of features for replacement is performed n times to improvise a new feature subset from the existing feature subset in the HM, where n is the dimension of S, and the improvisation of new feature subsets continues for the maximum number of iterations. Thus, in each iteration a new improvised feature subset (with reduced or equal dimension) was generated. The selection of an optimal feature subset using SFHSA is explained in Algorithm 1, whereas its flowchart is provided in Figure 5.
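The improvisation step just described can be sketched as follows. This is our reading of the procedure, not the authors' code: the literal (−ε, ε) interval from the paper is used (empty when ε ≤ 0, in which case the feature is kept), duplicates collapse through a set so the subset may shrink, and the `rng` argument is injectable only for reproducibility.

```python
import random

def improvise(subset, all_features, cos_mat, hmcr=0.8, par=0.5, rng=random):
    """One improvisation pass over a feature subset from the HM.
    With probability HMCR a feature is taken from the subset; with
    probability PAR it is swapped for a cosine-similar feature whose
    similarity lies in (-eps, eps); otherwise a random feature is drawn
    from the full feature set."""
    new = set()
    for fj in subset:
        if rng.random() < hmcr:
            chosen = fj
            if rng.random() < par:
                eps = rng.uniform(-1, 1)
                candidates = [fk for fk in subset
                              if fk != fj and -eps < cos_mat[fj][fk] < eps]
                if candidates:
                    chosen = rng.choice(candidates)
            new.add(chosen)  # repeats collapse, reducing the dimension
        else:
            new.add(rng.choice(all_features))
    return sorted(new)
```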
A decision is then made as to whether the improvised subset is better in quality than the previous feature subset. For determining the quality of the feature subsets, mRMR [58] was applied. In the evaluation of feature subsets, mRMR has proved to be an efficient technique [33,59,60,61]. The concept of mRMR involves maximizing the relevance R (the PCC [62] between a feature and the class) and minimizing the redundancy D (the PCC between the features in the subset). Features highly correlated with the class were considered relevant, and features highly correlated with each other were considered redundant. PCC is calculated using Equation (13), where x and y are two sets of values and x̄ and ȳ are their respective expectation values. The values of R and D were calculated using Equations (14) and (15), respectively, where x_i and x_j represent features from the feature subset S and c represents the facial class label.
The value of mRMR is calculated as the quality score of the feature subset, V(S), defined in Equation (16). During the decision-making process, if the quality score of the newly generated feature subset is found to be better than that of the worst subset in the HM, it replaces the worst subset, and the worst subset is then identified again from the updated HM. The entire process is iterated until the stopping criterion is met. The finishing stage of the process includes calculating the accuracy score using a classifier and generating the results, which are discussed in the next section. Therefore, the proposed HSA is a filter-based FS method built on mRMR. The use of PCC in mRMR makes the algorithm quite effective in selecting the best subsets. The extracted feature descriptors were refined using SFHSA, and the final selected features were chosen from the testing feature sets and passed through a classifier to find the recognition accuracy of the classification problem under consideration (here, FER). A schematic diagram of the proposed model is shown in Figure 6.
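As an illustration of the subset scoring, the sketch below assumes the common difference form V(S) = R − D (Equation (16) is not reproduced here, so this form, like the use of absolute PCC values so that strong negative correlation also counts, is our interpretation rather than the authors' exact implementation):

```python
from math import sqrt

def pcc(x, y):
    """Pearson's correlation coefficient (Equation (13))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def mrmr_score(columns, labels, subset):
    """V(S) = R - D: mean |PCC| between the subset's features and the
    class label (relevance, Equation (14)) minus the mean pairwise |PCC|
    inside the subset (redundancy, Equation (15)); larger is better."""
    r = sum(abs(pcc(columns[i], labels)) for i in subset) / len(subset)
    pairs = [(i, j) for i in subset for j in subset]
    d = sum(abs(pcc(columns[i], columns[j])) for i, j in pairs) / len(pairs)
    return r - d
```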

Results and Discussion
The SFHSA-based FS technique was applied to the features obtained from the images of two standard FER datasets, namely JAFFE and RaFD. The five feature sets obtained were uLBP, hvnLBP, Gabor-filter-based, HOG and PHOG features. We considered facial images of three dimensions: 32 × 32, 48 × 48 and 64 × 64. As a result, for the present FS problem, the total number of feature sets taken under consideration becomes 30 (= 2 × 5 × 3): 2 FER datasets, 5 feature sets and facial images of 3 dimensions. Table 2 presents the sizes of the 5 feature vectors produced using the 3 different dimensions of the facial images, along with their recognition accuracies. This table also highlights the size of the reduced-feature vector with the recognition accuracy for the two standard FER datasets. Table 2. Tabular representation of the details of original-feature size, reduced-feature size, accuracy score and their corresponding percentage of reduced-feature size with respect to original-feature size and change in accuracy score of 5 different feature vectors extracted from facial images of 3 different dimensions for the two popular FER datasets. As mentioned previously, each feature set was segregated into training and testing sets in the ratio of 2:1, and SFHSA was applied to the training set only. The testing set was used to obtain the accuracy score by selecting only the attributes (feature indices) present in the reduced version of the training feature set, defined as follows:
Accuracy score (%) = (#facial images successfully recognized / #total facial images in the test set) × 100
The initial HM consists of HMS (=15) randomly generated feature subsets having dimensions ranging from 80% to 90% of the original-feature vector. The detailed evaluation of the reduced-feature subsets obtained by applying SFHSA to the uLBP, hvnLBP, Gabor-filter-based, HOG and PHOG feature descriptors, along with their corresponding accuracy scores, is presented in Tables 3-7, respectively.
We have used the sequential minimal optimization (SMO) classifier with a linear kernel [33] to evaluate the recognition performance on the reduced-feature sets. This was done with the aim of achieving higher accuracy scores and also making it convenient to compare with past experimental results. In Table 3, detailed outcomes of the different FS techniques on feature sets extracted using the uLBP method are provided in terms of both reduced-feature dimension and recognition accuracy. Tables 4-7 show similar comparisons of the different FS techniques on feature sets extracted using the hvnLBP, Gabor-filter-based, HOG and PHOG feature descriptors, respectively.
In the case of uLBP, hvnLBP, Gabor-filter-based, HOG and PHOG features, our proposed SFHSA algorithm produced reduced-feature sets that were 65%, 67%, 82%, 69% and 60% smaller than the original-feature vectors, respectively, and also increased the recognition accuracies by up to 17%, 24%, 20%, 18% and 25%, respectively, over the original ones. Thus, it can be concluded that SFHSA demonstrates the best performance for PHOG features in terms of both accuracy score and dimension reduction on the JAFFE and RaFD datasets. Table 7 reflects the detailed outcomes after applying different optimization techniques to PHOG features. The content of the HM is presented to provide a better understanding of the reduced-feature subsets obtained at the end of the execution of the algorithm; this shows how the HM appears after SFHSA was executed on the feature vectors. In this regard, it is worth mentioning that for all other feature vectors, we have obtained similar content in the HM.
Tables 3-7 also highlight the detailed comparative results of the present experiment against some other standard optimization algorithms, namely simulated annealing (SA), GA, memetic algorithm (MA), mutation enhanced binary particle swarm optimization (ME-BPSO) [63], whale optimization algorithm-crossover mutation (WOA-CM) [64] and LHCMA [33]. The reduced-feature sets achieving the highest accuracy are marked in bold in the tables. Analyzing the observed outcomes, it can be said that our SFHSA-based FS technique has surpassed the above-mentioned techniques, with a significant increase in accuracy score compared to previous techniques. Out of 30 cases, our proposed technique achieves better results than all other techniques in 16 cases: 2 cases for Gabor-filter-based feature vectors (up to 8% increment), 3 cases for HOG feature vectors (up to 2.60% increment), 4 cases for PHOG feature vectors (up to 3.60% increment), 3 cases for hvnLBP feature vectors (up to 9% increment) and 4 cases for uLBP feature vectors (up to 2.40% increment). Thus, it is evident that our feature optimization technique reduced feature dimensionality and improved recognition accuracy. The proposed SFHSA performs quite well against wrapper-based algorithms, which supports the effectiveness of the proposed method. It was observed from the preceding experiment that the second and third best performing algorithms (in terms of accuracy scores) were LHCMA and WOA-CM. Therefore, we have also presented a performance comparison of these techniques with our proposed SFHSA (which achieves the highest accuracy score) in terms of three well-known statistical parameters, namely precision, recall and F-measure, in order to enhance the clarity of the comparison. Tables 8-12 present the performance comparison of SFHSA with LHCMA and WOA-CM for uLBP, hvnLBP, Gabor filter, HOG and PHOG features, respectively.
The comparison demonstrates that SFHSA has outperformed the other two techniques in 16 out of 30 cases, from which the strong FS capability of our technique can be inferred.
For each feature vector, we have presented the best reduced-feature set, in terms of both achieved recognition accuracy and reduced dimensionality, out of all HMS (=15) feature subsets in the HM. Since we obtained the best results for PHOG features on facial images of dimension 32 × 32, Figures 7 and 8 visually present the different reduced-feature sets with their recognition accuracies for the JAFFE and RaFD databases, respectively. In Figure 7, different colors denote the dimension (number of features) of the reduced-feature sets obtained by applying SFHSA to PHOG features extracted from images of dimension 32 × 32 in the JAFFE database, and the corresponding recognition accuracy is shown on top of each bar, with each bar representing a specific reduced-feature set in the HM. Similarly, Figure 8 uses distinguishable colors to specify the dimension (number of features) of the reduced-feature sets in the HM obtained by applying SFHSA to PHOG features extracted from images of dimension 32 × 32 in the RaFD database, with the corresponding recognition accuracies represented in the same fashion as in Figure 7.

Conclusions
In this paper, we have focused our attention on reducing the dimension of the feature sets obtained from facial expression images using five feature descriptors: uLBP, hvnLBP, Gabor filters, HOG and PHOG. The evaluation of the proposed methodology, called SFHSA, was carried out on two benchmark FER datasets, namely RaFD and JAFFE. The proposed algorithm was applied to the extracted feature sets, and reduced-feature subsets were obtained with higher accuracy scores. It is evident from the results that our FS technique has effectively filtered out redundant/irrelevant features and has also outperformed many existing FS techniques such as SA, GA and MA. Following the primary backbone of the traditional HSA, we have proposed a filter variant of the HSA. Cosine similarity was used for the adjustment of features, while the mRMR and PCC values were used to determine the feasibility of the optimal feature subsets. The performance of SFHSA was evaluated using the SMO classifier. The comparison presented enables us to conclude that our algorithm can be applied for FS in domains where the curse of dimensionality poses a serious challenge to researchers.