Face Gender Recognition in the Wild: An Extensive Performance Comparison of Deep-Learned, Hand-Crafted, and Fused Features with Deep and Traditional Models

Abstract: Face gender recognition has many useful applications in human-robot interactions as it can improve the overall user experience. Support vector machines (SVMs) and convolutional neural networks (CNNs) have been used successfully in this domain. Researchers have shown an increased interest in comparing and combining different feature extraction paradigms, including deep-learned features, hand-crafted features, and the fusion of both. Related research in face gender recognition has been mostly restricted to limited comparisons of deep-learned and fused features with the CNN model or of only deep-learned features with the CNN and SVM models. In this work, we perform a comprehensive comparative study to analyze the classification performance of two widely used learning models (i.e., CNN and SVM) when they are combined with seven features that include hand-crafted, deep-learned, and fused features. The experiments were performed using two challenging unconstrained datasets, namely, Adience and Labeled Faces in the Wild. Further, we used t-tests to assess the statistical significance of the differences in performance with respect to the accuracy, F-score, and area under the curve. Our results showed that SVM performed best with fused features, whereas CNN performed best with deep-learned features. CNN outperformed SVM significantly at p < 0.05.


Introduction
Gender recognition is vital in interconnected information societies; it has applications in many domains such as security surveillance, targeted advertising, and human-robot interactions. Face gender recognition plays a key role in the latter domain since it allows robots to adapt their behavior based on the gender of the interacting user, which increases user acceptance and satisfaction [1]. A wide range of contributions exist in the literature that present a variety of frameworks [2][3][4][5][6][7], feature descriptors [8][9][10][11][12][13], classification model architectures [14][15][16], and benchmark datasets [17] with state-of-the-art results. Despite the achieved success, face gender recognition is still considered a challenging and unsolved problem; therefore, researchers continue to seek a solution [15,18].

Literature Review
Gender recognition is a domain where high state-of-the-art accuracy has been achieved by SVMs and CNNs [21,33,34,37,38]. These results, however, have been attributed to the characteristics of the dataset used [17,21,22,31]. For example, many of the early efforts in gender recognition used constrained datasets that included frontal face images taken under controlled conditions of facial expressions, illumination, and background [19][20][21]; hence, the resulting models cannot achieve the same performance on images taken in the wild by surveillance or robot cameras. Building a gender recognition model based on face images is similar to other computer vision tasks; the process has three main stages: selecting the benchmark dataset, feature extraction and selection, and classification. In the text below, we highlight the main efforts made in each stage for progress in the field. Furthermore, we summarize the results of the most relevant works in Table 1.
A dataset is an integral part of gender recognition research. Selecting an appropriate dataset to benchmark the proposed approach is a crucial decision because datasets introduce different challenges, such as pose variations, illumination variations, and occlusions. Gender recognition datasets can be broadly categorized into constrained and unconstrained datasets. The former includes frontal face images that were taken under controlled conditions of facial expressions, illumination, and background. Numerous early studies have been criticized for benchmarking their works with constrained datasets, such as FERET [19][20][21][22] and UND [20], because they do not reflect real-world situations [23,39]. Therefore, many studies have addressed the challenges posed by images taken under uncontrolled conditions, using, for example, the LFW [20,22] and Adience [17,23,32,40,41] datasets and datasets with occlusions (e.g., sunglasses and hats), such as AR [20,22], Gallagher [32], and MORPH [40]. The authors in [17] offered a unique unconstrained and unfiltered dataset. Torralba and Efros [42] argued that the most popular datasets were biased, and they emphasized that using a single dataset for training and testing was not representative of the variations that exist in the real world. Therefore, to simulate real-life situations, recent studies [17,21,22,31] have adopted a cross-data approach, where a model is trained on one dataset and tested on another. Other contributions [38,41] used a fusion of multiple constrained and unconstrained datasets for testing purposes. Moreover, some efforts have targeted a specific type of image, such as low-resolution thumbnail faces [43] and low-frequency components of mosaic 8 × 8 images [44].
A fundamental problem was to determine what features in a person's face can help determine the person's gender. A wide range of studies have been devoted to improving the extraction and selection of features [45][46][47]. In recent years, there has been an increasing amount of computer vision literature that distinguishes between hand-crafted features and deep-learned features [45]. The hand-crafted features are designed beforehand by human experts, whereas deep-learned features are learned directly from the data using CNN. Furthermore, some studies reported performance improvements when the two features were combined [31,32]. One of the early works on the hand-crafted features is [48], where the authors combined the 3D distances with multiple measurements (such as the distance between the key points in the face, the ratios, and the angles between the key points) into a single function. Tamura et al. [44] divided the human face into four parts to determine which part contributed the most to identifying the gender. The results revealed that the face shape and cheek bone shape are the most important aspects. Further, the authors of [49] identified nine facial features that varied and hence could be used to distinguish males from females, namely, the hairline, eyebrows, eyes, distance between the eye and eyebrows, nose, lips, chin, cheeks, and face shape.
Hand-crafted features can be extracted to capture the face shape using the histogram of oriented gradients (HOG) [50], texture using the local binary pattern (LBP) [51], and intensity using the gray level of each pixel [20]. Geometric features can also be extracted, such as scale invariant feature transform (SIFT) [52] and Haar-like features [21]. Jabid et al. [19] represented face images using a novel texture descriptor, the local directional pattern (LDP), and Shobeirinejad and Gao [10] proposed interlaced derivative patterns, which outperformed the LBP and LDP features. A number of authors have reported performance improvements when different types of hand-crafted features are fused, such as domain-specific and trainable features [18], trainable shapes and color features [53], LBP and local phase quantization features [8], shape and texture features [54], LBP and radii spatial scales features [20], appearance-based and geometric-based features [55], appearance and geometry features [12], gradient and Gabor wavelets features [13], and LBP, SIFT, and color histogram [52]. In contrast, Alexandre [11] showed that a single feature from different scales could outperform multiple features at a single scale. In [9], adaptive features were proposed, which resulted in accuracy improvements in the SVM model. The research in [31] showed that the fusion of hand-crafted features could improve the SVM performance.
A growing body of literature has investigated deep-learned features and how gender recognition accuracy differs when they are compared and combined with hand-crafted features. Nanni et al. [29] proposed a generic computer vision system that extracted, compared, and combined hand-crafted features with deep-learned features to train an SVM model using several datasets from different domains. The authors showed that a fusion of both hand-crafted and deep-learned features provided the best performance with SVM. Ozbulak et al. [34] explored transfer learning using generic and domain-specific models to extract deep-learned features to train different CNN and SVM models. Their results showed that the use of deep-learned features extracted using domain-specific models could improve the accuracy of all the models. In [56], the authors proposed the joint features learning deep neural networks, which could learn from the joint high-level and low-level features. The proposed architecture outperformed CNNs, SVM with face pixels, and SVM with LBP features. In [35], the authors compared hand-crafted and deep-learned features by training a CNN model for pedestrian gender recognition. Their results showed that hand-crafted and deep-learned features performed comparably on small-sized homogeneous datasets, but the latter performed significantly better on heterogeneous data. In [57], the authors showed that feeding deep-learned features into an SVM rather than Softmax in VGGNet-16 provided better results. The fusion of deep-learned and hand-crafted features achieved better results than using only deep-learned features with ensemble learning [58].
Studies in the field of gender recognition have only focused on comparing the two feature extraction paradigms with one model [32], a single paradigm with multiple models [33,34], or limited feature extraction methods and models [21,37,38,81]. Because of the variations in the experimental setups, the results from different studies cannot be compared. Therefore, it is not clear yet how the different feature extraction paradigms (i.e., hand-crafted, deep-learned, and fused features) would perform when combined with different models (including CNN and SVM); this concern is addressed in this research.

Methodology
We adopted an experimental methodology to compare the performances of two classification methods and seven feature extraction methods in the domain of gender recognition with respect to three performance measures. In addition, we performed a statistical analysis of the obtained results using t-tests to assess the statistical significance of the differences in performance.

Features Extraction
We applied seven feature extraction methods, which can be divided into three main categories: hand-crafted features, deep-learned features, and fused features.

Hand-Crafted Features
Hand-crafted features can be categorized into global features, pixel-based features, and appearance-based features. A feature extraction method was selected from each category based on the previous usage by the community in the gender recognition domain. All the methods are well-known and widely used in many domains. We briefly explain each method below.
Local Binary Pattern (LBP): This is a simple yet effective pixel-based texture descriptor that was originally proposed by Ojala et al. [51]. LBP is one of the most commonly used hand-crafted feature extraction methods in gender recognition [31,34,69,71][82][83][84]. The original descriptor assigns a binary digit to each pixel in a 3 × 3 neighborhood by comparing its intensity value with that of the central pixel, which acts as a threshold. A one is assigned to the pixel if its value is greater than or equal to that of the central pixel; otherwise, a zero is assigned. The binary value for the central pixel is then computed by concatenating the eight binary digits of the neighboring pixels in a clockwise direction. LBP was later improved by using flexible neighborhood sizes [85]. The improved descriptor has two main parameters (P, R), which define the circular neighborhood: P is the number of sampling points on a circle of radius R. In our experiments, we used P = 24 and R = 3. The resulting LBP features are of size 26.
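As a minimal illustration of the original 3 × 3 operator described above (not the circular P = 24, R = 3 variant used in the experiments), the following sketch computes the LBP code of a single central pixel. The function name `lbp_code_3x3`, the starting neighbor (top-left), and the bit order are assumptions made for illustration; the text only specifies a clockwise traversal.

```python
def lbp_code_3x3(patch):
    """Return the basic LBP code of the center pixel of a 3x3 patch."""
    center = patch[1][1]
    # Neighbors visited clockwise, starting at the top-left corner (assumed start).
    clockwise = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    # A neighbor contributes 1 if it is >= the center, otherwise 0.
    bits = [1 if patch[r][c] >= center else 0 for r, c in clockwise]
    # Concatenate the eight digits; the first neighbor is the most significant bit.
    code = 0
    for bit in bits:
        code = (code << 1) | bit
    return code


patch = [
    [6, 5, 2],
    [7, 6, 1],
    [9, 8, 7],
]
print(lbp_code_3x3(patch))  # bits 10001111 -> 143
```

The reported feature size of 26 is consistent with a rotation-invariant uniform LBP histogram, which has P + 2 bins for P = 24, although the text does not state which LBP variant was used.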
Histogram of Oriented Gradient (HOG): This is an appearance-based descriptor that extracts the gradients and orientations of edges in an image to describe the structure or shape of the object. It was popularized by Dalal and Triggs in 2005 [50] and has been applied successfully for face gender recognition [71]. The HOG features are extracted as follows. First, we compute the gradient of each pixel in both the x and y directions. Second, using the gradients, we calculate the magnitude and direction of each pixel. Third, we divide the image into small cells, and we compute the histogram of the gradients for each cell. Next, multiple cells are combined to form a block, and normalization is applied. Lastly, the normalized histograms of the blocks are combined to form the HOG features. Multiple parameters can be tuned to improve the accuracy of this descriptor, including the cell size, the overlap between cells, block normalization, and types of blocks (either rectangular R-HOG blocks or circular C-HOG blocks). The following values were used in our experiments with R-HOG blocks: cell size = (8,8), block size = (16,16), and number of orientation bins = 9. The resulting features are of size 1764.
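The descriptor length follows directly from the block layout. The sketch below reproduces the arithmetic under two assumptions that the text does not state: the image is 64 × 64 pixels when HOG is computed, and blocks slide by one cell (8 pixels). The helper name `hog_length` is hypothetical.

```python
def hog_length(width, height, cell=8, block=16, bins=9, stride=None):
    """Length of an R-HOG descriptor for a given image size and block layout."""
    stride = cell if stride is None else stride  # assumed: blocks shift by one cell
    blocks_x = (width - block) // stride + 1
    blocks_y = (height - block) // stride + 1
    cells_per_block = (block // cell) ** 2
    # One histogram of `bins` values per cell, for every cell of every block.
    return blocks_x * blocks_y * cells_per_block * bins


print(hog_length(64, 64))  # 7 * 7 blocks * 4 cells * 9 bins = 1764
```

Under these assumptions the result matches the reported size of 1764; other input sizes or strides would give a different length.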
Principal Component Analysis (PCA): This is a global feature extraction method that uses a linear transformation to map the feature space into lower dimensions while maximizing the variance. PCA can be applied to the images' raw pixel values or to other hand-crafted features, resulting in second-order uncorrelated features. To extract the PCA features, the dataset must first be standardized. Then, we identify the relationships between the features by computing a covariance matrix for the dataset. Next, we perform eigendecomposition to obtain the eigenvalues and eigenvectors of the matrix. The principal components of the dataset are the eigenvectors with the greatest eigenvalues. The user may decide to keep all or only a subset of the principal components. Lastly, the selected principal components are transposed and multiplied by the transpose of the original dataset, which yields the PCA features. In this work, PCA was applied to the images' raw pixel values, and the first two components were used.
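The PCA steps described above can be written compactly as follows. The notation is an assumption introduced here for illustration: X is the standardized data matrix with n images as rows and d pixel values as columns, and m is the number of retained components (m = 2 in this work).

```latex
\begin{aligned}
C &= \frac{1}{n-1}\, X^{\top} X
  && \text{(covariance matrix of the standardized data)} \\
C\, v_j &= \lambda_j v_j, \quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d
  && \text{(eigendecomposition)} \\
W_m &= [\, v_1 \;\; v_2 \;\; \cdots \;\; v_m \,]
  && \text{(eigenvectors with the greatest eigenvalues)} \\
Z^{\top} &= W_m^{\top} X^{\top}
  && \text{(PCA features, one column per image)}
\end{aligned}
```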

Deep-Learned Features
We applied deep transfer learning by using a CNN as a fixed feature extractor (see the upper part of Figure 1). Similar to the methods used in [34,75,86], we used a VGG-16 pre-trained on ImageNet [87] and removed the last fully connected layer. We treated the rest of the ConvNet as a fixed feature extractor for our datasets. The input layer accepted images of size 224 × 224 with three channels: red, green, and blue. The input images went through a series of hidden convolution layers, which used the rectified linear unit activation function. Some layers were followed by a max-pool layer, which was performed over non-overlapping windows of size 2 × 2 with a stride of two. The dimension of the deep-learned features was 7 × 7 × 512.
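The 7 × 7 × 512 output size can be checked with simple arithmetic. The sketch below assumes the standard VGG-16 layout, in which the convolutions preserve the spatial size and there are five 2 × 2 max-pool layers; the function name `vgg16_feature_shape` is hypothetical.

```python
def vgg16_feature_shape(input_size=224, num_pools=5, channels=512):
    """Spatial shape after the last max-pool of a VGG-16-style network."""
    size = input_size
    for _ in range(num_pools):
        size //= 2  # each 2x2 max-pool with stride 2 halves height and width
    return (size, size, channels)


shape = vgg16_feature_shape()
print(shape)                            # (7, 7, 512)
print(shape[0] * shape[1] * shape[2])   # 25088 values when flattened
```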

Fused Features
The fusion of deep-learned and hand-crafted features aims to provide a holistic description of the images. As mentioned previously, several studies have reported that fusing specific hand-crafted features with images can improve the performance of CNNs [30,32]. For this purpose, the extracted deep-learned features were concatenated with the hand-crafted features, namely LBP, HOG, and PCA, yielding a fusion of HOG and deep-learned features, a fusion of LBP and deep-learned features, and a fusion of PCA and deep-learned features. The fused features are then fed to the classification model, as shown in Figure 1.
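A minimal sketch of fusion by concatenation is given below. It assumes that the deep-learned features are flattened to a one-dimensional vector before being joined with the hand-crafted vector; whether any normalization or scaling was applied before fusion is not stated, so none is shown. The function name `fuse` and the placeholder values are hypothetical.

```python
def fuse(deep_features, hand_crafted):
    """Concatenate two 1-D feature vectors into one fused vector."""
    return list(deep_features) + list(hand_crafted)


deep = [0.0] * (7 * 7 * 512)  # flattened deep-learned features (placeholder values)
lbp = [0.0] * 26              # LBP histogram (placeholder values)
fused = fuse(deep, lbp)
print(len(fused))  # 25088 + 26 = 25114
```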

Dataset
There are mainly two types of benchmark datasets that have been used in the literature. The first type is the constrained dataset, in which images are taken under controlled conditions. The second type is the unconstrained dataset, in which images are taken under uncontrolled conditions. In this study, we used two challenging and commonly used unconstrained benchmark datasets, which are briefly described below.

Labeled Faces in the Wild
We used the LFW deep-funneled image dataset [36]. LFW consists of over 13,000 face images of real people of both genders collected from the web. The face images vary in image quality, facial expression, head pose, illumination, and occlusion. Samples are shown in Figure 2. We used the deep-funneled version of the dataset because it is the best version available in terms of achieved accuracy. In this version, the face images were aligned using deep learning [36]. Similar to [20] and [39], we used a subset of the dataset. The original dataset was unbalanced; therefore, we performed undersampling of the majority class to create a balanced dataset of 6000 images. Further, following [86], we resized all the images to 224 × 224 so that they could be processed by the VGG-16 model. The dataset was divided into five balanced folds to perform cross validation.
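The balancing and fold construction can be sketched as follows. This is an illustrative procedure under stated assumptions, not the authors' exact code: samples of the majority class are dropped at random, and each class is dealt round-robin into five folds so that every fold stays balanced. The function names, the seed, and the toy labels are hypothetical.

```python
import random


def undersample(labels, seed=0):
    """Indices of a class-balanced subset obtained by dropping majority samples."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    n_min = min(len(indices) for indices in by_class.values())
    keep = []
    for indices in by_class.values():
        keep.extend(rng.sample(indices, n_min))  # random subset of size n_min
    return sorted(keep)


def balanced_folds(indices, labels, k=5, seed=0):
    """Split indices into k folds with equal class counts per fold."""
    rng = random.Random(seed)
    by_class = {}
    for idx in indices:
        by_class.setdefault(labels[idx], []).append(idx)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)  # deal each class round-robin
    return folds


labels = ["male"] * 70 + ["female"] * 30  # toy imbalanced labels
kept = undersample(labels)
folds = balanced_folds(kept, labels)
print(len(kept), [len(fold) for fold in folds])  # 60 [12, 12, 12, 12, 12]
```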

Adience
This dataset is one of the most challenging available datasets because it includes more images and subjects than other available datasets, such as Gallagher and PubFig [17]. It contains more than 26,000 images of over 2000 people uploaded to Flickr.com public albums. According to the authors, the faces in the images were first detected using a Viola and Jones face detector [88], and the facial feature points were then identified by a modified version of the method in [89]. In this research, we used the whole dataset in its aligned and cropped face image version, which was already divided into five folds for cross validation [17].

SVM
SVM is a widely used learning model, which is applied for classification and regression. The basic idea of SVM is to separate the data by finding a hyperplane that maximizes the margin between the two classes of data. The margin represents the distance between the data points from each class that lie closest to the hyperplane, known as support vectors. SVM uses a kernel function to map non-linearly separable data into a higher dimensional feature space, where it becomes linearly separable. SVM performance can be optimized by tuning the kernel, C, and gamma parameters. Commonly used kernels include the linear, RBF, and polynomial kernels. The parameter C is used for regularization; if C is set to a large value, a small margin will be used for optimization and vice versa. Gamma is set when a Gaussian RBF kernel is used. Features are fed directly to SVM, but in the case of the deep-learned features, they are first flattened from 7 × 7 × 512 to a one-dimensional vector of size 25,088. In this work, we used SVM with an RBF kernel. The values of the parameters are C = 10 and gamma = 0.001.
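For reference, the Gaussian RBF kernel in its usual form, K(x, y) = exp(−gamma · ||x − y||²), can be sketched as below with the gamma value reported above. The text does not say which SVM implementation was used, so this only illustrates the kernel itself; the function name `rbf_kernel` is hypothetical.

```python
import math


def rbf_kernel(x, y, gamma=0.001):
    """Gaussian RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))   # identical vectors give 1.0
print(rbf_kernel([0.0, 0.0], [10.0, 0.0]))  # exp(-0.1), similarity decays with distance
```

A smaller gamma makes the kernel decay more slowly with distance, so each support vector influences a wider region of the feature space.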

CNN
As explained previously, the deep-learned features were extracted using a pre-trained VGG-16 model. The last max-pooling layer in the model was connected to a global average pooling layer to convert the image features from 7 × 7 × 512 to 1 × 1 × 512. Then, we trained three dense layers for our dataset with two dropout layers with 0.5 probability to avoid overfitting. The Softmax function was used on the last layer to convert the layer output to a vector that represented the probability distribution over the two classes. In our experiments, the CNN was trained for 2000 epochs with a batch size of 128, an Adam optimizer, and the binary cross-entropy loss function.
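Two of the operations mentioned above, global average pooling and Softmax, can be illustrated in plain Python. This is a sketch of the operations only, not of the trained network: the dense and dropout layers are omitted, and the function names and input values are hypothetical.

```python
import math


def global_average_pool(feature_map):
    """Average an H x W x C nested list over H and W, returning C values."""
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    return [
        sum(feature_map[i][j][c] for i in range(height) for j in range(width))
        / (height * width)
        for c in range(channels)
    ]


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


feature_map = [[[1.0] * 512 for _ in range(7)] for _ in range(7)]
pooled = global_average_pool(feature_map)
print(len(pooled))          # 512 values, one per channel
print(softmax([0.0, 0.0]))  # [0.5, 0.5] for two equally scored classes
```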

Performance Evaluation
Unlike most existing efforts in the literature, which adopt the classification rate as the only performance measure, we recognize the importance of looking at the performance of a classifier from different angles [90]. Therefore, we evaluate the performance of the classification models with respect to three important metrics, namely, accuracy, F-score, and AUC. Investigating the performance with respect to different metrics can help the community improve the performance of classifiers in this domain. Further, the k-fold cross-validated paired t-test is applied to assess the statistical significance of the difference between two models A and B according to Equation (1) below.
$$t = \frac{\bar{p}\,\sqrt{k}}{\sqrt{\sum_{i=1}^{k} (p_i - \bar{p})^2 / (k - 1)}} \qquad (1)$$
where $k$ is the number of folds, $p_i = p_i^A - p_i^B$ is the difference between the model performances in the $i$th iteration, and $\bar{p} = \frac{1}{k}\sum_{i=1}^{k} p_i$ is the average difference between the model performances.
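A minimal sketch of this test statistic in its standard form is shown below. The per-fold accuracies are hypothetical values chosen only to exercise the function; they are not results from this study. The function name `paired_t_statistic` is also an assumption.

```python
import math


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the k-fold cross-validated paired t-test."""
    k = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]  # p_i = p_i^A - p_i^B
    mean_diff = sum(diffs) / k                            # average difference
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (k - 1)
    return mean_diff * math.sqrt(k) / math.sqrt(variance)


model_a = [0.87, 0.88, 0.86, 0.89, 0.87]  # hypothetical per-fold accuracies
model_b = [0.85, 0.85, 0.85, 0.85, 0.85]  # hypothetical per-fold accuracies
t = paired_t_statistic(model_a, model_b)
print(round(t, 3))
```

The statistic is compared against a t distribution with k − 1 degrees of freedom. With five folds, a two-sided test at the 0.05 level rejects the null hypothesis when |t| exceeds approximately 2.776.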

Results and Discussion
The experimental results are shown in Tables 2-4. Table 2 is quite revealing in several ways. First, we can observe that, on average, SVM performs comparably with HOG and LBP features, whereas it has slightly lower accuracy with the PCA features. Yet, when deep-learned features are used, SVM accuracy increases by 12.95% as compared with the best performance with hand-crafted features. However, what is interesting in our results is that the best SVM performance is achieved when fused features are used: the classifier achieves an increase in accuracy of at least 22.40% and 9.45% as compared with hand-crafted and deep-learned features, respectively. Our SVM results with deep-learned features outperform those reported in [33], where SVM with dropout and oversampling was trained on the Adience dataset. Next, we considered the CNN model. We observed that the CNN model had the best performance with deep-learned features. Table 2 shows that the model accuracy is 86.60% with deep-learned features; however, this accuracy drops by at least 5.65% when fused features are used. These results contradict earlier findings by [32], which showed that feeding hand-crafted features to a CNN can improve its performance. This difference can be explained by the fact that only Gabor filters were used in [32] as hand-crafted features. Furthermore, the CNN accuracy achieved in this research is higher than that reported in [41], where a CNN model trained on the Adience dataset achieved 84% accuracy.
On comparing the SVM and CNN performances with different types of features, we can see that the CNN model with deep-learned features outperforms the best SVM result with fused features on the Adience dataset. However, opposite results were obtained on the LFW dataset. Our t-test shows that the result on the Adience dataset (p = 0.0002) is significant, whereas the result on LFW (p = 0.093) is insignificant at p < 0.05. These results suggest that CNN with deep-learned features is superior to SVM using any type of feature. These results further support the observations from earlier studies [33]. Similar trends can be observed in Tables 3 and 4, where the performances are presented with respect to the F-score and AUC, respectively. In both tables, SVM exhibits the worst average performance with hand-crafted features. The model's average performance improves when deep-learned features are used, and further improvement is achieved with fused features. For CNNs, fused features yield worse performance than deep-learned features. In addition, similar to the observations in Table 2, with respect to the F-score the CNN model performs significantly better at p < 0.05 than the best-performing SVM with fused features on the Adience dataset (p = 0.002); however, the difference in performance between the SVM with fused features and the CNN with deep-learned features on the LFW dataset is insignificant (p = 0.123). Similar observations apply to the AUC, with p = 0.00003 on the Adience dataset and p = 0.098 on the LFW dataset.

Conclusions
Face gender recognition plays a key role in human-robot interactions since it allows robots to adapt their behavior based on the gender of the interacting user, which increases user acceptance and satisfaction. The main goal of the current study was to comprehensively assess the performance of the most successful machine learning models in gender recognition, namely CNN and SVM, when combined with seven common feature extraction methods that included hand-crafted, deep-learned, and fused features. Previous studies on the subject have been mostly restricted to making limited comparisons of hand-crafted and deep-learned features with one model [27,46] or deep-learned features with multiple models [16,21]. Furthermore, contradictory findings have been reported about the best-performing combination in the latter category. For this purpose, we performed a comparative analysis of the CNN and SVM models when trained using three hand-crafted features (HOG, LBP, and PCA), deep-learned features (using transfer learning to extract features from a pre-trained VGG-16 model), and a fusion of both features; this analysis yielded seven sets of features. We used the most challenging datasets available, namely, Adience and LFW, and we presented the performance with respect to the accuracy, F-score, and AUC.
The most significant findings from this study are as follows. (1) SVM performs best when trained on a fusion of hand-crafted and deep-learned features, followed by deep-learned features alone; its worst performance is exhibited when it is trained on hand-crafted features. (2) CNN performance decreases when the deep-learned features are fused with hand-crafted features, including HOG, LBP, and PCA. (3) The CNN model outperforms SVM with all three feature extraction paradigms.
The results of this study indicate that although deep-learned features can enhance the performance of SVM, CNN still exhibits superior performance in the gender recognition domain. The reported results are possibly influenced by the fact that the Adience dataset is much larger than LFW (26,000 vs. 6000 images) but is also a more challenging dataset since, unlike LFW, it contains images of individuals from eight age groups [17]. A natural progression of this research would be to analyze the performance using other hand-crafted features, such as SIFT and Gabor filters, and with deep-learned features extracted by CNNs of varying architectures and with fine tuning. Another possible area for future research would be investigating whether the findings of this research hold with cross-data training, where a model is trained on one dataset and tested on a different dataset.