An Intelligent Gender Classification System in the Era of Pandemic Chaos with Veiled Faces

Abstract: In a world of chaos, the pandemic has driven individuals around the globe to wear face masks to prevent the virus's transmission; however, this has made it difficult to determine the gender of the person wearing a mask. Gender information is part of soft biometrics, which provides extra information about a person's identification; thus, identifying gender from a veiled face is among the urgent challenges that must be addressed in the next decade. Therefore, this study exploited various pre-trained deep learning networks (DenseNet121, DenseNet169, ResNet50, ResNet101, Xception, InceptionV3, MobileNetV2, EfficientNetB0, and VGG16) to analyze the effect of the mask when identifying gender from facial images of human beings. The study comprises two strategies. The first trains the models using facial images with and without masks, while the second trains the pre-trained models using images with masks only. Experimental results reveal that the DenseNet121 and Xception networks performed well for both strategies. In addition, the InceptionV3 network outperformed all others by attaining 98.75% accuracy for the first strategy, whereas EfficientNetB0 performed well for the second strategy by securing 97.27%. Moreover, the results suggest that face masks evidently impact the performance of state-of-the-art pre-trained networks for gender classification.


Introduction
Gender classification is significant in several contexts. Gender information is part of soft biometrics, which provides extra information about a person's identification. Furthermore, it can increase facial recognition performance, which is considered one of the most useful biometric features and has more benefits than other biometric systems. As a result, it is frequently employed to deliver advanced analysis in human-computer interaction in several applications. Gender classification has been researched for decades and has attracted substantial attention from researchers and expanded fast owing to its usefulness in providing secure and dependable security for enterprises, organizations, face monitoring, airports, etc.
Gender detection can be applied in different areas using different types of data such as voice [1], text [2], speech [3], and images [4]. For instance, the authors of [1] explored gender classification based on a person's speech using audio preprocessing approaches that can be offered for the creation of efficient software systems for gender detection based on audio recordings. To benefit legal investigation, marketing analysis, advertising, and recommendation areas, Ref. [2] investigated an effective method to classify gender from the text of Twitter tweets using natural language processing (NLP) techniques (bag of words and word embedding) and traditional machine learning algorithms such as support vector machine (SVM), naive Bayes (NB), and logistic regression (LG). To improve the accuracy of an emotion recognition system employing speech data, Ref. [3] proposed a hybrid scheme that combines the random forest recursive feature elimination (RF-RFE) technique for feature selection with the gradient boosting machine (GBM) approach for gender categorization. Moreover, studies such as [5,6], where facial recognition is performed using various machine learning classifiers, also encourage scientists to expand the models to gender classification.
Coronavirus infection, which is a contagious disease that mostly spreads via direct or indirect contact with an affected individual [7], has driven individuals all over the world to wear face masks to prevent the virus's transmission [8], making it difficult to determine the gender of the person wearing a mask. Even without a facial mask, the analysis of a human face in images is a tedious procedure, since the human face in an image might vary owing to changes in position, orientation, and many additional factors such as photo resolution and lighting conditions. To overcome this issue, this study employed deep neural network (DNN) models that produce accurate gender prediction results. Figure 1 shows the workflow of the proposed scheme for gender classification. The pre-processing phase removes duplicate and unnecessary data, eliminates the outliers, and resizes the images to a fixed size of 299 × 299. Later, the pre-trained networks utilize this imaging dataset to determine the gender of each individual wearing a face mask. Besides extracting relevant features, DNN models have excellent computer vision capabilities for image recognition tasks [9]; for example, convolutional neural networks (CNNs) are often employed for visual image evaluation [10]. Deep learning (DL) algorithms, considered a subset of artificial intelligence (AI), focus on computers learning and improving on their own by analyzing several algorithms [11]. A CNN takes an input image and assigns importance, through learnable weights and biases, to distinct areas of the image in order to differentiate it from other images. Compared to other classification algorithms, a ConvNet requires substantially less pre-processing; thus, this study exploits DL-based pre-trained models that use facial images of humans with different types of face masks. Broadly, the major contributions are outlined as follows:
• An extensive review to show that gender detection using face images with masks is in its infancy.

• To ensure smoothness, this work employs nine different pre-trained deep learning networks.
• To analyze the effect of wearing face masks on gender classification, the study deploys two strategies: models trained with images where humans wear face masks, and an extended version that incorporates facial images without masks.
• To lessen the false positive rate, it exploits a technique to remove outliers.
• To check the robustness, this work is applied to unseen images and the performance is measured through several performance metrics.

The structure of the paper is as follows: Section 2 examines the numerous studies that have been conducted on the subject. Section 3 outlines the suggested technique. Section 4 summarizes the experiment's findings and provides a comparison/discussion, while Section 5 presents the conclusion.

Literature Review
Gender categorization has been studied extensively using a variety of methodologies and approaches. For instance, the authors of [4] presented gender prediction for facial images or real-time video using CNNs. They carried out face detection, cropping, and resizing as pre-processing steps and proposed three CNN-based models with different architectures. They found that a CNN model with a deeper network (more layers) produces the best results. The authors of [12] aimed to classify a person's gender and emotions in real time or by using the person's image on a smartphone or a hard copy of the picture. Gender detection was achieved by obtaining the softmax value from real-time CNNs. They underlined that, during gender identification, the essential item to consider is face detection, as well as appropriate and relevant classification of facial features in low lighting and imperfect situations.
Similarly, the authors of [13] also proposed a CNN to classify gender and age by training on ten thousand grayscale human facial images. They exploited a pre-trained ResNet model for training, while the leading process of their study identified faces using the Haar cascade frontal face as the default classifier that then classified gender based on those faces. To find the best features of the iris for classifying gender using NIR images, [14] has proposed five different experiments: using whole features from normalized images, using a transfer learning approach with a VGG19 model, selecting the most predominant blocks using a genetic algorithm (GA), selecting the most predominant pixels using p-values, and encoding the images using a quaternionic code with 4 bits per pixel. Based on the experimental results, they claimed that the quaternionic code-based scheme outperformed others by securing the highest accuracy, 95%, for the right iris and 93% for the left iris, with 2400 selected features.
The authors of [15] examined the performance of gender classification using deeper CNNs trained on different facial components. The results demonstrated that their proposed strategy worked well with larger crop sizes, as it achieved promising efficiency. Their research showed that, in comparison to the eyes, the proposed approach can accurately determine gender from the mouth, nose, and face. The authors of [16] provided a gender classification technique that combines image processing techniques and data mining methodologies. The system performs typical image processing procedures such as acquisition, pre-processing, feature extraction using the LBG vector quantization approach, and classification using data mining methods such as naive Bayes, SVM with a polynomial kernel, SVM with a radial basis function (RBF) kernel, and k-nearest neighbors (kNN). The classification findings of all the classifiers demonstrate that the male classification rate is higher than the female rate.
A fast gender classification method for frontal facial images using features selected from the mouth and chin was proposed in [17] using two standard classifiers: SVM and a probabilistic neural network (PNN). The method extracts the lower part of frontal face images using a geometric model, builds a gray level co-occurrence matrix (GLCM) from the retrieved image, extracts features from the GLCM, and classifies the face by gender. The results show that SVM outperforms the PNN with 94.34% accuracy. Likewise, the authors of [18] investigated a gender classification method based on multi-level local phase quantization (ML-LPQ) features derived from normalized face images using an SVM classifier with a non-linear (RBF) kernel. As a result, their strategy outperformed other state-of-the-art approaches.
The local directional pattern (LDP) is a distinct texture description that provides a robust feature to characterize facial appearance. For instance, the authors of [19] introduced the LDP to explain a gender recognition facial picture by dividing the facial regions into tiny parts to collect LDP histograms and later combined them into a single feature vector from various sections. They used SVM to perform classification, which outperformed many traditional pattern classifiers in gender classification tasks. In addition, the authors of [20] provided experimental research that applied wavelet transform for gender categorization for the first time. It decomposed facial images using a 2-D discrete wavelet transform (DWT). Moreover, they incorporated fisher linear discriminant (FLD) and principal component analysis (PCA) to decompose coefficients for feature reduction and gender classification. They determined the accuracy rate of their approaches by employing 10-fold cross-validation methodology. As a result, they discovered that the nose on the face is the most distinguishing feature.
The authors of [21] introduced a unique technique based on the spectral angle mapper (SAM) that can efficiently gather spectral information across many spectral bands and classify it using the linear SVM. By measuring the photometric property of the acquired image, they investigated the feasibility of extended multi-spectral imaging for gender categorization. They also tested the approach on the extended multi-spectral face database, made using six different illuminations. A strategy based on the multi-scale facial fusion feature (MS3F) has been presented in [22] to predict gender from faces using SVM as a base classifier and local phase quantization (LPQ) and local binary pattern (LBP) as feature descriptors to extract the feature from facial images. In terms of accuracy, they compared their work to state-of-the-art approaches and attained a superior result. To develop an architecture that can be implemented on mobile devices, smart device developers presented a lightweight multi-task CNN (LMTCNN) architecture in [23] for simultaneous gender classification.
Besides these, [24] offered a single picture gender categorization technique that contained characteristics based on appearance and geometry, such as the LBP, discrete cosine transform (DCT), and the geometrical distance feature (GDF). They tested the approach on two datasets and produced extremely high accuracy in both. The authors of [25] developed a new method for classifying gender from face images by using the LBP as a binary quantization and GLCMs to extract the geometric structure of the faces. Furthermore, they utilized a histogram equalization technique to adjust the contrast of the input image, and SVM as the classifier for gender classification. As an outcome, the use of both LBP and GLCM features showed high classification performance. Similarly, [26] exploited an image processing and AI-based technique to perform age and gender determination using dental X-ray images. They pre-processed the tooth images followed by binary conversion using the M-1, M-2, and M-3 approaches. The dynamic structure divides the images into segments, extracts the features to have vectors, and then feeds them to a multi-layer neural network to determine age and gender.
Generally, people find gender classification to be an easy procedure; however, it is still a tough assignment for computers, especially when assessing a human face with a facial mask, as most of the main facial features such as the nose, chin, and mouth are not visible. Additionally, to the best of our knowledge, the challenge of detecting a person's gender while wearing a face mask has not been solved to date. As a result, this research offers a technique for gender classification using DL networks that determines gender not only from unmasked facial images but also when a person is wearing a mask.

Proposed Scheme and Dataset Details
In this study, the gender detection technique is divided into three phases: data preprocessing, model training, and gender classification. During the pre-processing step, approaches such as removing duplicate data, eliminating irrelevant data, removing outliers, and resizing images into dimensions of 299 × 299 are exploited. The data, which consist of images of people wearing face masks, are fed into several deep learning-based gender classifiers that have been pre-trained. These models include DenseNet121, DenseNet169, Xception, InceptionV3, ResNet50, ResNet101, VGG16, MobileNetV2, and EfficientNetB0 that are fine-tuned and trained to determine a person's gender based on facial images with and without face masks.

Dataset
The dataset is downloaded from a publicly available Kaggle repository [27]. The dataset used to train our gender detection models contains 40,000 images; after removing duplicates and unnecessary images, 11,536 images remain for four ways of wearing a mask: the mask is properly worn and covers the nose and mouth, the mask covers the mouth but not the nose, the mask is on but does not cover the nose or mouth, and there is no mask on the face (see Figure 2a). There are 8691 images for the three ways of wearing a mask, which include all the above-mentioned types except the type with no mask on the face (see Figure 2b). Each item comprises the following information: image size, image type, person's age, gender, and user ID. All photos were gathered using the crowdsourcing site Toloka.ai and confirmed by TrainingData.ru. We use the four-type and three-type image sets separately, each with a 70/30 ratio (see Table 1), for training the classifiers in order to precisely evaluate the accuracy of each model.
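As an illustration of the 70/30 ratio mentioned above, the following minimal sketch splits a list of items into two parts. The function name `split_70_30`, the random shuffling, and the seed are assumptions made for this example; the text does not state how the authors assigned images to each part.

```python
import random


def split_70_30(items, train_fraction=0.7, seed=42):
    """Shuffle a copy of `items` and split it into (train, test) lists.

    Hypothetical helper: the shuffling and the seed are assumptions,
    not details taken from the paper.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Image counts reported in the text for the two dataset variants.
for name, total in [("four types", 11536), ("three types", 8691)]:
    train, test = split_70_30(range(total))
    print(name, len(train), len(test))
```

Fixing the seed keeps the split reproducible across runs, which matters when several models are compared on the same test images.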


Exploratory Data Analysis and Pre-Processing
Before feeding data to DL models, the data samples need to be put in order for better performance. Thus, the study performs four pre-processing steps. The first step removes the duplicate images and filters out the data that have more than 4 images of a single person with the same type of mask wearing. In the second step, we removed unnecessary data: some records had a gender that was neither male nor female, indicated as NONE, so we used the pandas built-in query feature to filter them out. The third step removes outliers in some fields of the data samples, for instance, the size of images and the age of each person, by finding the median value of these fields and removing values that exceed the median. Figure 3 depicts the removal of outliers from the image sizes by finding the median value, which is 4 in this case, and removing those records from the dataset that exceed it. The last step resizes the images to a fixed size of 299 × 299 in order to feed them into the pre-trained models, which expect a fixed width and height.
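The first three pre-processing steps can be sketched as follows. This is a plain-Python illustration rather than the authors' pandas code: the record keys (`file`, `user_id`, `mask_type`, `gender`, `size`, `age`) are hypothetical, duplicates are assumed to be identified by file name, and the group filter is read literally as dropping any (person, mask type) group with more than 4 images. The resizing step is omitted because it needs an image library.

```python
from collections import Counter
from statistics import median


def preprocess(records, max_per_group=4):
    """Apply steps 1-3 of the described pipeline to a list of dict records."""
    # Step 1a: drop duplicates (assumption: same file name means same image).
    seen, unique = set(), []
    for rec in records:
        if rec["file"] not in seen:
            seen.add(rec["file"])
            unique.append(rec)

    # Step 1b: drop (person, mask type) groups with more than max_per_group images.
    counts = Counter((r["user_id"], r["mask_type"]) for r in unique)
    kept = [r for r in unique
            if counts[(r["user_id"], r["mask_type"])] <= max_per_group]

    # Step 2: keep only records labelled male or female (drops NONE).
    kept = [r for r in kept if r["gender"] in ("male", "female")]

    # Step 3: for each field, remove records whose value exceeds the median.
    for field in ("size", "age"):
        if kept:
            cutoff = median(r[field] for r in kept)
            kept = [r for r in kept if r[field] <= cutoff]
    return kept


sample = [
    {"file": "a.jpg", "user_id": 1, "mask_type": 1, "gender": "male", "size": 3, "age": 20},
    {"file": "a.jpg", "user_id": 1, "mask_type": 1, "gender": "male", "size": 3, "age": 20},
    {"file": "b.jpg", "user_id": 2, "mask_type": 1, "gender": "NONE", "size": 3, "age": 25},
    {"file": "c.jpg", "user_id": 3, "mask_type": 2, "gender": "female", "size": 4, "age": 20},
    {"file": "d.jpg", "user_id": 4, "mask_type": 1, "gender": "female", "size": 9, "age": 20},
]
print([r["file"] for r in preprocess(sample)])
```

Note that a median cutoff applied this way discards a large share of the records, which is consistent with the reduction from 40,000 to 11,536 images reported for the dataset.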


DenseNet
In a feed-forward fashion, the dense convolutional network (DenseNet) [28] links each layer to every other layer. Each layer generates a feature map that serves as an input to the subsequent layers. It is composed of two critical components: dense blocks and transition layers. Each DenseNet topology has four dense blocks, each having a different number of layers. DenseNet-121 contains four dense blocks with 6, 12, 24, and 16 layers, respectively, whereas DenseNet-169 has four dense blocks with 6, 12, 32, and 32 layers, respectively, and more than 20 million parameters. DenseNets address the vanishing-gradient issue, increase feature propagation, and improve feature reuse while using fewer parameters than typical CNNs, since they do not need to learn unnecessary feature mappings. Equation (1) shows the DenseNet layer output, where $[z_0, z_1, \ldots, z_{l-1}]$ is the concatenation of the feature maps generated by layers $0, 1, \ldots, l-1$ and $H_l$ denotes the transformation applied by the $l$th layer.

$$z_l = H_l([z_0, z_1, \ldots, z_{l-1}]) \quad (1)$$
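Because each layer receives the concatenation of all earlier feature maps, the number of input channels grows linearly inside a dense block. The sketch below illustrates this bookkeeping. The growth rate of 32 and the 64 initial channels are taken from the original DenseNet design and are assumptions here, since this paper does not state them; `dense_block_channels` is a hypothetical helper name.

```python
def dense_block_channels(in_channels, num_layers, growth_rate):
    """Return (input channels seen by each layer, channels leaving the block).

    Each layer adds `growth_rate` new feature maps, and every later layer
    receives the concatenation of all maps produced so far.
    """
    inputs = [in_channels + i * growth_rate for i in range(num_layers)]
    out_channels = in_channels + num_layers * growth_rate
    return inputs, out_channels


# First dense block of DenseNet-121 (6 layers), under the stated assumptions.
layer_inputs, block_out = dense_block_channels(64, 6, 32)
print(layer_inputs, block_out)
```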

ResNet
Residual networks (ResNets) are a classic type of neural network, utilized as a foundation for many computer vision applications. In 2015, the model won the ImageNet challenge [29]. This innovation empowered experts to effectively train incredibly deep DNNs having more than 150 layers. Prior to ResNet, building DNN models with a huge number of hidden layers was a challenging task due to vanishing gradients. However, ResNet established the notion of skip connections, which alleviated the problem of vanishing gradients by permitting the gradient to flow through an alternative shortcut path and allowing the model to learn an identity mapping that ensures the higher layer performs at least as well as the lower layer. Without the skip connection, the input, $y$, is multiplied by the layers' weights, $w$, followed by the addition of a bias term, $b$. The activation function, $f$, is then applied to obtain the resultant, $H(y)$, as in (2):

$$H(y) = f(wy + b) \quad (2)$$

Following the advent of the skip connection technique, however, the output is altered to (3):

$$H(y) = f(y) + y \quad (3)$$

Furthermore, when using a convolutional layer or pooling layers, the input dimension may differ from the output dimension. Padding the skip connection with zeros to expand its dimensions, or appending a 1 × 1 convolutional layer to the input to match the dimensions, can handle this problem. Thus, the resultant can be extended by adding $w_1$ as an extra parameter, as given in (4):

$$H(y) = f(y) + w_1 y \quad (4)$$

ResNet50 refers to the variant with 50 neural network layers, whereas ResNet101 refers to a CNN with 101 layers.
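The difference between a plain layer and a residual block can be made concrete with a small numerical sketch. It assumes ReLU as the activation and treats $f(y)$ in (3) and (4) as the output of the weighted branch; both are illustrative choices, and all function names are hypothetical.

```python
def relu(values):
    """Element-wise ReLU, used here as the activation f (an assumption)."""
    return [max(0.0, v) for v in values]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def plain_layer(y, w, b):
    """Eq. (2): H(y) = f(w y + b), without a skip connection."""
    return relu([z + bias for z, bias in zip(matvec(w, y), b)])


def residual_block(y, w, b, w1=None):
    """Eq. (3): H(y) = f(y) + y, or Eq. (4): H(y) = f(y) + w1 y if w1 is given."""
    branch = plain_layer(y, w, b)  # residual branch, written f(y) in (3)-(4)
    shortcut = y if w1 is None else matvec(w1, y)
    return [r + s for r, s in zip(branch, shortcut)]


y = [1.0, -2.0]
zero_w = [[0.0, 0.0], [0.0, 0.0]]
zero_b = [0.0, 0.0]
# With an all-zero branch the block reduces to the identity mapping.
print(residual_block(y, zero_w, zero_b))
```

When the branch contributes nothing, the block simply passes its input through, which is why adding such blocks should not make a deeper network worse than a shallower one.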

InceptionV3
InceptionV3 is a 48-layer DL model based on CNNs, used for image classification [30]. It is an improved version of the fundamental model InceptionV1, which was launched in 2014 as GoogLeNet. It is a widely used image recognition model that has shown promising results by achieving more than 78.1% accuracy on the ImageNet dataset. It is also intended to function well even under stringent memory and computational budget limitations. The inception layer is a mixture of 1 × 1, 3 × 3, and 5 × 5 convolutional layers, with their output filter banks concatenated into a unified output vector that serves as the following stage's input.

Xception
Xception is an enhancement of the Inception architecture that substitutes the ordinary Inception modules with depthwise separable convolutions [31]. Its feature extraction base consists of 71 convolutional layers. Except for the first and last modules, the convolutional layers in Xception are organized into modules surrounded by linear residual connections; thus, it is a stack of depthwise separable convolution layers with residual connections. As opposed to InceptionV2 or V3, which are significantly more complicated to specify, this makes the architecture relatively straightforward to define and adapt. Xception's overall architecture consists of three flows: entry, middle, and exit flow. The data first pass through the entry flow, then through the middle flow, which is repeated 8 times, and lastly through the exit flow. The limit of convolutions for a single kernel is set according to (5), while the limit for N kernels is set according to (6), where K is the resultant dimension after convolution, which depends on the padding employed, C denotes the number of channels, and d reflects the size of the convolution filter.
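The benefit of depthwise separable convolutions can be illustrated by counting multiplications. Equations (5) and (6) are not reproduced in this text, so the formulas below are the commonly used counts, expressed with the symbols K, d, C, and N defined above; treating them as what (5) and (6) express is an assumption, and the function names are hypothetical.

```python
def standard_conv_mults(K, d, C, N):
    """Multiplications for N standard d x d kernels over C channels,
    producing a K x K output map per kernel."""
    return K * K * d * d * C * N


def depthwise_separable_mults(K, d, C, N):
    """Multiplications for a depthwise d x d convolution over C channels
    followed by a pointwise 1 x 1 convolution with N kernels."""
    depthwise = K * K * d * d * C
    pointwise = K * K * C * N
    return depthwise + pointwise


K, d, C, N = 10, 3, 3, 64
std = standard_conv_mults(K, d, C, N)
sep = depthwise_separable_mults(K, d, C, N)
print(std, sep, sep / std)
```

Under these counts the ratio of separable to standard cost equals 1/N + 1/d², so the saving grows with the number of kernels and with the filter size.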

VGG16
VGG16 is a CNN-based architecture model that won the 2014 ILSVRC (ImageNet) competition [32]. It consists of a 16-layer DNN with around 138 million parameters employing 3 × 3 filters. Rather than a huge number of hyper-parameters, VGG16 concentrates on convolution layers with 3 × 3 filters and a stride of 1, and max-pooling layers with 2 × 2 filters and a stride of 2, using the same padding throughout. The overall architecture is composed of convolutional layers of different depths followed by three fully connected layers: the first two have 4096 channels each, the third has 1000 channels, and the final layer is the softmax layer.

MobileNetV2
MobileNetV2 is a 53-layer deep network that is particularly good at object recognition and segmentation. It improves the state-of-the-art performance of mobile models on a number of tasks and benchmarks, as well as across a range of model sizes [33]. Its foundation is an inverted residual structure with residual connections between bottleneck layers. It enables real-time classification under the processing restrictions of devices such as smartphones. This methodology allows the use of ImageNet transfer learning on our dataset. More detail about the overall architecture of MobileNetV2 can be found in [33]. The architecture is described as sequences of one or more identical (modulo stride) layers, each repeated n times. The number of output channels is the same for all layers in the same sequence. Each sequence's initial layer has a stride s, while the rest use a stride of 1. Moreover, 3 × 3 kernels are used in all spatial convolutions.

EfficientNetB0
EfficientNet is a scaling approach that uses a compound coefficient to uniformly scale the depth, width, and resolution dimensions [34]. The base EfficientNetB0 network is built on the inverted bottleneck residual blocks of MobileNetV2, as well as squeeze-and-excitation blocks, and has 237 layers made up of 5 modules. It significantly outperformed other convolutional networks; in fact, EfficientNetB7 obtains a new state-of-the-art 84.3% top-1 accuracy. In a nutshell, $\phi$ is a user-defined coefficient that controls the amount of additional resources available. The variables $\alpha$, $\beta$, and $\gamma$ define how these extra resources are distributed among the network's depth $d$, width $\omega$, and input resolution $r$, as shown in (7)-(9):

$$d = \alpha^{\phi} \quad (7)$$

$$\omega = \beta^{\phi} \quad (8)$$

$$r = \gamma^{\phi} \quad (9)$$

subject to $\alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2$ and $\alpha \geq 1$, $\beta \geq 1$, $\gamma \geq 1$.
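The compound scaling rule can be checked numerically with a short sketch. The default values α = 1.2, β = 1.1, and γ = 1.15 are those reported for the EfficientNet baseline in [34] and are not stated in this paper, so they are labeled here as an assumption; the function names are hypothetical.

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Return the (depth, width, resolution) multipliers for coefficient phi."""
    depth = alpha ** phi        # Eq. (7)
    width = beta ** phi         # Eq. (8)
    resolution = gamma ** phi   # Eq. (9)
    return depth, width, resolution


def constraint_value(alpha=1.2, beta=1.1, gamma=1.15):
    """Value of alpha * beta^2 * gamma^2, which should be close to 2."""
    return alpha * beta ** 2 * gamma ** 2


print(constraint_value())
for phi in range(3):
    print(phi, compound_scale(phi))
```

With the constraint close to 2, each unit increase in ϕ roughly doubles the computational cost of the scaled network.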
To conduct gender classification, we feed the final output tensor from the convolutional base into a dense layer to fine-tune these pre-trained models for training with the dataset at hand. Dense layers accept one-dimensional vectors as input, whereas the convolutional base produces a three-dimensional tensor as output. We therefore begin by flattening (or unrolling) the 3D output to 1D, and then, because our data are categorized into two classes, male and female, we add a final dense layer with two outputs and a softmax activation function. To train and evaluate the models, the initial dataset is separated into training and test sets with a 70/30 ratio. After successful training, the accuracy is computed using all images from the test dataset in each iteration.
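The flatten, dense, and softmax steps described above can be sketched in plain Python on a toy feature map. In Keras these steps would typically correspond to a `Flatten` layer followed by `Dense(2, activation="softmax")`; that mapping, the toy tensor shape, and the zero-initialized weights are assumptions for illustration, not the authors' code.

```python
import math


def flatten(feature_map):
    """Unroll a 3D tensor (height x width x channels) into a 1D list."""
    return [v for row in feature_map for cell in row for v in cell]


def dense(vector, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 2 x 2 x 3 output of a convolutional base.
feature_map = [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
               [[0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]]
vector = flatten(feature_map)               # length 12
weights = [[0.0] * len(vector) for _ in range(2)]  # two outputs: male, female
biases = [0.0, 0.0]
probabilities = softmax(dense(vector, weights, biases))
print(len(vector), probabilities)
```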

Results and Discussion
As this paper emphasizes a solution to gender classification with face masks, it is necessary to evaluate processing and classification performance. The training and testing of the models are executed on an AMD Ryzen 5 5600G processor with 64 GB of RAM. The software environment includes Jupyter Notebook with Keras packages and Python 3.6. In addition, the system contains an NVIDIA GeForce GTX 1050. The network was implemented with the TensorFlow framework, fine-tuning nine Keras applications using two different datasets, which contain four types and three types of facial images with and without face masks per person, respectively. To evaluate the models' accuracy, we tested each model with a dataset containing 40,000 images. Moreover, we determined the following metrics to analyze the effectiveness and performance of the exploited models.
• AUC-ROC: the term AUC refers to the area under the curve, which is a threshold- and scale-invariant measure that determines the rank correlation of predictions and targets.
• PRAUC: the mean of the precision scores computed for each recall threshold.
• Precision: the proportion of positive predictions that are correctly categorized.
• Recall: the proportion of all positive instances in the dataset that are predicted as positive.
• F1-score: the harmonic mean of precision and recall, integrating these into a single measure.

Table 2 presents the comparative performance analysis of the exploited state-of-the-art DL models for gender classification on the test dataset when trained with facial images of individuals wearing face masks in four different ways. Besides testing models trained on a dataset that also contains facial images without face masks, identical models are also trained with images that contain masked faces only (three ways of wearing masks). Table 3 compares the performance of each model for gender classification on the test dataset when trained with images of veiled faces.
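The precision, recall, and F1-score definitions listed above can be computed directly from predicted and true labels, as in the following minimal sketch. The label encoding (1 for the positive class) and the function name are assumptions for illustration; the paper does not state which gender is treated as the positive class.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1-score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
print(precision_recall_f1(y_true, y_pred))
```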
Among the accuracies obtained with the pre-trained CNN models for gender classification, the most effective models are evidently DenseNet121, Xception, EfficientNetB0, and InceptionV3, as these four show the highest accuracy and lowest loss among all models on both datasets. After training, we also computed accuracy and loss using all images from the test dataset in each iteration. Figures 4 and 5 show the training and validation accuracies as the number of training iterations increases, for both types of datasets (four and three types of mask images) and for each of the pre-trained models. To gain optimal performance, the experimental work used early stopping, so the models stop after 10 epochs.
Besides analyzing the trained models on the test dataset, we also tested them on unseen data. The top models (DenseNet121, Xception, EfficientNetB0, and InceptionV3) successfully classified the given unseen facial images (as shown in Figure 6); however, they misclassified a few unseen images in which the background was too complex or more facial features were hidden by a cap or scarf. The pictures show that the applied method gives satisfactory classification results.
Moreover, we also measured the computational time of each model for both datasets (4 ways and 3 ways of wearing the masks). Table 4 lists the computational time to train and fit the exploited state-of-the-art DL models for gender classification. Notably, DenseNet121 trained in less computational time than the other models on both datasets, while EfficientNetB0 was successful in training on the dataset with 3 ways of wearing the masks.
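The early stopping used during training and the way a per-model fitting time can be recorded may be sketched together as follows. This is a framework-agnostic illustration under stated assumptions, not the authors' training script: `run_epoch` is a hypothetical callback that trains for one epoch and returns the validation loss, and the `patience` and `min_delta` values are placeholders, since the monitored quantity and patience are not reported here. In Keras, the same behavior is typically obtained with the `EarlyStopping` callback.

```python
import time
from typing import Callable, Tuple


def fit_with_early_stopping(
    run_epoch: Callable[[int], float],
    max_epochs: int = 100,
    patience: int = 3,        # assumption: not reported in the paper
    min_delta: float = 0.0,   # assumption: not reported in the paper
) -> Tuple[int, float, float]:
    """Run epochs until the validation loss stops improving.

    Returns (epochs_run, best_val_loss, elapsed_seconds).
    """
    best_loss = float("inf")
    stale_epochs = 0
    epochs_run = 0
    start = time.perf_counter()  # wall-clock time for the whole fit
    for epoch in range(1, max_epochs + 1):
        val_loss = run_epoch(epoch)
        epochs_run = epoch
        if val_loss < best_loss - min_delta:
            best_loss = val_loss
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break  # no improvement for `patience` consecutive epochs
    elapsed = time.perf_counter() - start
    return epochs_run, best_loss, elapsed
```

Calling this once per network and per dataset and collecting the `elapsed` values would give a table of fitting times comparable in structure to Table 4, although absolute times depend on the hardware.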

Table 4. The computational time to fit the models for gender classification over two datasets (4 ways of wearing mask, and 3 ways of wearing mask).

Conclusions
One of the most essential biometric characteristics is the face. One can learn a lot about a person's ethnicity, gender, age, expression, identity, and more by evaluating their face. However, gender classification using facial images with masks is one of the most challenging classification tasks of the pandemic era due to its intricacy. To the best of our knowledge, no work based on deep learning and feature selection has been proposed to date to address this problem. Therefore, this study proposed a gender classification system that determines a person's gender (male/female) based on the face of the individual wearing a mask in a given image. The study analyzed and compared the performance of various state-of-the-art pre-trained deep learning networks for identifying gender using two strategies. In the first strategy, the models are trained using human facial images in which individuals wear the mask fully, half, or partially, or do not wear one at all (four ways of wearing). In the second strategy, the training phase includes only images with veiled faces (faces with masks in three different styles). The experimental results show that the models' performance is significantly reduced in the second strategy; however, the EfficientNetB0 model still performed well, retaining a classification accuracy above 97% in distinguishing male and female identity in both strategies. Therefore, many applications, including smart human-computer interfaces, can benefit from this gender classification approach. Nevertheless, the proposed scheme can be enhanced in the future for more accurate performance by integrating data augmentation techniques and various computer vision and deep feature extraction schemes.

Informed Consent Statement:
The dataset is publicly available and no image from it is reproduced or published in this study; thus, its use does not require informed consent. However, some external sample facial images of individuals (placed in this paper) were collected specifically for this study. Therefore, informed consent was obtained from all subjects involved in the study, and written informed consent was obtained from the individuals to publish this paper.