Gender Classification Using Proposed CNN-Based Model and Ant Colony Optimization

Abstract: Pedestrian gender classification is one of the key tasks of pedestrian analysis, and it finds practical applications in content-based image retrieval, population statistics, human–computer interaction, health care, multimedia retrieval systems, demographic collection, and visual surveillance. In this research work, gender classification was carried out using a deep learning approach. A new 64-layer architecture named 4-BSMAB, derived from deep AlexNet, is proposed. The proposed model was trained on the CIFAR-100 dataset utilizing a SoftMax classifier. Features were then obtained from the applied datasets with this pre-trained model, and the obtained feature set was optimized with the ant colony system (ACS) optimization technique. Various SVM and KNN classifiers were used to perform gender classification utilizing the optimized feature set. Comprehensive experimentation was performed on gender classification datasets, and the proposed model produced better results than the existing methods. The suggested model attained the highest accuracy, i.e., 85.4%, and 92% AUC on the MIT dataset, and the best classification results, i.e., 93% accuracy and 96% AUC, on the PKU-Reid dataset. The outcomes of extensive experiments carried out on existing standard pedestrian datasets demonstrate that the proposed framework outperforms existing pedestrian gender classification methods, and these acceptable results establish the proposed model as a robust one.


Introduction
In recent years, researchers' interest in visual surveillance applications has been growing due to the availability of low-cost optical and infrared cameras and advanced computing machines. Digital cameras are widely used nowadays and deployed on roads, in shopping malls, metro lines and train stations, airports, and residential areas. With digital cameras, pedestrian images are captured under a specific field of view (FoV) in controlled environments [1]. These days, object recognition from images and videos captured by digital cameras is preferred for automated tasks related to security monitoring, public safety [2], pedestrian behavior analysis, etc. Different approaches for video object detection based on deep learning were studied in [3]. Pattern classification from images was also carried out in [4][5][6]. Usually, the movement of different types of objects such as pedestrians takes place in images or video frames. Since pedestrians move in public areas for different purposes such as shopping, going to work, or going to school, they are a very important real-life object, and pedestrian-relevant tasks such as pedestrian gender classification (PGC) have attracted considerable research attention. The main contributions of this work are summarized as follows:

•	A new architecture based on 64 layers named 4-BSMAB is proposed to obtain features from images. Due to the non-availability of larger datasets, the training of the proposed model is carried out on the CIFAR-100 dataset, and the trained model is then utilized to extract features from the testing datasets.
•	The feature optimization approach (ACS) is applied to reduce the dimensionality of the extracted features.
•	Various classifiers are tested for PGC, and the most successful classifier is benchmarked. The classification accuracy achieved with the proposed model shows that the proposed framework is acceptable.
The remaining sections of this manuscript are organized as follows: the Introduction describes the problem domain, and the next section reviews the related literature. Section 3 describes the proposed framework. Section 4 presents the results and discussion. Finally, the conclusions of this research work are drawn at the end of the manuscript.

Related Work
In this section, a summary of relevant existing techniques used for gender classification is presented. The following approaches have been proposed for view-based PGC in the relevant literature.

Traditional/Hand-Crafted Feature-Based Approaches
In this section, a summary of methods that use hand-crafted features for gender classification is highlighted. These approaches use low-level information (features related to shape, color, texture, etc.). For instance, Cao et al. [42] proposed an algorithm named part-based gender recognition (PBGR) utilizing fixed frontal or back views of gender full-body appearance to obtain edge map-based shape information, HOGs, and raw information. They achieved 76.0%, 74.6%, and 75.0% accuracy on front views, back views, and non-fixed views, respectively. Furthermore, Guo et al. [43] utilized front views, back views, and mixed views to investigate biologically inspired features (BIF) from the human body to handle pose variations with a support vector machine (SVM). For manifold learning, unsupervised principal component analysis (PCA), supervised orthogonal locality preserving projections (OLPP), marginal Fisher analysis (MFA), and locality-sensitive discriminant analysis (LSDA) were utilized. They achieved 79.5%, 84.0%, and 79.2% accuracy on the frontal view with BIF+LSDA, back view with BIF+LSDA, and mixed views with BIF+PCA, respectively, on the MIT dataset. Collins et al. [44] extracted features related to spatial pyramid HOGs (PHOGs), local HSV (LHSV) color histograms, spatial pyramid bag of words, etc., and used mixed views from static full-body images to investigate image representations. They obtained 72.2%, 76.0%, and 80.6% overall accuracy on uncropped MIT, cropped MIT, and uncropped VIPeR dataset images, respectively. In addition to the above, Geelen et al. [45] first obtained hand-crafted features such as shape, color, and texture from full-body view-based images. Then, a combination of these features was used to perform experiments on the MIT CBCL dataset and Datasets A and B for gender classification using SVM and random forest (RF) kernels. They obtained 81.6%, 82.7%, and 80.9% overall accuracy on front views, back views, and mixed views, respectively.
They also achieved 79.0%, 79.3%, and 76.6% mean accuracy on front views, back views, and mixed views on the MIT dataset. Although the above gender classification techniques show that hand-crafted features (low-level feature representations) provide significant resistance against illumination and pose issues, obtaining distinct features from pedestrian full-body views with complex appearances remains a challenging issue. Therefore, further investigation of pedestrian full-body views is required to obtain more definite and optimal information for gender classification.

Deep Learning-Based Approaches
To cope with the problems raised by the traditional hand-crafted feature-based gender classification techniques discussed above such as pedestrians' diverse appearances and captured images having a low resolution, deep CNN models have been proposed and are considered more appropriate [46,47]. The CNN architecture is popular because of its significant advances in the accuracy obtained in different classification studies [48][49][50]. Currently, trained deep CNN models have been used in a few existing methods for gender prediction. For instance, Ng et al. [16] utilized a CNN model comprising seven layers for issues related to the domain of gender classification. The training of CNN model was carried out on MIT pedestrian dataset for the prediction of gender classification. Overall accuracies of 80.4% and 79.2% were obtained on both front and rear views with a view classifier and without a view classifier, respectively. The proposed approach performed successfully on homogeneous datasets of a small size. Antipov et al. [17] applied mini-CNN and AlexNet-CNN to learn features and compare them with hand-crafted features (HOG) to solve the issue of image feature selection. They found MAP values of 0.80 and 0.85 and AUC values of 0.88 and 0.91 on familiar datasets, while they found MAP values of 0.75 and 0.79 and AUC values of 0.80 and 0.85 on unfamiliar datasets using mini-CNN and AlexNet-CNN. The results showed that the learned features significantly outperformed hand-crafted features for heterogeneous datasets. Ng et al. [18] utilized grayscale, RGB, and YUV color spaces on pedestrians' full-body images to represent the image for gender prediction with a deep CNN network, which produced significant results on MIT dataset containing pedestrians' front and rear views. An average accuracy of 81.47% was attained on frontal and rear views with the grayscale color space. Ng et al. 
[19] further utilized labeled low-level training data and a CNN to introduce a strategy for training. Filters were learned with k-means clustering (unsupervised learning), whereas supervised learning performed pre-training on MIT dataset (front and back views). The training strategy generally performed better than random weight initialization. Raza et al. [20] used appearances from the complete and upper portion of body and deep CNN model for the analysis of pedestrian gender. The existing mechanism for pedestrian parsing was applied to parse both full-body as well as upper-half body-based pedestrian objects in CNN model, after which the SoftMax classifier was applied. The authors achieved 82.1%, 81.3%, and 82.0% overall accuracy and 81.1%, 81.7%, and 80.7% mean accuracy on front views, back views, and mixed views with full-body appearances. They also obtained 83.3%, 82.3%, and 82.8% overall accuracy and 80.5%, 82.3%, and 81.4% mean accuracy on front views, back views, and mixed views with upper-body appearances. Furthermore, Raza et al. [51] also used a deep learning approach and a stacked sparse autoencoder (SSAE) to classify gender. The deep neural network method parsed pedestrian images to remove background, and a two-layer SSAE with SoftMax classifier predicted gender as male or female. The researchers achieved 82.9%, 81.8%, and 82.4% accuracy on front views, back views, and mixed views, respectively, on MIT dataset. They also attained a 91.5% AUC mean value on PETA dataset. Cai et al. [52] investigated and obtained deep features and low-level (HOG) features simultaneously from images by a deep CNN model named as deep-learned and hand-crafted features fusion network (DHFFN) using PCA. After extracting features, fusion is applied to mix both features for exploring their full merits. 
Experiments on numerous public datasets such as MIT, VIPeR, GRID, PRID, and CUHK were performed, and DHFFN produced 0.95 MAP and 0.95 AUC and was declared a better performer than the state-of-the-art gender prediction methods. Cai et al. [40] further introduced HOG-assisted deep feature learning (HDFL), a novel method that uses a deep CNN to cater for common challenges such as viewpoint variations, occlusion, and poor image quality faced while predicting gender. HDFL efficiently extracted deep-learned features as well as HOG features simultaneously from the pedestrian picture. A feature fusion process is then applied to extract more discriminative features to provide to the SoftMax classifier for gender prediction. The proposed HDFL achieved 0.93 MAP and 0.94 AUC without local response normalization (LRN) and 0.94 MAP and 0.95 AUC with LRN. In earlier gender classification studies, CNN architectures considered only whole-body images, i.e., global information, but Ng et al. [53] took global as well as local information from full-body images and introduced a novel parts-based framework that uses a combination of local and global information for PGC. A local and global CNN method was trained on both whole-body images and identified body regions for feature learning and classification. When comparing the accuracy obtained by utilizing different body regions such as the upper, middle, and lower regions in experiments on the MIT and APiS datasets, the upper-half body region played a more important role in gender classification than the middle or lower half of the body. The authors achieved 84.4%, 88.9%, and 86.8% accuracy utilizing a combination of the MIT and APiS datasets on frontal, non-frontal, and mixed views, respectively. Fayyaz et al.
[41] proposed a hybrid approach that produces a combination of low-level information and deep features of pedestrian images and computes information effectively, assisted by a joint feature representation (JFR) scheme, for better gender classification. Extensive experiments were performed by adopting different classifiers such as SVM, discriminant classifiers, and k-nearest neighbor (KNN) to observe the contributions of low-level (HOG and LOMO) and deep features to the design of the JFR; in this way, the proposed approach achieved 96% AUC and 89.3% accuracy on the PETA dataset, and 86% AUC and 82% accuracy on the MIT dataset. A study was also conducted by Cai et al. [39] in which the cascading scene and viewpoint feature learning (CSVFL) method improved pedestrian gender recognition. In CSVFL, two crucial challenges, namely, scene and viewpoint variations in pedestrian gender recognition, were jointly considered. The authors demonstrated that CSVFL was able to resist both variations (scene and viewpoint) at the same time. The results generated by CSVFL were also compared with various recent relevant works, and an excellent performance was observed. They obtained 84.4%, 85.9%, and 85.2% accuracy utilizing the MIT dataset on frontal, back, and mixed views, respectively. They also achieved 81.9%, 84.7%, 72.1%, and 80.1% accuracy utilizing the VIPeR dataset on frontal, back, side, and mixed views. Further, 92.4%, 94.6%, 88.2%, and 92.7% accuracy was obtained using the PETA dataset on frontal, back, side, and mixed views, respectively.
It can be observed from the results of the above-mentioned fine-tuned models that these models are robust, but the imbalanced distribution of data remains a challenge when computing class-wise accuracy. The above discussion also reflects the fact that full-body view-based pedestrian images are widely investigated for gender classification. The existing techniques employed both large-scale and small-scale datasets in their experiments. It has been observed that a fusion approach generates a compact representation of gender images for classification. Moreover, a small dataset size is an issue for model learning in deep learning-based approaches.

Material and Methods
This section presents the proposed model 4-BSMAB and its major steps for PGC. These steps include pre-training of the proposed model, dataset balancing, feature extraction with the 4-BSMAB model, ACS-based feature optimization, and, at the end, classification. An overview of this model is presented in Figure 1. These steps are elaborated in the upcoming sections.

4-BSMAB
A new architecture, 4-BSMAB (4-branch subnets with modified AlexNet backbone), based on the CNN architecture is introduced in this work for PGC. This newly developed model is derived from the CNN network AlexNet [54]. AlexNet contains 25 layers, including 5 convolutional layers, 3 fully connected layers, 3 pooling layers, 7 rectified linear unit (ReLU) layers, 2 dropout layers, and a SoftMax layer, and it is divided into 3 repeating blocks, named here as R1, R2, and R3. The new model contains 64 layers, including the input and output layers. The architectural view of the proposed model 4-BSMAB is presented in Figure 2, and the details of the layers are listed in Table 1. The existing network, i.e., AlexNet, was altered by adding layers to obtain the new model, 4-BSMAB. A batch normalization (BN) layer is included at the end of blocks R1 and R2. Branched sub-networks are also added along with the R blocks. BN1_1, called the first sub-network, has three branches: the first branch has a BN layer; the second branch has the C2, BN1, C3, and LR1 layers; and the last branch contains the C9, BN8, and LR4 layers. A fusion process is then applied on the three branches with an addition (ADD) layer. The other sub-network (BN_2) has only two branches; the difference between BN_2 and BN_1 is that BN_1 has only the group BN layer. Two sub-networks, BN_1 and BN_3, are incorporated at the end of the ReLU layer in block R1. Two activation functions, ReLU and Leaky ReLU, are used at the ReLU layers. ReLU functions are simple and fast, which helps speed up the training phase and thereby improves neural networks. ReLU functions are easy to compute and do not suffer from vanishing gradients. Because they are implemented with a simple procedure, they are well suited to GPUs, which excel at matrix operations. Leaky ReLU prevents the "dying ReLU" problem.
This variant of ReLU produces a small non-zero slope in the negative region; therefore, back-propagation remains possible even for negative input values, although Leaky ReLU does not provide steady predictions for negative inputs. Two further sub-networks, BN6_1 and BN8_1, are incorporated at the end of block R2.
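As a minimal illustration of the two activation functions described above, the following NumPy sketch contrasts ReLU with Leaky ReLU (the 0.01 negative slope follows the text; the function names are ours, not the authors' code):

```python
import numpy as np

def relu(u):
    # Standard ReLU: values below zero are clipped to zero.
    return np.maximum(0.0, u)

def leaky_relu(u, slope=0.01):
    # Leaky ReLU: negative values keep a small slope instead of "dying".
    return np.where(u >= 0, u, slope * u)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))        # negatives become 0, positives pass through
print(leaky_relu(x))  # negatives are scaled by 0.01 instead
```

Because the negative branch still has a non-zero gradient, back-propagation can update weights even when inputs are negative, as noted above.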
The layer details of the proposed 4-BSMAB are discussed in the upcoming paragraphs. The convolution of the input vector $I_{j-1}$ is carried out using a filter bank in the C_1 layer. The convolution operation, represented by $*$, is described as

$I_j^{p_j} = M\big(N\big(H * I_{j-1}^{p_{j-1}}\big)\big)$ (1)

where $p_{j-1}$ represents the number of input channels, $p_j$ represents the number of output channels, $j$ indicates the layer number [55], $H$ denotes a filter of depth $p_j$, and both symbols $M$ and $N$ indicate nonlinear functions. Group convolution (GC) layers are also added in the 4-BSMAB model. A GC layer is a combination of many convolutional layers: the filters are divided into several groups, and each group carries out its own collection of 2D convolutions. This enables training across clusters of GPUs that have a low memory capacity. The pooling layers take the mathematical form

$I_{w,x}^{p,j} = \max_{m=1\ldots s,\; n=1\ldots t} I_{(w+m),(x+n)}^{p,j-1}$ (2)

where $w, x$ index the matrix of the image $I^{p,j-1}$, and $m, n$ index the selected pooling window. Both norm and batch normalization (BN) layers are utilized in this scheme. BN [56] is a procedure that adjusts the neurons of a channel over a small batch: it determines the mean and variance of the batch and then scales the features by the standard deviation. The mean of the batch $B = \{I_1, \ldots, I_w\}$ is calculated as

$\mu_B = \frac{1}{w} \sum_{i=1}^{w} I_i$ (3)

where $w$ is the number of feature maps per batch. The variance is described per (small) batch as

$\sigma_B^2 = \frac{1}{w} \sum_{i=1}^{w} (I_i - \mu_B)^2$ (4)

The following expression is then used for feature normalization:

$\hat{I}_i = \frac{I_i - \mu_B}{\sqrt{\sigma_B^2 + D}}$ (5)

where $D$ represents a small constant (consistency) value. The norm layer is used for simplification; it scales pixels by the maximum factor over local prior layers and boosts the spatial-visual quality. The norm equation is

$Z = \frac{I}{\left(\kappa + \frac{\alpha}{N}\,\Im\right)^{\beta}}$ (6)

where $Z$ represents the feature map obtained at the end of the norm layer, $\Im$ denotes the "sum of squares" over the local channel window, $N$ denotes the size of the channel window, and $\kappa$, $\alpha$, and $\beta$ represent the normalization criteria. The 4-BSMAB model utilizes both ReLU (R) and Leaky ReLU (LR). The standard R transforms numbers less than zero to zero and is given in [57] as

$R(u) = \max(0, u)$ (7)

For values below 0, LR has a small slope instead of becoming 0: $LR(u) = 0.01u$ when $u$ is negative. Some other works [58–60] are also available from which CNN in-depth learning can be studied.

Pre-Training of Proposed Model and Feature Extraction
The proposed model 4-BSMAB extracts features from the pipeline deeply trained by CNN. The training of proposed model is carried out on a dataset named CIFAR100 [61] having 100 classes of images. In this repository, each class of images is divided into 500 images for model learning and 100 images for model validation. For pre-training, learning images, as well as validation images, are mixed such that each class contains 600 images. The resultant dataset after mixing both types of images of each class is provided to the proposed CNN model for training purposes. The finally trained network is then applied to extract features on pedestrian attribute recognition datasets [41], and FC_1 layer is selected to obtain features. A total of 2048 features are extracted from each image from this layer. This produces a feature set having 418 × 2048, 470 × 2048, and 1728 × 2048 dimensions for frontal views, back views, and mixed views, respectively, for MIT dataset. This also produces a feature set with 590 × 2048, 331 × 2048, and 1264 × 2048 dimensions for frontal views, back views, and mixed views for VIPeR dataset, and feature set dimensions of 684 × 2048, 228 × 2048, and 1640 × 2048 for frontal views, back views, and mixed views for PKU-Reid dataset. Some intermediate visualizations of features captured at various stages of convolution performed by the proposed model 4-BSMAB are shown in Figure 3.
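As a hedged sketch of this extraction step, the following NumPy code shows how a fully connected layer maps each image's activations to a 2048-dimensional vector, so that, e.g., the 418 MIT frontal-view images yield a 418 × 2048 feature set. The weights and the flattened activation size here are random placeholders, not the trained FC_1 layer:

```python
import numpy as np

def extract_fc_features(activations, fc_weights, fc_bias):
    """Hypothetical FC_1-style extraction: each flattened activation
    vector is mapped to a 2048-dimensional feature vector."""
    return activations @ fc_weights + fc_bias  # (N, d) @ (d, 2048) -> (N, 2048)

rng = np.random.default_rng(0)
n_images, d, n_features = 418, 1024, 2048  # e.g., MIT frontal views
acts = rng.standard_normal((n_images, d))  # placeholder activations
W = rng.standard_normal((d, n_features)) * 0.01  # placeholder weights
b = np.zeros(n_features)

feats = extract_fc_features(acts, W, b)
print(feats.shape)  # (418, 2048)
```

Stacking one such row per image reproduces the feature-set dimensions quoted above (418 × 2048, 470 × 2048, 1728 × 2048, etc.).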

Dataset Balancing
The MIT dataset contains a total of 888 images, o 288 are female images. The number of male and fem therefore, this dataset is an imbalanced dataset, le problems: (1) class imbalance problem, which results sample space problem, which affects the training of enhance the size of dataset and balancing of class-w and horizontal flipping functions are applied. As a re is usually applied to score the features. Mathematically, entropy has the following form:

Feature Selection Based on ACS Optimization
The entropy operation [62] is used to code the obtained features. Entropy E is usually applied to score the features. Mathematically, entropy has the following form:

E(f_1, ..., f_n) = − Σ P(f_1, ..., f_n) log2 P(f_1, ..., f_n)

where f_1, ..., f_n denote the features, treated as random variables, and P(f_1, ..., f_n) is their probability. ACS is a learning-based approach used for feature optimization; when combined with entropy-based feature selection, it becomes an embedded approach. The obtained entropy-coded scores are provided to ACS for feature optimization. ACS is modeled on ants' activities and movements [63]. Ants move between places and deposit a material called "pheromone", whose strength decreases gradually over time. Ants choose a way by calculating the probability associated with the pheromone, which helps them select the least expensive path. The ants' movement between places is therefore analogous to movement between the vertices of a graph: a vertex indicates a feature, and an edge from one vertex to another indicates the selection of a feature. The strategy repeats to find the best features and stops when a minimum number of vertices has been traversed and a set criterion is satisfied. The linking arrangement of the vertices is similar to a mesh. An ant selects feature j on a probability basis at a given point at time t, written as

p_j(t) = [τ_j(t)]^α [η_j]^β / Σ_u [τ_u(t)]^α [η_u]^β

where the entropy-based feature scores enter through the heuristic term, τ_j(t) is the pheromone value attached to the j-th feature at time t, η_j represents the rational (heuristic) knowledge of the feature's cost, α and β are empirical weighting values, and the sum runs over the features u still available. It is important to mention that a solution is considered incomplete until all required features have been examined.
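The two steps above, entropy scoring followed by pheromone-guided probabilistic selection, can be sketched as follows. This is a simplified single-ant pass under assumed settings (histogram-based entropy per feature, uniform initial pheromone, and the exponents alpha and beta), not the paper's exact update rules.

```python
import numpy as np

def entropy_scores(F, bins=16):
    # Score each feature (column of F) by the Shannon entropy of its
    # histogram; a higher-entropy feature carries more information.
    scores = []
    for col in F.T:
        p, _ = np.histogram(col, bins=bins)
        p = p / p.sum()
        p = p[p > 0]
        scores.append(-(p * np.log2(p)).sum())
    return np.array(scores)

def acs_select(scores, k, alpha=1.0, beta=2.0, seed=0):
    # Simplified one-ant ACS pass: pheromone tau starts uniform and the
    # entropy score plays the role of the heuristic knowledge eta_j.
    rng = np.random.default_rng(seed)
    tau = np.ones_like(scores)
    available = list(range(len(scores)))
    chosen = []
    for _ in range(k):
        w = (tau[available] ** alpha) * (scores[available] ** beta)
        p = w / w.sum()                      # transition probabilities p_j(t)
        j = rng.choice(len(available), p=p)  # pick the next vertex/feature
        chosen.append(available.pop(j))
    return sorted(chosen)

F = np.random.default_rng(1).standard_normal((100, 20))
subset = acs_select(entropy_scores(F), k=5)
print(len(subset))  # 5
```

A full ACS would also evaporate and reinforce the pheromone across iterations and multiple ants; this sketch shows only how the probability rule turns entropy scores into a feature subset.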

Dataset Balancing
The MIT dataset contains a total of 888 images, out of which 600 are male images and 288 are female images. The numbers of male and female images are not equal; therefore, the MIT dataset is imbalanced, which leads to the following two research problems: (1) a class imbalance problem, which results in poor performance, and (2) a small sample space problem, which affects the training of the model. Data balancing is applied to enlarge the dataset and balance the class-wise data. For this purpose, mirroring and horizontal flipping functions are applied. As a result, 264 male images and 576 female images are added, for totals of 864 male and 864 female images. In this way, the size of the MIT dataset is increased and the class-wise data are balanced.
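A minimal sketch of this balancing step is below, assuming images are H × W arrays. The target of 864 per class follows the counts above; re-flipping earlier copies once the originals run out is a simplification of the mirroring-plus-flipping scheme, which a real pipeline would replace with a second transform.

```python
import numpy as np

def augment_to(images, target):
    # Grow a class to `target` images by appending horizontally
    # flipped (mirrored along the width axis) copies. Once every
    # original has been flipped, later passes re-flip earlier copies;
    # a real pipeline would switch to a second transform instead.
    out = list(images)
    i = 0
    while len(out) < target:
        out.append(out[i][:, ::-1])
        i += 1
    return out

rng = np.random.default_rng(0)
males = [rng.standard_normal((16, 8)) for _ in range(600)]    # 600 male images
females = [rng.standard_normal((16, 8)) for _ in range(288)]  # 288 female images
# Balance both classes at 864 images, as in the MIT counts above.
males_b, females_b = augment_to(males, 864), augment_to(females, 864)
print(len(males_b), len(females_b))  # 864 864
```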

Classification
After feature selection, the selected features are provided to SVM [64] and KNN [65] classifiers to perform classification. The SVM classifiers include the linear variant (LSVM) [66], quadratic variant (QSVM) [67], fine Gaussian variant (FGSVM) [68], medium Gaussian variant (MGSVM), coarse Gaussian variant (CGSVM), and cubic variant (CSVM) [69]; details of the kernels of these SVM classifiers can be found in [70][71][72]. The classifiers chosen from KNN include the coarse variant (CRKNN), fine variant (FKNN), and cosine variant (COKNN) [73]; details of these variants are available in [73][74][75][76]. The classifiers are evaluated on various performance evaluation metrics. In the experiments, the CSVM, QSVM, and FKNN classifiers produced the best results: FKNN produced the highest accuracy on the MIT dataset, and CSVM was the best classifier on the PKU-Reid dataset. The details of the experiments performed and the results produced are presented in the Results section.
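With scikit-learn, the classifier bank and evaluation loop can be sketched as below. The kernel and neighbor settings are our approximate mapping of the named variants (LSVM, QSVM, CSVM, MGSVM, FKNN, COKNN), not settings taken from the paper, and the data are synthetic stand-ins for an optimized feature subset.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for an optimized N x k feature subset with
# binary gender labels (the first feature carries most of the signal).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = (X[:, 0] + 0.5 * rng.standard_normal(200) > 0).astype(int)

# Approximate counterparts of the variants named above: SVMs differ in
# kernel/degree, KNNs in neighbor count and distance metric.
classifiers = {
    "LSVM": SVC(kernel="linear"),
    "QSVM": SVC(kernel="poly", degree=2),
    "CSVM": SVC(kernel="poly", degree=3),
    "MGSVM": SVC(kernel="rbf"),
    "FKNN": KNeighborsClassifier(n_neighbors=1),
    "COKNN": KNeighborsClassifier(n_neighbors=10, metric="cosine"),
}

# Five-fold cross-validated accuracy, matching the evaluation protocol.
accs = {name: cross_val_score(clf, X, y, cv=5).mean()
        for name, clf in classifiers.items()}
for name, acc in accs.items():
    print(f"{name}: {acc:.3f}")
```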

Results and Discussion
This research aimed to introduce a novel deep-learning-based CNN to classify pedestrian image datasets. A robust feature set was extracted with the proposed 4-BSMAB CNN-based network, and various SVM and KNN classifiers were then applied to the optimized feature sets to evaluate the performance of the system. The analysis and outcomes of the proposed framework are presented in this section. The first part details the experimental setup along with the datasets used and the evaluation protocols applied; the second part explains the experiments performed. All experiments were carried out on a Core i5 machine running Windows 10, with 8 GB of memory and an NVIDIA GTX 1070 GPU with 8 GB of onboard RAM. The MATLAB R2020a tool was selected for programming purposes.

Datasets
Challenging datasets including viewpoint invariant pedestrian recognition (VIPeR) [77], pedestrian attribute (PETA) [78], cross-dataset [40], MIT [42], and Peking University re-identification (PKU-Reid) [79] were selected to test the proposed approach. Table 2 shows the details of these selected testing datasets, which are publicly available on the internet for experimental and research work. Different pedestrian-analysis tasks such as person attribute analysis and PGC have been performed on these datasets. The challenges present in these datasets include inter- and intraclass variations (IICV) and environment recording settings (ERS). IICV comprise the speed and style of pedestrian movement, while ERS include pose variations, illumination changes, viewpoint changes, recording rates, camera settings, complex backgrounds, object deformation, shadow, and occlusion. Table 3 shows the view-based information of the testing datasets on which the evaluation of the proposed model was carried out. Table 3. View-based samples in the MIT, VIPeR, PKU-Reid, and PETA datasets used for testing of the proposed model. In the MIT dataset, 305 male and 113 female images were selected for the front-view-based evaluation, 296 male and 174 female images for the back-view-based evaluation, and 864 male and 864 female images for the mixed-view-based evaluation. The VIPeR dataset contains 339 male and 251 female images for front views, 198 male and 133 female images for back views, and 721 male and 543 female images for mixed views. In the PKU-Reid dataset, front views include 420 male and 264 female images, back views include 140 male and 88 female images, and mixed views include 1120 male and 520 female images. Figure 4 shows sample male and female images taken from these datasets.

Performance Evaluation Protocols
The evaluation of PGC problems directly relates to different accuracies and AUC. In this work, commonly used performance evaluation metrics, i.e., accuracy (ACC), receiver operating characteristic (ROC) curve, F-measure (FM), G-measure (GM), area under the curve (AUC), true positive rate (TPR), and false positive rate (FPR), were selected to measure the performance of different PGC methods. Table 4 shows these metrics with their mathematical representations. Five-fold cross-validation was adopted for training and testing.
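Given confusion-matrix counts, the metrics of Table 4 can be computed as follows. Note that G-measure is taken here as the geometric mean of precision and TPR, which is one common definition and may differ from the paper's exact formula.

```python
def pgc_metrics(tp, fp, tn, fn):
    # Standard metrics used in the PGC evaluation protocol, computed
    # from confusion-matrix counts.
    acc = (tp + tn) / (tp + fp + tn + fn)
    tpr = tp / (tp + fn)                           # true positive rate (recall)
    fpr = fp / (fp + tn)                           # false positive rate
    precision = tp / (tp + fp)
    fm = 2 * precision * tpr / (precision + tpr)   # F-measure
    gm = (precision * tpr) ** 0.5                  # G-measure (one common form)
    return {"ACC": acc, "TPR": tpr, "FPR": fpr, "FM": fm, "GM": gm}

# Hypothetical counts for illustration only.
m = pgc_metrics(tp=80, fp=10, tn=90, fn=20)
print(round(m["ACC"], 3))  # 0.85
```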

Performance Evaluation of Proposed Framework
Experiments were performed with the proposed framework on the MIT, VIPeR, and PKU-Reid testing datasets. Various experiments were carried out with several variations of optimized feature subsets, and a major analysis of these experiments is presented in this section. Table 5 summarizes the experiments performed and the accuracies produced with five feature subsets on the front, back, and mixed views of the selected datasets. Table 5. Optimized feature subsets with dimensions and best accuracy on MIT, VIPeR, and PKU-Reid datasets.

The fitness value graphs obtained by ACS on the mixed views of the MIT dataset are depicted in Figure 5. With the 1000-feature subset, the fitness value stabilized at 0.29 after the 49th iteration.

Performance Evaluation of MIT Dataset
In this section, the results generated by experiments performed using the front-view, back-view, and mixed-view images of the MIT testing dataset are presented. Five-fold cross-validation was applied to all feature matrices obtained from the frontal, back, and mixed views of the MIT dataset, which were provided to the KNN and SVM classifier variants for automatic labeling. The details of the evaluation of the proposed model with five different feature subsets on the MIT testing dataset are presented below.
- Evaluation of frontal views of MIT dataset: the best accuracy achieved was 74.9%, by LSVM with the 1000-feature subset, CSVM with the 750-feature subset, and QSVM with the 1000-feature subset, while the second best accuracy was 74.7%, by LSVM with the 500-feature subset, as shown in Table 6. The training time and prediction speed of the proposed model on the front views of the MIT dataset are presented in Figure 6.
- Evaluation of back views of MIT dataset: the best accuracy achieved was 73.8%, by the CSVM classifier with the 1000-feature subset and QSVM with the 750-feature subset, while the second best accuracy was 73.4%, by QSVM with the 250-feature subset, as shown in Table 7. The training time and prediction speed of the proposed model on the back views of the MIT dataset are presented in Figure 7.
- Evaluation of mixed views of MIT dataset: the best accuracy achieved was 85.3%, by FKNN with the 1000-feature subset, while the second best accuracy was 85.1%, by FKNN with the 750-feature subset, as shown in Table 8. The training time and prediction speed of the proposed model on the mixed views of the MIT dataset are presented in Figure 8.
The best ROC outcomes on the MIT dataset are presented in Figure 9.

Performance Evaluation of VIPeR Dataset
In this section, the results generated by experiments performed using the front-view, back-view, and mixed-view images of the VIPeR testing dataset are presented. Five-fold cross-validation was applied to all feature matrices obtained from the frontal, back, and mixed views of the VIPeR dataset, which were provided to the KNN and SVM classifier variants for automatic labeling. The details of the evaluation of the proposed model with five different feature subsets on the VIPeR testing dataset are presented below.
- Evaluation of frontal views of VIPeR dataset: the best accuracy achieved was 72.9%, by QSVM with the 1000-feature subset, while the second best accuracy was 70.7%, by CSVM with the 1000-feature subset, as shown in Table 9. The training time and prediction speed of the proposed model on the front views of the VIPeR dataset are presented in Figure 10.
- Evaluation of back views of VIPeR dataset: the best accuracy achieved was 72.5%, by QSVM with the 750-feature subset, while the second best accuracy was 70.7%, by LSVM with the 1000-feature subset, as shown in Table 10. The training time and prediction speed of the proposed model on the back views of the VIPeR dataset are presented in Figure 11.
- Evaluation of mixed views of VIPeR dataset: the best accuracy achieved was 70.3%, by CSVM with the 1000-feature subset, while the second best accuracy was 69.5%, by LSVM with the 750-feature subset, as shown in Table 11. The training time and prediction speed of the proposed model on the mixed views of the VIPeR dataset are presented in Figure 12.
The best ROC outcomes on the VIPeR dataset are presented in Figure 13.

Performance Evaluation of PKU-Reid Dataset
This section describes the results generated by experiments performed using the front-view, back-view, and mixed-view images of the PKU-Reid testing dataset. Five-fold cross-validation was applied to all feature matrices obtained from the frontal, back, and mixed views of the PKU-Reid dataset, which were provided to the KNN and SVM classifier variants for automatic labeling. The details of the evaluation of the proposed model with five different feature subsets on the PKU-Reid testing dataset are presented below.
- Evaluation of frontal views of PKU-Reid dataset: the best accuracy achieved was 85.7%, by CSVM with the 250-feature subset, while the second best accuracy was 85.5%, by CSVM with the 1000-feature subset, as shown in Table 12. The training time and prediction speed of the proposed model on the front views of the PKU-Reid dataset are presented in Figure 14.
- Evaluation of back views of PKU-Reid dataset: the best accuracy achieved was 93.0%, by CSVM with the 1000-feature subset, while the second best accuracy was 92.5%, by QSVM with the 1000-feature subset, as shown in Table 13. The training time and prediction speed of the proposed model on the back views of the PKU-Reid dataset are presented in Figure 15.
- Evaluation of mixed views of PKU-Reid dataset: the best accuracy achieved was 91.2%, by CSVM with the 750- and 1000-feature subsets, while the second best accuracy was 90.4%, by QSVM with the 1000-feature subset, as shown in Table 14. The training time and prediction speed of the proposed model on the mixed views of the PKU-Reid dataset are presented in Figure 16.
The best ROC outcomes on the PKU-Reid dataset are presented in Figure 17. Regarding the PKU-Reid dataset, it is pertinent to mention that the relevant literature was studied thoroughly to find existing methods that use PKU-Reid for PGC, but no such methods were found; hence, a comparison with the results produced by the proposed approach is not possible. Although this dataset was introduced in 2016, researchers have not yet utilized it for PGC tasks.

Performance Comparison between Proposed Approach and Existing Studies
The proposed model was evaluated using the frontal, back, and mixed views of the MIT, VIPeR, and PKU-Reid testing datasets, and the details of the results obtained are presented in Tables 5-14 in the previous section. A performance comparison between the proposed method and existing classical and state-of-the-art methods follows. To validate the proposed framework, the results produced were compared with various methods such as CNN [16], HOG [45], HOG-LBP-HSV [45], CNN-e [19], Full-Body (CNN) [20], HDFL [40], SSAE [51], and recent best performers such as J-LDFR [41] and CSVFL [39] used in PGC.
These methods were selected for comparison because they report accuracies on the MIT dataset. Table 15 shows the comparison of accuracies achieved by existing pedestrian recognition methods and the proposed approach on the MIT dataset. The highest accuracy obtained by the proposed framework on the MIT dataset, 85.4%, was produced by the FKNN variant of the KNN classifier. As Table 15 shows, the proposed approach achieves better accuracy and hence outperforms all the existing PGC methods: it attains 0.2% higher ACC than the latest existing method, CSVFL [39], and improvements of 3.3% and 2.9% over the recent best performers J-LDFR [41] and SSAE [51], respectively. Compared with the 74.3% accuracy produced by the HDFL [40] method, the proposed method achieves 11.1% higher accuracy. A comparison of the results obtained with the existing and proposed methods in terms of accuracy is shown in Figure 18. Table 15. Performance comparison of results of proposed and existing PGC methods on MIT dataset.

To revalidate the worth of the proposed method, the results obtained with the presented approach were also compared with existing methods using the AUC evaluation protocol. According to the relevant literature, J-LDFR [41] is the only technique that has computed an AUC on the mixed views of the MIT dataset. Table 16 shows the comparison in terms of AUC; the obtained results show that the proposed approach outperformed the existing method, J-LDFR [41], with a 6.0% improvement. Figure 19 shows the AUC obtained by the proposed method using various classifiers with the 1000-feature subset on the mixed views of the MIT dataset; the CSVM variant of the SVM classifier produced the highest AUC, 92.0%. Figure 19. Comparison in terms of AUC with various classifiers on mixed views of MIT dataset using 1000-feature subset.

Discussion
In this manuscript, the PGC problem was addressed, and for this purpose pedestrian attribute recognition datasets, namely MIT, VIPeR, and PKU-Reid, were used. Extensive experiments were performed to develop the proposed approach, named 4-BSMAB, which comprises 64 layers for increased performance. First, the CIFAR-100 dataset was used to train the proposed model; features were then extracted from the three datasets, MIT, VIPeR, and PKU-Reid, using the pre-trained network. A feature optimization scheme based on ACS was selected to optimize the obtained features. Classification was carried out by performing experiments with various optimal feature subsets, and the outcome of the proposed framework was recorded using performance evaluation metrics.
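The paper's exact ACS formulation is not reproduced here, but the selection stage it describes can be sketched as a simplified ant-colony feature search: a pheromone value per feature biases which features each ant samples, candidate subsets are scored by cross-validated accuracy, and the best subset reinforces its features. All names, sizes, and parameters below are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def acs_select(X, y, k=5, n_ants=5, n_iter=10, rho=0.1, seed=0):
    """Simplified ant-colony feature selection (illustrative, not the
    paper's ACS): each ant samples a k-feature subset with probability
    proportional to pheromone, subsets are scored by 3-fold CV accuracy,
    and the best subset found so far deposits pheromone each iteration."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    tau = np.ones(d)                          # pheromone per feature
    best_subset, best_score = None, -np.inf
    for _ in range(n_iter):
        for _ in range(n_ants):
            subset = rng.choice(d, size=k, replace=False, p=tau / tau.sum())
            score = cross_val_score(
                KNeighborsClassifier(n_neighbors=3), X[:, subset], y, cv=3
            ).mean()
            if score > best_score:
                best_subset, best_score = subset, score
        tau *= 1.0 - rho                      # evaporation
        tau[best_subset] += rho * best_score  # reinforcement
    return np.sort(best_subset), best_score

# Toy data: only the first 5 of 30 features carry the class signal.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=300)
X = rng.normal(size=(300, 30))
X[:, :5] += y[:, None]
subset, score = acs_select(X, y, k=5)
print(subset, round(score, 3))
```

On this toy problem the search concentrates pheromone on the informative leading features; the real pipeline would apply the same loop to deep features extracted by the pre-trained network.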
During feature selection, feature subsets of different sizes were defined, and results were obtained for each. The same classifiers were used in all experiments. Considering the results of these classifiers on the three datasets, it was observed that the performance of most classifiers increased as the number of optimized features in a subset grew, while the differences in accuracy between classifiers became very small. Conversely, the performance of some classifiers dropped after the first iteration (100 features), remained flat or improved only marginally between the second and fourth iterations (250 to 750 features), and rose only slightly in the fifth iteration (1000 features). FKNN, CSVM, QSVM, and occasionally LSVM performed better than the other KNN and SVM variants, whereas the CRKNN, FGSVM, and CGSVM variants performed poorly in most experiments, with accuracies as low as roughly 50.0%. The experiments also showed that most classifiers performed best with the 500-, 750-, and 1000-feature subsets; overall, the 1000-feature subset can be considered the best. In terms of training time and prediction speed, all KNN variants were faster than the SVM variants.
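The trend described above, with accuracy generally rising as the subset grows and then saturating, can be examined with a sweep over subset sizes. The sketch below uses synthetic features (a weak class signal spread across all 1000 dimensions) as a stand-in for the optimized deep features, an assumption rather than the paper's data:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)
# Hypothetical feature matrix: each dimension carries a small class
# shift, so larger subsets accumulate more discriminative signal.
X = rng.normal(size=(400, 1000)) + 0.08 * y[:, None]

accs = {}
for k in (100, 250, 500, 750, 1000):
    accs[k] = cross_val_score(SVC(kernel="linear"), X[:, :k], y, cv=5).mean()
    print(k, round(accs[k], 3))
```

With this construction, 5-fold accuracy improves from the 100-feature subset to the 1000-feature subset, mirroring the behavior reported for most classifiers.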

Conclusions
A novel CNN-based framework, 4-BSMAB, was assessed for feature extraction, and ACS was used to select optimized feature sets. The SoftMax classifier was used to train the 4-BSMAB model on the existing CIFAR-100 dataset, and features were then obtained from common pedestrian datasets. The optimized feature set produced by the ACS optimization technique was provided to various SVM and KNN classifiers for PGC. Five-fold cross-validation was used for training and testing on the pedestrian datasets. Extensive experimentation was carried out with various feature subsets, and only five experiments per dataset were reported in detail. The experimental results show that the optimized subset of 100 features produced a lower accuracy of 81.3%, whereas the 1000-feature subset performed better, achieving 85.4% accuracy with the FKNN classifier and 92% AUC with the CSVM classifier on the MIT dataset. A comparison of the proposed model with existing state-of-the-art methods on the MIT dataset was presented, and the proposed method outperformed existing gender classification approaches. The CSVM classifier also performed well on the PKU-Reid dataset, generating 93% accuracy and 96% AUC. The results further show that most classifiers produced their best results with the 1000-feature optimized subset and their second-best results with the 100-feature optimized subset. To the best of our knowledge, results on PKU-Reid are not available in the relevant literature, so a performance comparison on that dataset is not possible. Although the proposed framework produced satisfactory results, accuracy can still be improved. In future work, other approaches such as LSTMs, manifold learning, and quantum deep learning may be explored for better performance.
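As a sketch of the evaluation protocol only (with synthetic features standing in for the real 4-BSMAB features), five-fold cross-validated accuracy and AUC can be computed as follows:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
# Hypothetical optimized feature set: 50 dimensions with a mild class shift.
X = rng.normal(size=(300, 50)) + 0.3 * y[:, None]

# Stratified 5-fold split keeps the gender balance in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
acc = cross_val_score(SVC(), X, y, cv=cv, scoring="accuracy").mean()
auc = cross_val_score(SVC(), X, y, cv=cv, scoring="roc_auc").mean()
print(f"5-fold accuracy = {acc:.3f}, AUC = {auc:.3f}")
```

Note that `roc_auc` scoring works with an uncalibrated SVC because scikit-learn ranks its decision-function outputs; no probability estimates are needed.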

Conflicts of Interest:
The authors declare no conflict of interest.