An Efficient Pedestrian Gender Recognition Method Based on Key Area Feature Extraction and Information Fusion

Zhang, Ye; Yan, Weidong; Liu, Guoqi; Jin, Ning; Han, Lu

doi:10.3390/app16031298

by Ye Zhang ¹,*, Weidong Yan ¹,*, Guoqi Liu ², Ning Jin ²,³ and Lu Han ⁴

¹ School of Civil Engineering, Shenyang Jianzhu University, Shenyang 110168, China
² School of Computer Science and Engineering, Shenyang Jianzhu University, Shenyang 110168, China
³ Key Laboratory of Digital Village Technology, Ministry of Agriculture and Rural Affairs, Beijing 100097, China
⁴ Ministry of E-Government, Liaoning Province Data Centre, Shenyang 110168, China
* Authors to whom correspondence should be addressed.

Appl. Sci. 2026, 16(3), 1298; https://doi.org/10.3390/app16031298

Submission received: 30 December 2025 / Revised: 21 January 2026 / Accepted: 23 January 2026 / Published: 27 January 2026

(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Aiming to address the problems of scale uncertainty, feature extraction difficulty, model training difficulty, poor real-time performance, and sample imbalance in low-resolution images for gender recognition, this study proposes an efficient pedestrian gender recognition model based on key area feature extraction and fusion. First, a discrete cosine transform (DCT)-based local super-resolution preprocessing algorithm is developed for facial image gender recognition. Then, a key area feature extraction and information fusion model is designed, using additional appearance features to assist in gender recognition and improve accuracy. The proposed model preprocesses images using the DCT image fusion and super-resolution methods, dividing pedestrian images into three regions: face, hair, and lower body (legs) regions. Features are separately extracted from each of the three image regions. Finally, a multi-region local gender recognition classifier is designed and trained, employing decision-level information fusion. The results of the three local classifiers are fused using a Bayesian computation-based fusion strategy to obtain the final recognition result of a pedestrian’s gender. This study uses surveillance video data to create a dataset for experimental comparison. Experimental results demonstrate the superiority of the proposed approach. The facial model (DCT-PFSR-CNN) achieved the best accuracy of 89% and an F1-Score of 0.88. Furthermore, the complete pedestrian model (MPGRM) attained an mAP of 0.85 and an AUC of 0.86, surpassing the strongest baseline (HDFL) by 2.4% in mAP and 2.3% in AUC. These results confirm the high application potential of the proposed method for gender recognition in real-world surveillance scenarios.

Keywords:

super-resolution processing; key area feature; information fusion; gender recognition

1. Introduction

Gender is a fundamental human attribute that plays a crucial role in identity retrieval and authentication. Accurate recognition of a pedestrian’s gender from images has become an important research topic in computer vision and artificial intelligence in recent years. With advances in technology and application devices, recognition performance continues to improve, and the task has attracted growing attention. Reliable gender recognition contributes to intelligent recognition, intelligent human–computer interaction, facial expression recognition, emotional expression, customer-oriented intelligent recommendations, and other tasks [1,2].

Convolutional neural networks (CNNs) [3] have demonstrated structural advantages in processing two-dimensional images and have therefore been widely used in image recognition and classification tasks. Gender, as a typical binary attribute of people, has been extensively recognized in many fields using various models that combine feature extraction and classification through CNNs. For example, Rajeev et al. [4] studied the relationships between multiple tasks and found that leveraging the synergy between different tasks could improve the overall performance of the model. They also proposed an algorithm that uses features of CNNs’ intermediate layers for the simultaneous analysis of multiple facial attributes. Lin et al. [5] improved the accuracy of facial-based gender recognition by implementing feature fusion and parameter optimization using a dual-input CNN. Chen et al. [6] added a squeeze-and-excitation network (SE-Net) structure between the CNN and the classifier, thus optimizing the features extracted by the CNN. Their classifier used the error minimization extreme learning machine (EM-ELM) instead of the extreme learning machine (ELM) to improve classification performance.

Although facial-based gender recognition has matured considerably, its heavy reliance on constrained image datasets limits its performance in practical applications. To address this problem, Dong [7] proposed the SRCNN model for super-resolution technology, considering an “end-to-end” mapping relationship between low- and high-resolution images. By training a CNN model, this relationship was learned directly to reconstruct images, improving image resolution. Furthermore, Dong [8] proposed replacing the preprocessing upsampling operation in the SRCNN model with a deconvolution layer, which significantly reduced the number of parameters through dimension reduction and subsequent upsampling, providing noticeable improvements in both operational speed and effectiveness. Tian et al. [9] studied methods for improving the resolution of blurred license plates and proposed an optimized super-resolution reconstruction method based on the FSRCNN. Comparative experiments with other algorithms showed that this model could not only improve image quality in both data evaluation and visual observation but also enhance the overall training efficiency.

The aforementioned studies have achieved excellent experimental results on well-constrained datasets, but gender recognition from low-resolution images still faces the problems of scale uncertainty, difficulty in feature extraction, challenging model training, low real-time performance, sample imbalance, and an inability to obtain images that meet experimental requirements [10]. To address these problems, this study proposes a pedestrian gender recognition model based on key area feature extraction and fusion. The main contributions of this work can be summarized as follows:

A discrete cosine transform (DCT) image fusion algorithm [11] and super-resolution technology-based FSRCNN model [12] are developed, using two consecutive frames from video surveillance as dual inputs to the model. The input images are divided into blocks, the DCT calculations are conducted to select high-quality blocks, and super-resolution processing is performed on high-frequency information blocks. Finally, all image blocks are fused to output a high-resolution image.
The facial gender recognition algorithm is optimized. In addition, by selecting and fusing the intermediate layer feature values of the CNN, more comprehensive information on the feature values is obtained. Based on the EfficientNet model [13], the output features of the intermediate and high layers are unified in dimension and fused. The fused feature values are then classified to achieve facial-based gender recognition.
A method that can recognize a pedestrian’s gender based on the entire body is proposed. This method first preprocesses an image; then extracts features from the face, hair, and clothing (lower body) regions; and finally trains different local classifiers for different regions. A Bayesian-based fusion strategy [14] is used to fuse the results of local classifiers to output the pedestrian’s gender recognition result.

2. Related Work

2.1. Gender Recognition in Low-Resolution Surveillance

Most existing gender recognition methods are designed for high-quality, near-frontal facial images. Models such as LNets + ANet [15] perform well on in-the-wild datasets but suffer significant degradation under the very low-resolution (VLR) and blurry conditions typical of surveillance footage [16]. This decline stems from the loss of discriminative facial details. To address this, research has pursued robust feature learning directly from poor-quality data. Approaches include fusing handcrafted and deep features (e.g., HDFL [17]), employing attention mechanisms [18], or learning resolution-invariant representations via adversarial training [19]. However, these methods primarily operate at the feature level, attempting to build a robust classifier for suboptimal data, rather than addressing the data-level bottleneck of missing pixel-wise information. This creates a gap for solutions incorporating front-end, task-aware image enhancement, which our work targets.

2.2. Image Super-Resolution for Enhancement

Deep learning-based Single Image Super-Resolution (SISR) has advanced significantly, from pioneering works like SRCNN [7] and FSRCNN [8] to more complex networks employing residual learning and attention mechanisms [20]. Generative models like SRGAN [21] further improve perceptual quality. However, applying these models globally to entire surveillance frames is computationally inefficient. Video SR methods leverage temporal information but increase complexity [22]. Crucially, most SR methods are optimized for generic perceptual metrics rather than for recovering features critical to downstream tasks like gender recognition. Some task-specific SR works exist for face recognition [23]. Building on this, our approach innovates by integrating DCT-based block selection with an efficient SR network (FSRCNN) to perform localized, on-demand enhancement. This focuses computational resources on enhancing only high-frequency blocks likely to contain discriminative details, balancing quality and efficiency for real-time systems.

2.3. Multi-Region and Attribute-Based Recognition

To overcome the fragility of face-only systems, soft biometrics utilize attributes like hair, clothing, and gait [24]. For gender recognition, features from hair [25] and clothing [26] have proven effective. A common pipeline involves precise body part localization (e.g., via pose estimation [27]) before feature extraction. However, in low-resolution scenarios, these precise localization models themselves become unreliable, leading to cascading errors. Therefore, a practical framework for surveillance must prioritize robustness over precise localization. Our method adopts a heuristic yet effective strategy: using a fast Haar [28] cascade for primary face detection and falling back on statistically informed region estimation when precise detection fails. This ensures operational stability in challenging conditions where complex parsers would be error-prone.

2.4. Information Fusion Strategies

Fusing information from multiple cues can occur at the feature, score, or decision level [29]. Decision-level fusion, which combines abstract classifier outputs, is particularly robust to heterogeneous sources. Bayesian fusion provides a principled probabilistic framework for this, weighting individual decisions based on prior probabilities and classifier reliabilities (e.g., false positive/negative rates) [14]. This makes it ideal for scenarios involving classifiers of differing and uncertain reliability. Given that our multi-region classifiers (face, hair, clothing) vary in accuracy and may be conditionally active (e.g., face classifier inactive), we adopt Bayesian fusion. It provides an optimal, theoretically grounded strategy for combining outputs from this heterogeneous ensemble, superior to heuristic rules like simple voting or averaging.

2.5. Theoretical Foundations of Key Techniques

2.5.1. Discrete Cosine Transform (DCT) and Its Inverse

The Discrete Cosine Transform (DCT) is pivotal for image analysis due to its strong energy compaction property. For an $N \times N$ image block $f(i, j)$, the 2D-DCT coefficients $F(u, v)$ are calculated as:

$$F(u, v) = c(u)\, c(v) \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} f(i, j) \cos\left[\frac{(2i + 1) \pi u}{2N}\right] \cos\left[\frac{(2j + 1) \pi v}{2N}\right] \tag{1}$$

for $u, v = 0, 1, \dots, N - 1$, with $c(u) = \sqrt{1/N}$ for $u = 0$ and $c(u) = \sqrt{2/N}$ otherwise. The corresponding Inverse DCT (IDCT) is:

$$f(i, j) = \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} c(u)\, c(v)\, F(u, v) \cos\left[\frac{(2i + 1) \pi u}{2N}\right] \cos\left[\frac{(2j + 1) \pi v}{2N}\right] \tag{2}$$

A key property is that the variance of DCT coefficients of a block correlates with its high-frequency content and detail richness [11]. We leverage this to select the sharper block from consecutive video frames, forming the basis of our selective enhancement in Section 4.1.
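As a minimal sketch of the transform pair, the following pure-Python functions evaluate Equation (1) and its standard inverse directly as double sums. The helper names (`dct2`, `idct2`, `coefficient_variance`) and the toy blocks are illustrative assumptions, not part of the original method; a practical system would use an optimized DCT routine.

```python
import math

def c(u, n):
    """Normalization factor c(u) from Equation (1)."""
    return math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)

def basis(i, u, n):
    """Cosine basis term cos[(2i + 1) * pi * u / (2n)]."""
    return math.cos((2 * i + 1) * math.pi * u / (2 * n))

def dct2(f):
    """2D DCT of a square block f, written as the double sum of Equation (1)."""
    n = len(f)
    return [[c(u, n) * c(v, n) * sum(f[i][j] * basis(i, u, n) * basis(j, v, n)
                                     for i in range(n) for j in range(n))
             for v in range(n)] for u in range(n)]

def idct2(F):
    """Inverse 2D DCT: recovers f(i, j) from the coefficients F(u, v)."""
    n = len(F)
    return [[sum(c(u, n) * c(v, n) * F[u][v] * basis(i, u, n) * basis(j, v, n)
                 for u in range(n) for v in range(n))
             for j in range(n)] for i in range(n)]

def coefficient_variance(F):
    """Variance of all DCT coefficients of a block."""
    values = [x for row in F for x in row]
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

# Toy 8 x 8 blocks (illustrative values only).
block = [[float((3 * i + 5 * j) % 17) for j in range(8)] for i in range(8)]
restored = idct2(dct2(block))
round_trip_error = max(abs(block[i][j] - restored[i][j])
                       for i in range(8) for j in range(8))

flat = [[10.0] * 8 for _ in range(8)]                                  # no detail
checker = [[20.0 * ((i + j) % 2) for j in range(8)] for i in range(8)]  # fine detail
var_flat = coefficient_variance(dct2(flat))
var_checker = coefficient_variance(dct2(checker))
```

In this toy comparison the checkerboard block, which carries high-frequency detail, yields the larger coefficient variance, which is the behaviour the block selection relies on.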

2.5.2. Bayesian Decision Fusion Framework

For binary classification (e.g., gender: $H_0$ = Female, $H_1$ = Male), Bayesian decision fusion offers a statistically optimal framework under conditional independence. It combines decisions by weighting them according to prior probabilities $P(H_j)$ and classifier likelihoods $P(x_i \mid H_j)$. For binary decisions $g_i \in \{-1, +1\}$, the fused decision $G$ can be expressed as a weighted sum:

$$G = \operatorname{sgn}\left(\omega_0 + \sum_{i=1}^{N} \omega_i \cdot g_i\right) \tag{3}$$

where weights $\omega_i$ are derived from the log-likelihood ratios of each classifier’s error rates (false positive $P_{FP_i}$ and false negative $P_{FN_i}$):

$$\omega_i = \begin{cases} \log \dfrac{1 - P_{FN_i}}{P_{FP_i}}, & \text{if } g_i = +1 \\ \log \dfrac{1 - P_{FP_i}}{P_{FN_i}}, & \text{if } g_i = -1 \end{cases} \tag{4}$$

This framework is chosen for our model (Section 5.4) as it elegantly handles classifiers of heterogeneous reliability and can manage inactive classifiers (e.g., when no face is detected).
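As a brief sketch of where the weights come from, assume the local decisions are conditionally independent given the true class, and take a false positive to mean a female sample labelled male, as in Section 5.4. The log posterior ratio then separates into a prior term and one term per classifier:

```latex
\log \frac{P(H_1 \mid g_1, \dots, g_N)}{P(H_0 \mid g_1, \dots, g_N)}
  = \underbrace{\log \frac{P(H_1)}{P(H_0)}}_{\omega_0}
  + \sum_{i=1}^{N} \log \frac{P(g_i \mid H_1)}{P(g_i \mid H_0)},
\qquad
\log \frac{P(g_i \mid H_1)}{P(g_i \mid H_0)} =
\begin{cases}
  \log \dfrac{1 - P_{FN_i}}{P_{FP_i}}, & g_i = +1 \\[2ex]
  \log \dfrac{P_{FN_i}}{1 - P_{FP_i}} = -\log \dfrac{1 - P_{FP_i}}{P_{FN_i}}, & g_i = -1
\end{cases}
```

In both cases the per-classifier term equals $\omega_i \cdot g_i$, so deciding $H_1$ when the log posterior ratio is positive gives the sign rule of Equation (3).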

2.5.3. Evaluation Metrics for Classification

A comprehensive evaluation requires metrics robust to class imbalance. Based on the confusion matrix counts—True Positives (TP), True Negatives (TN), False Positives (FP), False Negatives (FN)—we employ:

$$acc = \frac{TP + TN}{TP + TN + FP + FN} \tag{5}$$

$$pre = \frac{TP}{TP + FP} \tag{6}$$

$$recall = \frac{TP}{TP + FN} \tag{7}$$

$$F_1\text{-}Score = 2 \times \frac{pre \times recall}{pre + recall} \tag{8}$$

For the full pedestrian model, we also report mean Average Precision (mAP) and the Area Under the ROC Curve (AUC), which provide threshold-independent evaluation of ranking quality and are standard in attribute recognition. These metrics form our evaluation protocol in Section 6.
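Equations (5) to (8) can be computed directly from the four confusion-matrix counts. The sketch below is illustrative only: the function name and the example counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1-Score from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0 (an assumed convention).
    """
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    pre = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * pre * recall / (pre + recall) if (pre + recall) else 0.0
    return {"acc": acc, "pre": pre, "recall": recall, "f1": f1}

# Hypothetical counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```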

In summary, prior work often addresses individual challenges (enhancement, robust features, multi-cue extraction, or fusion) in isolation. Our key insight is that for practical low-resolution surveillance, these challenges are interdependent. The main contribution of this paper is the novel integration of (1) a DCT-based local enhancement mechanism, (2) a robust multi-region feature extraction strategy tolerant to localization uncertainty, and (3) a principled Bayesian decision fusion framework into a cohesive, end-to-end pipeline. This integrated approach, evaluated via a standard metrics suite, is designed to fill the gap for accurate and efficient pedestrian gender recognition in real-world surveillance systems.

3. Proposed Algorithm Framework

This study aims to solve the problem of low-resolution images in existing surveillance video equipment, which cannot accurately recognize gender through clear facial images, and overcome the inability to recognize gender for pedestrians whose morphology does not meet recognition conditions. The proposed pedestrian gender recognition model is based on multi-region local classifier information fusion. The overall framework includes four main parts: an image preprocessing block, a multi-region local feature extraction block, a gender recognition classifier based on the YOLO model [30], and a decision-level information fusion strategy-based block. First, the DCT image fusion algorithm is integrated with a super-resolution technology-based FSRCNN model. Two consecutive frames of pedestrian images obtained from a surveillance video are used as input. The high-frequency information blocks are enhanced in terms of resolution by calculating the image blocks at the same positions in the two frames using the DCT, employing super-resolution technology to obtain high-resolution image blocks. This process provides a high-resolution pedestrian model. In addition, the EfficientNet network is used for further optimization and improvement, where the feature values are optimized by integrating the network’s intermediate layers to enhance the classifier’s accuracy. Furthermore, a high-resolution pedestrian image is processed by a target detection module to identify and extract features from three body regions of a pedestrian: hair, face, and legs. The pedestrian image is checked for facial recognition using the Haar face detection algorithm [31]. If a human face is recognized, it is cropped out to generate a facial image for subsequent face-based gender recognition. In addition to the facial gender classifier, two YOLO classifiers are trained for different types of features, namely a hair gender classifier and a lower-body (legs) gender classifier. 
Finally, the accuracy of the local classifiers is used to assign weights to each local classifier in the pedestrian gender recognition result. Fusing the local classifiers’ results replaces multi-region feature fusion, and a decision-level gender classifier is used to recognize a pedestrian’s gender.

4. DCT-PFSR-CNN Model

4.1. DCT-Based Best-Quality Block Selection

Image fusion involves the processes of extracting and selecting representative data from two or more images, which are then fused into a complete image while retaining necessary and effective information. In the DCT-PFSR-CNN model, the process of image fusion is combined with super-resolution technology. First, the dual-input low-resolution images are divided into 8 × 8 square image blocks. For the blocks at the same positions in the two images, a DCT-based comparison selection is performed to select high-quality image blocks, which are then processed by the block-based super-resolution methods instead of performing global super-resolution processing on the input images.

Consider two adjacent images denoted by $P_A$ and $P_B$, with image dimensions of $p \times q$ and standardized dimensions of $P \times Q$. Each input image $F = (f_{i,j})$, where $i = 0, \dots, N - 1$ and $j = 0, \dots, M - 1$, is divided into 8 × 8 image blocks. Let $B_n = (b_{n,k,l})$ be the $n$th 8 × 8 image block, where $n = 0, \dots, P - 1$, $k = 0, \dots, 7$, and $l = 0, \dots, 7$. After performing the DCT on block $B_n$, the output $Y_n = (y_{n,k,l})$ is obtained. The DCT output of image $F$ can be represented as a set $D = \{D_0, D_1, \dots, D_{P-1}\}$, and the DCT calculation for the $R$th input can be represented by $D^R = \{D_0^R, D_1^R, \dots, D_{P-1}^R\}$. The block selection flowchart is displayed in Figure 1.

The specific steps of the block selection algorithm are as follows:

Step 1: Divide Input Images into Blocks:

The two adjacent source images are divided into 8 × 8 image blocks. The $k$th block of image A is denoted by $B_k^A$, and the $k$th block of image B is denoted by $B_k^B$. Then, each block undergoes a two-dimensional (2D) DCT (Equation (1) in Section 2.5.1):

$$D_k^R = dct(B_k^R) \tag{9}$$

Step 2: Normalize and Calculate Variance:

Normalize $D_k^R$ and compute the mean to obtain the variance:

$$var(D_k^R) \tag{10}$$

Step 3: Compare and Select Blocks:

Compare the variances $var(D_k^A)$ and $var(D_k^B)$, written below as $img1Var$ and $img2Var$, select the block with the higher variance, and update the $cvMap$ value as follows:

$$dctSub = \begin{cases} D_k^A, & \text{if } img1Var > img2Var \\ D_k^B, & \text{else} \end{cases} \tag{11}$$

$$varSub = \begin{cases} var(D_k^A), & \text{if } img1Var > img2Var \\ var(D_k^B), & \text{else} \end{cases} \tag{12}$$

$$cvMap = \begin{cases} -1, & \text{if } img1Var > img2Var \\ +1, & \text{else} \end{cases} \tag{13}$$

Step 4: Inverse DCT for Block Selection:

The dctSub denotes a high-quality block obtained through the DCT calculation and comparison. Because the image blocks have undergone the DCT process, they need to be transformed by the inverse DCT (IDCT), as defined in Equation (2) of Section 2.5.1, to obtain the final image block:

$$idct(dctSub) \tag{14}$$

Step 5: Output High-quality Blocks:

The final idct(dctSub) is an 8 × 8 image block, which represents a selected high-quality image block.
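Steps 1 to 5 can be sketched in pure Python as follows. This is an illustrative reading of the procedure, not the authors’ implementation: the function names are hypothetical, the normalization in Step 2 is not specified in the text and is assumed here to be division of the coefficients by the block size, and frame dimensions are assumed to be multiples of 8.

```python
import math

BLOCK = 8  # block size used in Section 4.1

def dct_matrix(n=BLOCK):
    """Orthonormal DCT matrix C, so that Equation (1) reads F = C f C^T."""
    return [[(math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n))
             * math.cos((2 * i + 1) * math.pi * u / (2 * n))
             for i in range(n)] for u in range(n)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

C = dct_matrix()
C_T = [list(col) for col in zip(*C)]

def dct2(block):
    return matmul(matmul(C, block), C_T)

def idct2(coeffs):
    return matmul(matmul(C_T, coeffs), C)

def coeff_variance(coeffs):
    """Step 2: variance of the coefficients after the assumed normalization."""
    values = [v / BLOCK for row in coeffs for v in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def select_blocks(frame_a, frame_b):
    """Return the fused frame and the map of choices (-1: frame A, +1: frame B)."""
    height, width = len(frame_a), len(frame_a[0])
    fused = [[0.0] * width for _ in range(height)]
    cv_map = []
    for top in range(0, height, BLOCK):
        map_row = []
        for left in range(0, width, BLOCK):
            # Step 1: cut co-located blocks and transform them.
            d_a = dct2([row[left:left + BLOCK] for row in frame_a[top:top + BLOCK]])
            d_b = dct2([row[left:left + BLOCK] for row in frame_b[top:top + BLOCK]])
            # Step 3: keep the block with the higher coefficient variance.
            if coeff_variance(d_a) > coeff_variance(d_b):
                dct_sub, flag = d_a, -1
            else:
                dct_sub, flag = d_b, +1
            # Steps 4 and 5: inverse transform and write the block back in place.
            restored = idct2(dct_sub)
            for r in range(BLOCK):
                fused[top + r][left:left + BLOCK] = restored[r]
            map_row.append(flag)
        cv_map.append(map_row)
    return fused, cv_map

# Toy 8 x 16 frames: each frame is detailed in one block and flat in the other.
flat = [[10.0] * BLOCK for _ in range(BLOCK)]
checker = [[20.0 * ((r + c) % 2) for c in range(BLOCK)] for r in range(BLOCK)]
frame_a = [checker[r] + flat[r] for r in range(BLOCK)]
frame_b = [flat[r] + checker[r] for r in range(BLOCK)]
fused, cv_map = select_blocks(frame_a, frame_b)
```

In this toy case the detailed block is taken from frame A on the left and from frame B on the right, so the fused frame is detailed in both positions.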

4.2. Block-Based Super-Resolution Technology

The SRCNN model has a simple structure consisting of a few layers but requires using bicubic interpolation to upscale the input image to the target size before image block extraction and feature representation [32]. However, this model has some limitations, including a single-scale factor and a small receptive field in the convolution process, which can result in overly local features and poor detail recovery. The introduction of the FSRCNN model can address the aforementioned problems, with almost no loss in mapping accuracy. The framework of the FSRCNN model is shown in Figure 2.

The core of super-resolution technology is to learn the end-to-end mapping function relationship between a low-resolution image X and a high-resolution image Y and to reconstruct image Y from image X. By using the DCT to select high-quality blocks from adjacent frames, these blocks are divided into high- and low-frequency information blocks that are then integrated based on a preset threshold. Namely, if the variance of an image block is equal to or larger than the threshold, the block undergoes super-resolution processing; otherwise, the block is treated as a block to be integrated. The high-frequency information blocks are processed using super-resolution technology to obtain high-resolution image blocks. The high-resolution image blocks are then integrated with the low-frequency information blocks. After super-resolution processing, the high-frequency information blocks undergo double deconvolution in the deconvolution layer. To maintain consistent block specifications, the unprocessed low-frequency information blocks are also upsampled two times. Finally, the image fusion operation places the corresponding image blocks back in their positions in the image, and consistency detection, performed using the cvMap, yields the final high-quality DCT-fused image. Block-based super-resolution processing avoids unnecessary operations on low-frequency information image blocks, thus effectively improving super-resolution processing speed.
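The routing rule in this paragraph can be sketched as below. The sketch rests on several assumptions: `super_resolve` is a hypothetical callable standing in for a trained FSRCNN, the upscaling factor is taken to be 2, and nearest-neighbour replication stands in for the upsampling of low-frequency blocks, whose interpolation method the text does not specify.

```python
def upsample2x(block):
    """Nearest-neighbour 2x upsampling (assumed stand-in for the unspecified method)."""
    out = []
    for row in block:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def enhance_blocks(blocks, variances, threshold, super_resolve):
    """Send blocks with variance >= threshold to super-resolution; upsample the rest."""
    enhanced = []
    for block, variance in zip(blocks, variances):
        if variance >= threshold:
            enhanced.append(super_resolve(block))   # high-frequency block
        else:
            enhanced.append(upsample2x(block))      # low-frequency block
    return enhanced

# Hypothetical stand-in for the network: it records each call, then upsamples.
sr_calls = []

def fake_super_resolve(block):
    sr_calls.append(block)
    return upsample2x(block)

detail = [[float((r + c) % 2) for c in range(8)] for r in range(8)]
smooth = [[0.5] * 8 for _ in range(8)]
enhanced = enhance_blocks([detail, smooth], [4.0, 0.2],
                          threshold=1.0, super_resolve=fake_super_resolve)
```

Only the first block reaches the super-resolution callable, while both outputs share the same 16 × 16 size, which is what allows them to be placed back into a single fused image.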

4.3. CNN-Based Face Gender Classification

In this study, the EfficientNet model is selected for the gender classification task to enhance classification performance. The EfficientNet model is an efficient CNN model that uses compound coefficients to scale the network’s depth, width, and resolution, achieving optimal performance with given resources [33]. This allows the EfficientNet model to excel in various tasks, including image classification, especially in scenarios with limited computational resources, such as real-time image processing and analysis on local devices.

In this study, the EfficientNet network is optimized by fusing features from the network’s intermediate and high layers to obtain richer and more diverse feature information, thus improving the gender classifier’s accuracy.

The detailed steps are as follows:

(1): Feature Extraction and Fusion

Extract features from the intermediate layer (e.g., MBConv4 layer) and the high layer (e.g., MBConv6 layer) of the EfficientNet model. Assume that the output dimensions of the MBConv4 and MBConv6 layers are 14 × 14 × 112 and 7 × 7 × 320, respectively.

Next, perform global average pooling on the MBConv4 and MBConv6 layers’ outputs, obtaining feature maps with sizes of 1 × 1 × 112 and 1 × 1 × 320, respectively. Then, use 1 × 1 convolutions to adjust the channel numbers of the feature maps to the same size, such as 64, resulting in feature maps with a size of 1 × 1 × 64. Concatenate the two feature maps to obtain a fused feature map with a size of 1 × 1 × 128. Finally, apply another 1 × 1 convolution layer to adjust the channel number of the fused feature map to the final output size, for example, 256.

(2): Classifier Design

Use a two-layer fully connected network for classification, where the first layer has 256 neurons and uses the ReLU activation function to introduce non-linearity, and the second layer is the output layer with two neurons and employs the Softmax activation function, which is responsible for final classification.

(3): Model Training

Standardize the input super-resolution images, use the Adam optimizer with an initial learning rate of 0.001, and apply a cosine annealing learning rate schedule. Employ the cross-entropy loss function for supervised learning.
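The dimension flow of steps (1) and (2) can be traced with a small pure-Python sketch. It uses random, untrained weights, omits biases, and treats a 1 × 1 convolution on a 1 × 1 × C map as a matrix-vector product. All helper names are hypothetical, and the layer sizes are the example values quoted above rather than a confirmed configuration.

```python
import math
import random

def global_average_pool(feature_map):
    """Average an H x W x C nested list over H and W, giving C channel means."""
    height, width = len(feature_map), len(feature_map[0])
    sums = [0.0] * len(feature_map[0][0])
    for row in feature_map:
        for pixel in row:
            for k, value in enumerate(pixel):
                sums[k] += value
    return [s / (height * width) for s in sums]

def pointwise_conv(vector, weights):
    """1 x 1 convolution on a 1 x 1 x C map: weights has shape (out, in)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def relu(vector):
    return [max(0.0, v) for v in vector]

def softmax(vector):
    peak = max(vector)
    exps = [math.exp(v - peak) for v in vector]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)

def random_tensor(*shape):
    """Nested list of the given shape filled with small random values."""
    if len(shape) == 1:
        return [rng.uniform(-0.1, 0.1) for _ in range(shape[0])]
    return [random_tensor(*shape[1:]) for _ in range(shape[0])]

# Stand-ins for the intermediate (14 x 14 x 112) and high (7 x 7 x 320) layer outputs.
mid_map = random_tensor(14, 14, 112)
high_map = random_tensor(7, 7, 320)

# (1) Pool, align both branches to 64 channels, concatenate, project to 256.
mid_vec = pointwise_conv(global_average_pool(mid_map), random_tensor(64, 112))
high_vec = pointwise_conv(global_average_pool(high_map), random_tensor(64, 320))
fused = pointwise_conv(mid_vec + high_vec, random_tensor(256, 128))

# (2) Two-layer classifier: 256 ReLU units, then 2 softmax outputs.
hidden = relu(pointwise_conv(fused, random_tensor(256, 256)))
probs = softmax(pointwise_conv(hidden, random_tensor(2, 256)))
```

Because the weights are random, `probs` carries no meaning as a prediction; the sketch only checks that the shapes described in the text fit together.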

5. Multi-Region Pedestrian Gender Recognition Model (MPGRM)

5.1. Pedestrian Gender Recognition Model

The varied poses and appearances of pedestrians in surveillance videos make it challenging to obtain clear, frontal facial images. As a result, gender recognition based only on the face is not always feasible, and other appearance features must also be used. This study proposes a method for pedestrian gender identification that considers the entire pedestrian body, using a gender recognition model based on multi-region feature extraction and result fusion.

The proposed model consists of three main components: image preprocessing, multi-region local gender recognition classification, and result fusion. First, the model performs DCT-based local super-resolution processing as a preprocessing step of pedestrian gender recognition. This step focuses on enhancing blurred pedestrian images. In addition, the color images of pedestrians are converted to grayscale to mitigate the effects of factors such as lighting intensity on image quality. Next, the pedestrian image is divided into three regions: face, hair, and lower body (legs) regions. Features are extracted from each region, and then gender recognition based on multi-region features is performed using decision-level information fusion. Finally, gender classifiers are designed for all regions, and their results are fused using a Bayesian-based fusion strategy to obtain the final gender recognition result.

5.2. Feature Extraction

Based on the obtained high-resolution pedestrian images, gender classification is performed using features extracted from the face, hair, and lower body (legs) regions. The feature extraction process from multiple regions in a portrait area, including the face, hair, and clothing, is illustrated in Figure 3. The feature extraction methods for the three regions are explained in detail in the following sections.

5.2.1. Facial Feature Extraction

A pedestrian image is processed using the Haar face detection algorithm to determine whether a face can be identified. If a face is detected, the face region is cropped to create a facial image for gender recognition based on facial features. The face detection algorithm identifies the face position and returns the coordinates of the top-left corner and the dimensions of the facial image, denoted by $\Omega_{face} = \{x, y, w, h\}$, where $x$ and $y$ are the coordinates of the top-left corner, and $w$ and $h$ indicate the width and height of the face region, respectively.

5.2.2. Hair Feature Region

It should be noted that the pedestrian images input to our MPGRM have been pre-processed by a pedestrian detector (e.g., YOLO-series model). This detection and cropping step ensures that the pedestrian is roughly centered within the image frame, which provides a valid basis for the subsequent fixed-region estimation described below.

Considering that hair grows around the face and surrounds it, for images where the face is detectable, the hair region is denoted by $\Omega_{hair} = \Omega_{HL} \cup \Omega_{HR}$. The size of $\Omega_{HL}$ and $\Omega_{HR}$ should be carefully selected because if it is too small, the edges of the hair region can be missed, and if it is too large, the hair region might include excessive noise. Based on prior knowledge and multiple experiments, the following parameter settings are used in this study: the height of $\Omega_{HL}$ is equal to that of the face ($h$), and its width is one-quarter of the face width ($\frac{1}{4} w$). Similar settings are used for $\Omega_{HR}$.

For images where the face cannot be detected, the hair region is estimated based on prior knowledge. In these cases, the hair region is predicted to be approximately in the upper third of the pedestrian image. This region is denoted by $\Omega_{hair} = \{x, y, w, h\}$; the center pixel of the region is labeled as $\{x, y\} = \{\frac{1}{2} n, \frac{1}{6} m\}$, and the region’s dimensions are denoted by $\{w, h\} = \{\frac{1}{3} n, \frac{1}{3} n\}$, where $n$ and $m$ indicate the dimensions of the pedestrian image.

The clothing classifier for pedestrian images focuses primarily on extracting and classifying features of the lower-body clothing. The clothing region is located in the lower part of a pedestrian image. By segmenting the pedestrian image, a clothing region $\Omega_{bottoms}$ is cropped out in this study. The strategy for dividing the clothing region is as follows.

Assume that the dimensions of a pedestrian image are $m \times n$; with the pedestrian image dimensions used as a reference, the clothing region is defined as $\Omega_{bottoms} = \{x, y, w, h\}$, where the central pixel of the region is $\{x, y\} = \{\frac{1}{2} n, \frac{3}{4} m\}$, and the region size is $\{w, h\} = \{\frac{1}{2} m, \frac{1}{2} n\}$.
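The region rules above can be written as a few small helper functions. This sketch makes several assumptions that are not stated in the text: $n$ is taken as the image width and $m$ as the height (consistent with the center coordinates), the two hair strips are placed immediately to the left and right of the face box, and the size formulas are transcribed exactly as written. The function names are hypothetical.

```python
def centered_box(cx, cy, w, h):
    """Convert a box given by center and size to top-left form (x, y, w, h)."""
    return (cx - w / 2, cy - h / 2, w, h)

def hair_regions_from_face(face):
    """Left and right hair strips for a detected face box (x, y, w, h).

    Each strip has the face height and one quarter of the face width;
    placing it flush against the face box is an assumption.
    """
    x, y, w, h = face
    side = w / 4
    return (x - side, y, side, h), (x + w, y, side, h)

def fallback_hair_region(n, m):
    """Hair region when no face is found: center (n/2, m/6), size (n/3, n/3)."""
    return centered_box(n / 2, m / 6, n / 3, n / 3)

def clothing_region(n, m):
    """Lower-body region: center (n/2, 3m/4), size (m/2, n/2), as written."""
    return centered_box(n / 2, 3 * m / 4, m / 2, n / 2)

# Hypothetical 64 x 128 (width x height) pedestrian crop and face box.
left_hair, right_hair = hair_regions_from_face((20, 10, 24, 32))
hair_box = fallback_hair_region(64, 128)
bottoms_box = clothing_region(64, 128)
```

A production version would also clip each box to the image bounds before cropping, since the strips beside a face near the image edge can extend outside the frame.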

5.3. Gender Classifier Based on YOLOv8

To improve the accuracy and efficiency of gender classification, this study employs the YOLOv8 model for the binary classification task. The backbone of the YOLOv8 model consists of convolutional blocks, a C2f block, and a spatial pyramid pooling-fast (SPPF) block. The convolutional blocks consist of convolutional layers, the SiLU activation function, and BatchNorm2d normalization layers. The SPPF block is used for feature extraction, and the C2f block performs lightweight processing. The YOLOv8 model emphasizes important features using the attention mechanism and reduces computational waste through a carefully designed network path. This design ensures high detection accuracy and maintains high efficiency in real-time tasks. Two YOLO classifiers are trained: a hair gender classifier and a lower-body (clothing) gender classifier. Features from the hair and lower-body (legs) regions are classified by these local YOLO classifiers, and the face gender classifier described in Section 4.3 is used for the face. The outputs of all local classifiers are then fused to obtain the pedestrian’s gender.

In this study, the YOLOv8 model is selected as a gender classifier for hair and clothing, and these classifiers are trained separately. Considering the YOLOv8 model’s structure, the bi-level routing attention mechanism (BiFormer) is introduced, inserted after the original YOLOv8 model’s SPPF block, replacing this block with the final output layer.

5.4. Information Fusion Strategy

Given the heterogeneous nature and varying reliability of our region-specific classifiers (face, hair, clothing), we adopt a decision-level fusion strategy. The design of our fusion center is based on the Bayesian decision fusion framework introduced in Section 2.5.2. This principled probabilistic approach is chosen for its optimality in weighting the outputs from classifiers of differing confidences, and its ability to gracefully handle cases where a classifier may be inactive.

Considering the dimensions and attributes of features, local classifier result fusion is used instead of multi-region feature fusion. In addition, a decision-level gender classifier is employed to recognize pedestrian gender. Gender, as a binary attribute, is represented by $H_{0}$ for female and $H_{1}$ for male, and their prior probabilities are defined as $P(H_{0}) = P_{0}$ and $P(H_{1}) = P_{1}$, respectively, where $P_{0} + P_{1} = 1$. Assuming that each local classifier’s results are independent, the conditional probability is denoted by $P(x_{i} \mid H_{j})$, where $i = 1, 2, \dots, N$, $j = 0, 1$, and $N$ is the number of local classifiers. A local decision $g_{i}$ of each classifier can be expressed as follows:

$$g_{i} = \begin{cases} -1, & H_{0} \\ +1, & H_{1} \end{cases} \tag{15}$$

The false positive rate ${FP}_{i}$ of a local classifier represents the proportion of female data samples mistakenly identified as male by the classifier. The false negative rate ${FN}_{i}$ represents the proportion of male data samples mistakenly identified as female; ${FP}_{i}$ and ${FN}_{i}$ can be obtained from the training data. These error rates correspond directly to the classifier likelihood terms $P_{{FP}_{i}}$ and $P_{{FN}_{i}}$ within the Bayesian framework of Section 2.5.2. After classification by each local classifier, the classification results are transmitted to the result fusion center. The fusion result is obtained using the Bayesian criterion, following the derivation culminating in Equation (3) of Section 2.5.2, and the fused gender decision $G$ is calculated by

$$G = \operatorname{sgn}\left(\omega_{0} + \sum_{i = 1}^{N} \omega_{i} \cdot g_{i}\right) \tag{16}$$

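where

$$\begin{cases} \omega_{0} = \log \dfrac{P_{1}}{P_{0}} \\ \omega_{i} = \delta(g_{i} - 1) \cdot \log \dfrac{1 - P_{{FN}_{i}}}{P_{{FP}_{i}}} + \delta(g_{i} + 1) \cdot \log \dfrac{1 - P_{{FP}_{i}}}{P_{{FN}_{i}}} \end{cases} \tag{17}$$

Here, $\operatorname{sgn}(\cdot)$ is the sign function, and $\delta(x)$ is the impulse function, which equals 1 when $x = 0$ and 0 otherwise, so that exactly one of the two logarithmic terms is selected by the local decision $g_{i}$.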

If the face cannot be detected in a pedestrian image, the face gender recognition classifier is set to an inactive state, meaning it will not contribute to the pedestrian gender recognition process. This scenario highlights a key advantage of the chosen Bayesian fusion strategy, as it seamlessly integrates only the outputs from the active, available classifiers based on their known reliabilities. In this case, gender recognition is performed only by the hair and clothing classifiers, whose results are fused to determine pedestrian gender.
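The fusion rule of Equations (16) and (17), including the inactive-classifier case, can be illustrated with a minimal Python sketch. The function names, the equal priors, and the error rates below are hypothetical values chosen for illustration only and are not taken from the experiments. An inactive classifier is represented by `None` and is simply skipped in the sum.

```python
import math

def fusion_weight(decision, p_fp, p_fn):
    """Weight omega_i of Equation (17) for a local decision of -1 or +1."""
    if decision == 1:    # classifier says male: delta(g_i - 1) = 1
        return math.log((1.0 - p_fn) / p_fp)
    if decision == -1:   # classifier says female: delta(g_i + 1) = 1
        return math.log((1.0 - p_fp) / p_fn)
    raise ValueError("decision must be -1 (female) or +1 (male)")

def fuse(decisions, error_rates, prior_male=0.5):
    """Equation (16): sign of omega_0 plus the weighted local decisions.

    decisions:   region -> -1, +1, or None for an inactive classifier.
    error_rates: region -> (P_FP, P_FN), assumed to lie strictly in (0, 1).
    """
    score = math.log(prior_male / (1.0 - prior_male))  # omega_0
    for region, g in decisions.items():
        if g is None:    # inactive classifier contributes nothing
            continue
        p_fp, p_fn = error_rates[region]
        score += fusion_weight(g, p_fp, p_fn) * g
    # A score of exactly zero is resolved to male here; this tie-break is an assumption.
    return 1 if score >= 0 else -1

# Hypothetical error rates (P_FP, P_FN) for each local classifier.
RATES = {"face": (0.05, 0.05), "hair": (0.20, 0.20), "clothing": (0.30, 0.30)}

# A reliable face classifier voting female outweighs two weaker votes for male.
with_face = fuse({"face": -1, "hair": 1, "clothing": 1}, RATES)
# With the face classifier inactive, only hair and clothing are fused.
without_face = fuse({"face": None, "hair": 1, "clothing": 1}, RATES)
print(with_face, without_face)
```

In this sketch the more reliable a classifier is, the larger the magnitude of its weight, which is why a single confident face decision can override two less reliable regional decisions.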

6. Experimental Results and Analysis

6.1. Evaluation Metrics

The performance of all models is evaluated using the standard classification metrics defined in Section 2.5.3, namely Accuracy, Precision, Recall, and the F1-Score. These metrics provide a comprehensive view of the classifier’s performance across both gender classes.

For the evaluation of our full-body pedestrian gender recognition model (MPGRM), we additionally report the mean Average Precision (mAP) and the Area Under the ROC Curve (AUC), as recommended in Section 2.5.3 for a more robust assessment against potential class imbalance in our real-world test set. These metrics offer a threshold-independent evaluation of the model’s ranking quality.
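To make the threshold-independence of the AUC concrete, the following sketch computes it as the fraction of (positive, negative) pairs that are ranked correctly, with ties counted as one half. The function name, labels, and scores are hypothetical and serve only as an illustration of the metric, not as the evaluation code used in this study.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a correctly ranked pair
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = male, 0 = female) and classifier scores.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)

# Rescaling the scores keeps their order, so the AUC is unchanged.
auc_rescaled = roc_auc(labels, [0.5 * s for s in scores])
print(auc, auc_rescaled)
```

Because only the ordering of the scores matters, no decision threshold has to be chosen, which is the property that makes the metric useful when the two classes are not balanced.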

6.2. Experimental Setup

Dataset: The proposed models are evaluated on a real-world surveillance dataset comprising 1200 pedestrian images (700 male, 500 female), standardized to 180 × 300 pixels. Approximately 60% contain detectable faces. The dataset is divided into ten folds, maintaining the original gender ratio in each subset. A 9:1 train-test split is used for all experiments.
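A ten-fold partition that preserves the 700:500 gender ratio could be produced as in the sketch below. The function name, the fixed random seed, and the choice of the first fold as the test set are assumptions made for illustration; the text above does not specify how the folds were generated or rotated.

```python
import random

def stratified_folds(labels, n_folds=10, seed=0):
    """Split sample indices into n_folds folds that preserve the class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin across the folds.
        for k, i in enumerate(idx):
            folds[k % n_folds].append(i)
    return folds

# Class counts as stated for the dataset: 700 male and 500 female images.
labels = ["male"] * 700 + ["female"] * 500
folds = stratified_folds(labels)

# One 9:1 split: fold 0 is held out for testing, the other nine are used for training.
test_idx = folds[0]
train_idx = [i for fold in folds[1:] for i in fold]
print(len(train_idx), len(test_idx))
```

With these counts every fold holds 70 male and 50 female images, so a 9:1 split yields 1080 training and 120 test images.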

Hardware: All experiments are conducted on a system with an AMD R5 processor (Santa Clara, CA, USA) and an 8 GB NVIDIA GPU (Santa Clara, CA, USA).

6.3. Baseline Models and Comparison Fairness

The proposed models are compared against several state-of-the-art methods:

For facial gender recognition: RCNN-Gender, CNN-ELM [34], CNN-SE-ELM [6], and LNets-ANet [15].

For pedestrian gender recognition: Mini-CNN, VGGNet16 [35], GoogleNet [36], ResNet50 [37], and HDFL [16].

To ensure a fair comparison, all baseline models are trained and tested directly on the original low-resolution images from the same dataset splits without any dedicated super-resolution preprocessing. This isolates the contribution of the classification architecture. The proposed DCT-PFSR-CNN model is evaluated with its integrated preprocessing to demonstrate the end-to-end advantage.

6.4. Results of Facial Gender Recognition (DCT-PFSR-CNN)

6.4.1. DCT-Based Super-Resolution Experiment

In this experiment, two preprocessed facial images were used as inputs for DCT image fusion, resulting in a fused output image. The results of the comparative experiments demonstrated that consistency verification in the DCT image fusion algorithm yielded better performance. The experimental results are shown in Figure 4.

The average peak signal-to-noise ratio (PSNR) of 10 subsets was used as a comparison parameter, as shown in Figure 5. The experimental results indicated that the image PSNR significantly improved with the introduction of the PFSR-CNN module.

The observed PSNR improvement indicates that the proposed method effectively reconstructs high-frequency details and reduces noise. For facial gender recognition, this translates into clearer visual cues in critical regions such as the eyes, eyebrows, and mouth contour, which are essential for accurate classifier discrimination. Therefore, the enhanced image quality directly contributes to the robust feature learning and final classification performance reported in the following sections.
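For reference, PSNR follows the standard definition $10 \log_{10}(\mathrm{MAX}^{2} / \mathrm{MSE})$. The sketch below assumes 8-bit grayscale images (peak value 255) stored as nested lists; the function name and the pixel values are hypothetical and are not data from the experiment.

```python
import math

def psnr(reference, reconstructed, max_value=255.0):
    """PSNR in dB between two equally sized grayscale images (nested lists)."""
    ref = [p for row in reference for p in row]
    rec = [p for row in reconstructed for p in row]
    if not ref or len(ref) != len(rec):
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((a - b) ** 2 for a, b in zip(ref, rec)) / len(ref)
    if mse == 0:
        return math.inf   # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical 3 x 3 reference patch and two reconstructions of it.
reference = [[52, 55, 61], [59, 79, 61], [62, 59, 55]]
coarse = [[p + 10 for p in row] for row in reference]  # every pixel off by 10
fine = [[p + 5 for p in row] for row in reference]     # every pixel off by 5

psnr_coarse = psnr(reference, coarse)
psnr_fine = psnr(reference, fine)
print(round(psnr_coarse, 2), round(psnr_fine, 2))
```

Halving the per-pixel error divides the MSE by four and therefore raises the PSNR by $10 \log_{10} 4 \approx 6$ dB, which illustrates why a higher PSNR indicates a closer reconstruction.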

6.4.2. Comparison with State-of-the-Art Models

The facial gender recognition performance of the proposed DCT-PFSR-CNN model is benchmarked against four advanced methods (RCNN-Gender, CNN-ELM, CNN-SE-ELM, LNets-ANet). For a fair comparison focused on classification under low-resolution conditions, all baseline models are trained and tested directly on the original low-resolution images. The quantitative results, summarized in Table 1, demonstrate that the proposed model achieves the highest scores across all metrics, with a leading accuracy of 0.89. This performance notably surpasses the 0.85 accuracy of the strongest baseline (LNets-ANet). Visual examples of correct and incorrect classifications from different models are provided in Figure 6, offering intuitive evidence of the improved robustness of the proposed method. In Figure 6, a blue dot in the upper right corner of the image indicates a male recognition result, and a pink dot indicates a female recognition result.

6.5. Results of Full-Body Pedestrian Gender Recognition (MPGRM)

6.5.1. Overall Performance and Case Analysis

The model was used to perform gender recognition on pedestrian images, and some of the recognition results are shown in Figure 7. In Figure 7, a blue dot in the upper right corner of the image indicates a male recognition result, and a pink dot indicates a female recognition result. Furthermore, a green circle in the upper left corner indicates that the face was detectable, whereas a red “×” indicates that the face could not be detected.

In this study, 516 pedestrian images were randomly selected from the surveillance video dataset to test the proposed model. Among the selected images, 324 showed pedestrians whose faces could be detected, and 192 showed pedestrians whose faces could not be detected; the recognition accuracy was 0.90 and 0.88, respectively, with an overall accuracy of 0.89 across both groups. The experimental results are shown in Table 2 and Table 3.
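The reported accuracies can be reproduced from the confusion matrices in Table 2 and Table 3. The sketch below treats male as the positive class, consistent with the definitions of the false positive and false negative rates in Section 5.4; the function and variable names are illustrative.

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall, and F1-score from binary confusion counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Counts read from Table 2 (faces detected) and Table 3 (faces not detected):
# tp = male predicted male, fn = male predicted female,
# fp = female predicted male, tn = female predicted female.
face_detected = {"tp": 174, "fn": 19, "fp": 13, "tn": 118}
face_missing = {"tp": 97, "fn": 11, "fp": 12, "tn": 72}
combined = {k: face_detected[k] + face_missing[k] for k in face_detected}

acc_detected = metrics(**face_detected)["accuracy"]  # 292 / 324
acc_missing = metrics(**face_missing)["accuracy"]    # 169 / 192
acc_combined = metrics(**combined)["accuracy"]       # 461 / 516
print(round(acc_detected, 2), round(acc_missing, 2), round(acc_combined, 2))
```

The two tables sum to 324 and 192 images, respectively, and the pooled counts give the overall accuracy over all 516 test images.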

6.5.2. Benchmarking Against State-of-the-Art

The proposed MPGRM is compared against several state-of-the-art pedestrian recognition models using the mean Average Precision (mAP) and Area Under the ROC Curve (AUC), as shown in Table 4. The MPGRM achieves the highest scores on both metrics (mAP: 0.85, AUC: 0.86), outperforming all compared methods. Notably, it surpasses the strongest baseline, HDFL, by 2.4% in mAP and 2.3% in AUC in relative terms (about 0.02 in absolute terms on each metric). This result underscores the advantage of the end-to-end framework, which combines local image enhancement, multi-region feature extraction, and Bayesian decision fusion for the challenging task of gender recognition in low-resolution surveillance footage.

6.6. Comprehensive Discussion

6.6.1. Interpretation of Results

The experimental findings consistently validate the core hypotheses of this work. The significant performance gain of the DCT-PFSR-CNN model over facial-only baselines highlights the indispensable role of targeted, local super-resolution preprocessing in low-resolution gender recognition. More importantly, the success of the MPGRM framework demonstrates that a multi-cue, decision-level fusion strategy is not merely complementary but essential for robust pedestrian analysis in unconstrained surveillance environments. The Bayesian fusion mechanism proves effective in weighting the contributions from classifiers of differing reliability and activation states, which is a common scenario in real-world applications.

6.6.2. Failure Analysis and Limitations

Analysis of misclassified cases provides insight into the system’s boundaries. Primary failure modes include: (1) Severe Occlusion or Extreme Viewpoints: Where key anatomical regions (face, hair, lower-body) are largely obscured. (2) Ambiguous Apparel and Hairstyles: Cases that challenge gender stereotypes, confusing the region-specific classifiers. A further inherent limitation stems from the region extraction heuristics, which assume pedestrians are roughly centered after detection. Performance may degrade for off-center individuals, indicating a dependency on the upstream detector’s output quality.

7. Conclusions

To address the problem of low image resolution, this study has proposed a DCT-based local super-resolution gender recognition model for blurry faces. By enhancing the resolution of high-frequency information image blocks and integrating them with low-frequency information blocks, the proposed model achieves local super-resolution of an input image. In addition, feature value optimization through fusion network intermediate layers further enhances local classifiers’ accuracy.

Furthermore, aiming to improve the accuracy of pedestrian gender recognition in surveillance video data, a multi-region feature information fusion pedestrian gender recognition model has been designed. The collected data are preprocessed using the DCT-based local super-resolution model to enhance the resolution of pedestrian images. Then, pedestrian images are divided into face, hair, and clothing regions, and gender classifiers constructed for each part are employed. The results of the three local gender classifiers are input into the information fusion center, and the pedestrian gender probability is calculated based on the Bayesian criteria to obtain the final pedestrian gender result. The experimental results have demonstrated that the proposed pedestrian gender recognition model performs well on real-world datasets and can effectively solve the problem of low gender recognition accuracy caused by blurry pedestrian images in low-resolution surveillance video.

Author Contributions

Conceptualization, W.Y.; methodology, G.L.; validation, Y.Z. and N.J.; formal analysis, Y.Z.; investigation, N.J.; resources, G.L.; data curation, G.L.; writing—original draft preparation, Y.Z.; writing—review and editing, W.Y.; visualization, L.H.; funding acquisition, W.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (under Grant Number U23A20603) and Department of Science & Technology of Liaoning Province (2024-MSLH-399).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The original contributions of the study are included in the article, and further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Cao, M.; Tian, Q.; Ma, T.H.; Chen, S.C. Human Facial Attributes Estimation: A Survey. J. Softw. 2019, 30, 2188–2207.
2. Fatih, E. An Effective Gender Recognition Approach Using Voice Data via Deeper LSTM Networks. Appl. Acoust. 2019, 156, 351–358.
3. O’Shea, K.; Nash, R. An Introduction to Convolutional Neural Networks. arXiv 2015, arXiv:1511.08458.
4. Rajeev, R.; Vishal, M.P.; Rama, C.; Hyper, F. A Deep Multi-task Learning Framework for Face Detection, Landmark Localization, Pose Estimation and Gender Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 121–135.
5. Lin, C.J.; Lin, C.H.; Jeng, S.Y. Using Feature Fusion and Parameter Optimization of Dual-Input Convolutional Neural Network for Face Gender Recognition. Appl. Sci. 2020, 10, 3166.
6. Chen, W.B.; Li, Y.L.; Chen, Y.J. An Age and Gender Recognition Model Based on CNN-SE-ELM. Comput. Eng. Sci. 2021, 43, 872–882.
7. Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a Deep Convolutional Network for Image Super-Resolution. In Computer Vision–ECCV 2014, Proceedings of the 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V; Springer International Publishing: Cham, Switzerland, 2014; pp. 184–199.
8. Chao, D.; Chen, C.L.; Xiaoou, T. Accelerating the Super-Resolution Convolutional Neural Network. Comput. Vis. 2016, 9906, 391–407.
9. Tian, Y.; Jia, R.S.; Deng, M.D.; Zhao, C.Y. A Super-Resolution Reconstruction Method Based on Convolutional Neural Network in the Field of Fuzzy License Plate Image. Comput. Appl. Softw. 2020, 37, 159–164+228.
10. Deshpande, A.; Razmjooy, N.; Estrela, V.V. Introduction to Computational Intelligence and Super-Resolution. In Computational Intelligence Methods for Super-Resolution in Image Processing Applications; Springer International Publishing: Cham, Switzerland, 2021; pp. 3–23.
11. Raviraja, H.D.S. Enhancing Laryngeal Spinocellular Carcinoma Image Security with DCT. Indian J. Otolaryngol. Head Neck Surg. 2023, 76, 695–701.
12. Zhang, H.C.; Ji, F.; Zhong, X.X. Super-Resolution Reconstruction Algorithm of Single Image Based on CNN with Gaussian Blur. Comput. Appl. Softw. 2022, 39, 231–235+295.
13. Luo, H.W.; Liu, B.; Yao, H.; Wang, J.B.; Yuan, H.Q.; Liu, G.H. Lightweight Gender and Age Estimation Algorithm Based on Improved Efficient Net. Transducer Microsyst. Technol. 2023, 42, 114–118.
14. Li, B.Y.; Liu, Q.G. Study on Decision-making of Safety Risk Factors of High-speed Railway Station Based on Fuzzy Fault Tree and BN. Railw. Stand. Des. 2024, 68, 145–152.
15. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep Learning Face Attributes in the Wild. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 3730–3738.
16. Cheng, Z.; Zhu, X.; Gong, S. Low-resolution face recognition. In Proceedings of the 14th Asian Conference on Computer Vision (ACCV 2018), Perth, Australia, 2–6 December 2018; Springer International Publishing: Cham, Switzerland, 2018; pp. 605–621.
17. Cai, L.; Zhu, J.Q.; Zeng, H.Q.; Chen, J.; Cai, C.H.; Ma, K.K. HOG-assisted deep feature learning for pedestrian gender recognition. J. Frankl. Inst. 2017, 28, 13–30.
18. Cao, K.; Rong, Y.; Li, C.; Tang, X.; Loy, C.C. Pose-robust face recognition via deep residual equivariant mapping. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5187–5196.
19. Lu, Y.; Ebrahimi, T. Cross-resolution face recognition via identity-preserving network and knowledge distillation. In Proceedings of the 2023 IEEE International Conference on Visual Communications and Image Processing (VCIP), Jeju, Republic of Korea, 4–7 December 2023; pp. 1–5.
20. Yue, L.; Shen, H.; Li, J.; Yuan, Q.; Zhang, H.; Zhang, L. Image super-resolution: The techniques, applications, and future. Signal Process. 2016, 128, 389–408.
21. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.P.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690.
22. Zhou, L.; Li, Y.; Feng, Y.; Shen, D.; Wang, H.; Dong, F. Super-Resolution Task Inference Acceleration for In-Vehicle Real-Time Video via Edge–End Collaboration. Appl. Sci. 2025, 15, 11828.
23. Xiang, X.; Morton, J.; Reda, F.A.; Young, L.; Perazzi, F.; Ranjan, R.; Kumar, A.; Colaco, A.; Allebach, J. HIME: Efficient Headshot Image Super-Resolution with Multiple Exemplars. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 1694–1704.
24. Abdelwhab, A.; Viriri, S. A survey on soft biometrics for human identification. In Machine Learning and Biometrics; IntechOpen: Rijeka, Croatia, 2018; p. 37.
25. Li, B.; Lian, X.C.; Lu, B.L. Gender classification by combining clothing, hair and facial component classifiers. Neurocomputing 2012, 76, 18–27.
26. Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q.; Wang, S. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Computer Vision—ECCV 2018, Proceedings of the 15th European Conference, Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 480–496.
27. Chen, Y.; Duffner, S.; Stoian, A.; Dufour, J.-Y.; Baskurt, A. Pedestrian attribute recognition with part-based CNN and combined feature representations. In Proceedings of the 13th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, Funchal, Portugal, 27–29 January 2018.
28. Viola, P.; Jones, M. Robust real-time face detection. Int. J. Comput. Vis. 2004, 57, 137–154.
29. Liggins, M.E.; Hall, D.L.; Llinas, J. Handbook of Multisensor Data Fusion: Theory and Practice; CRC press: Boca Raton, FL, USA, 2017.
30. Mudawi, A.N.; Qureshi, A.M.; Abdelhaq, M.; Alshahrani, A.; Alazeb, A.; Alonazi, M.; Algarni, A. Vehicle Detection and Classification via YOLOv8 and Deep Belief Network over Aerial Image Sequences. Sustainability 2023, 15, 14597.
31. Madan, A. Face Recognition using Haar Cascade Classifier. Int. J. Mod. Trends Sci. Technol. 2021, 7, 85–87.
32. Du, Y.F.; Zhao, H.N.; Wang, Z.Y. Super-resolution stress imaging for terahertz-elastic based on SRCNN. J. Exp. Mech. 2022, 37, 323–331.
33. Arjun, A.P.; Suryanarayan, S.; Viswamanav, S.R.; Abhishek, S.; Anjali, T. Unveiling Underwater Structures: MobileNet vs. EfficientNet in Sonar Image Detection. Procedia Comput. Sci. 2024, 233, 518–527.
34. Duan, M.X.; Li, L.L.; Yang, C.Q.; Li, K.Q. A Hybrid Deep Learning CNN-ELM for Age and Gender Classification. Neurocomputing 2018, 275, 448–461.
35. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.
36. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A.; Liu, W.; et al. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9.
37. Buyukyilmaz, M.; Cibikdiken, A.O. Voice Gender Recognition Using Deep Learning. In Proceedings of the 2016 International Conference on Modeling, Simulation and Optimization Technologies and Applications (MSOTA2016), Xiamen, China, 18–19 December 2016; Atlantis Press: Dordrecht, The Netherlands, 2016; pp. 409–411.

Figure 1. Flowchart of the DCT-based block selection method.

Figure 2. FSRCNN model diagram.

Figure 3. Illustration of the pedestrian image regions.

Figure 4. DCT image fusion comparison experiment: (a) Source image 1; (b) Source image 2; (c) “DCT + Variance” fusion result; (d) “DCT + Variance + CV” fusion result.

Figure 5. Comparison of the PSNR values before and after super-resolution implementation.

Figure 6. Experimental results obtained by different models: (a) Experimental results of the proposed model; (b) Misclassification of male results by other models; (c) Correct classification of male results by the proposed model. Misclassification of female results by other models; (d) Correct classification of female results by the proposed model.

Figure 7. Results of the pedestrian gender recognition model.

Table 1. Comparison results of different models.

Algorithm Model	Accuracy	Recall	F1-Score
RCNN-Gender	0.76	0.80	0.78
CNN-ELM	0.81	0.83	0.82
CNN-SE-ELM	0.83	0.82	0.82
LNets-ANet	0.85	0.86	0.85
DCT-PFSR-CNN	0.89	0.88	0.88

Table 2. Results for images where the pedestrians’ faces could be detected.

Actual Gender	Predicted Male	Predicted Female
Male	174	19
Female	13	118

Table 3. Results for images where the pedestrians’ faces could not be detected.

Actual Gender	Predicted Male	Predicted Female
Male	97	11
Female	12	72

Table 4. Comparison results of different models on the real-world dataset.

Algorithm Model	mAP	AUC
Mini-CNN	0.60	0.65
VGGNet16	0.73	0.79
GoogleNet	0.76	0.80
ResNet50	0.79	0.82
HDFL	0.83	0.84
MPGRM	0.85	0.86

Share and Cite

MDPI and ACS Style

Zhang, Y.; Yan, W.; Liu, G.; Jin, N.; Han, L. An Efficient Pedestrian Gender Recognition Method Based on Key Area Feature Extraction and Information Fusion. Appl. Sci. 2026, 16, 1298. https://doi.org/10.3390/app16031298
