Adaptive Decision Fusion in Probability Space for Pedestrian Gender Recognition

Cai, Lei; Zheng, Huijie; Ruan, Fang; Chen, Feng; Xiang, Wenjie; Lin, Qi; Shi, Yifan

doi:10.3390/app16083640

Open AccessArticle

Adaptive Decision Fusion in Probability Space for Pedestrian Gender Recognition

by

Lei Cai

^1,†

,

Huijie Zheng

^2,†,

Fang Ruan

¹,

Feng Chen

¹,

Wenjie Xiang

^1,*,

Qi Lin

² and

Yifan Shi

¹

College of Engineering, Huaqiao University, Quanzhou 362021, China

²

College of Information Science and Engineering, Huaqiao University, Xiamen 361021, China

^*

Author to whom correspondence should be addressed.

^†

These authors contributed equally to this work.

Appl. Sci. 2026, 16(8), 3640; https://doi.org/10.3390/app16083640

Submission received: 4 February 2026 / Revised: 27 March 2026 / Accepted: 3 April 2026 / Published: 8 April 2026

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

Download

Browse Figures

Versions Notes

Abstract

Pedestrian gender recognition plays an important role in pedestrian analysis and intelligent video applications, for example, in demographic statistics, soft biometric analysis, and context-aware person retrieval. However, it remains a challenging task owing to viewpoint variations, illumination changes, occlusions, and low image quality in real-world imagery. To address these issues, an effective adaptive decision fusion framework, termed the Decision Fusion Learning Network (DFLN), is proposed in this paper. The key novel aspect of DFLN is that it effectively explores both an appearance-centered view that emphasizes detailed texture and clothing information and a structure-centered view that captures rich contour and structural information for pedestrian gender recognition. To realize DFLN, a Parallel CNN Prediction Probability Learning Module (PCNNM) is first constructed to independently learn modality-specific probabilities from color image and edge maps. Subsequently, a learnable Decision Fusion Module (DFM) is designed to fuse the modality-specific probabilities and explore their complementary merits for realizing accurate pedestrian gender recognition. The DFM can be easily coupled with the PCNNM, forming an end-to-end decision fusion learning framework that simultaneously learns the feature representations and carries out adaptive decision fusion. Experiments on two pedestrian benchmark datasets, named PETA and PA-100K, show that DFLN achieves competitive or superior performance compared with several state-of-the-art pedestrian gender recognition methods. Extensive experimental analysis further confirms the effectiveness of the proposed decision fusion strategy and its favorable generalization ability under domain shift.

Keywords:

pedestrian gender recognition; parallel CNN prediction probability learning module; adaptive decision fusion; probability space

1. Introduction

Since gender is often used as a soft biometric cue, pedestrian gender recognition, which aims to classify pedestrian images into predefined gender categories, has become a widely studied problem in smart-city-related computer vision applications [1,2,3]. It is expected to have broad application prospects, such as the following [4,5,6,7,8]: (1) Human–computer interaction: Intelligent machines (e.g., autonomous vehicles) can leverage gender information to provide more appropriate interaction and communication. (2) Intelligent video applications: restricting pedestrians’ access to some areas (e.g., dormitories) that allow only a specific gender is often required for public safety. (3) Multimedia retrieval systems: Gender is an important label for indexing, annotating and searching the massive supply of visual materials (images/videos). (4) Demographic collection: Automated shopping statistics and behavior analysis of males and females are very useful for various marketing and business scenarios (e.g., advertisement). In this work, we follow the common setting in existing pedestrian datasets and focus on binary gender labels.

In practical scenarios, pedestrian images are usually captured by different kinds of cameras (e.g., high/low resolution) under various environments (e.g., long distance, indoor/outdoor.) without pedestrians’ cooperation. Some examples of pedestrian images selected from pedestrian attribute datasets, such as PA-100K [9] and PETA [10], are shown in Figure 1. It can be easily observed that some crucial challenges, such as illumination changes, viewpoint variations, diverse appearances, and arbitrary postures, are frequently encountered in pedestrian images. Therefore, it is still difficult to develop an effective method for accurately recognizing the gender of a pedestrian from images captured in practical scenarios. To address this problem, several pedestrian attribute datasets, such as the abovementioned PA-100K [9] and PETA [10], have been constructed to facilitate the learning and evaluation of various pedestrian gender recognition methods. Promoted by these datasets, multiple gender recognition methods have been developed by exploring various features (e.g., HOG [11], PixelHOG [12], biologically inspired features [13], gait features [14,15], and elliptic Fourier descriptors [16]) and a wide variety of methodologies (e.g., AdaBoost [11], SVM [12,13], sparse reconstruction [15], and deep learning [17,18,19,20]). The related works will be organized and highlighted in Section 2.

For pedestrian gender recognition, different image modalities for the same pedestriancan provide complementary information. Intuitively, a color pedestrian image (see Figure 1) provides an appearance-centered view that emphasizes detailed information such as texture and clothing, while an edge map (see Figure 2) provides a structure-centered view that captures rich contour and structural information on the human body. These two views can be regarded as complementary to one another. However, existing deep learning–based approaches mostly focus on learning discriminative feature representations from a single image modality [17,21], typically color images. Furthermore, although several works [18,22] have begun to explore different fusion methods for pedestrian gender recognition, these approaches predominantly focus on feature representation learning, and fusion is often implemented using simple strategies such as feature concatenation. Moreover, in recent years, the research community has increasingly explored more advanced structural modalities such as saliency maps [23], skeleton representations [24], pose estimation [25], and semantic segmentation [26]. These modalities indeed offer richer high-level semantic priors. However, compared to the abovementioned modalities, edge maps were chosen for the present study because they are more methodologically aligned with the scope of this work. First, edge maps provide a structure-centered representation that is relatively insensitive to illumination, color variation, and background clutter, which are common challenges in pedestrian imagery. Second, unlike pose or segmentation models, which require additional supervision and complex pre-trained detectors, edge extraction does not introduce task-specific bias or additional annotation dependencies.

In this paper, with these considerations, we present an adaptive decision fusion framework for pedestrian gender recognition, called the Decision Fusion Learning Network (DFLN). The proposed DFLN starts with a Parallel CNN Prediction Probability Learning Module (PCNNM), which consists of two parallel convolutional neural networks that independently learn posterior class probability vectors from color pedestrian images and their corresponding edge maps. In addition to the PCNNM, a learnable Decision Fusion Module (DFM) is designed to take the modality-specific prediction probabilities as input, concatenate them into a joint probability vector, and adaptively fuse them via a fully connected layer and a softmax operation to achieve accurate pedestrian gender recognition. The proposed DFLN can simultaneously learn the feature representations and carry out adaptive decision fusion in an end-to-end manner, in contrast to existing multimodal fusion strategies, where the decision fusion and the predictions of different branches are optimized independently or combined through pre-defined fusion rules. Furthermore, the backpropagation-based derivation was formalized to demonstrate how the predictions from PCNNM-U and PCNNM-L interact and thus indicate their complementary behavior, which provides essential support for the feasibility and interpretability of our proposed adaptive decision fusion framework.

In summary, the main contributions of this article are as follows:

We present an adaptive decision fusion framework for pedestrian gender recognition. By operating directly on the prediction probabilities learned in parallel from color pedestrian images and their corresponding edge maps, the proposed framework can fully exploit the complementary merits of the modality-specific probabilities learned from color pedestrian images and their edge maps for accurate pedestrian gender recognition.
We mathematically analyze the complementary behavior of the proposed decision fusion mechanism (see Section 3), which demonstrates the feasibility and interpretability of our proposed DFLN for pedestrian gender recognition.
Extensive experimental results on two large-scale pedestrian attribute benchmarks, namely, PETA and PA-100K, show that the proposed DFLN achieves competitive or superior performance compared with multiple state-of-the-art pedestrian gender recognition methods. Moreover, by inserting the proposed DFM into several well-known neural networks (e.g., VGGNet [27], GoogleNet [28], ResNet [29], and PiT [30]), we achieve consistent performance improvements for pedestrian gender recognition.

The remainder of this paper is organized as follows. Section 2 introduces related works. Section 3 describes the proposed DFLN in detail. Section 4 presents the experimental results and analyses. Finally, Section 5 concludes the paper.

2. Related Work

2.1. Pedestrian Gender Recognition Based on Hand-Crafted Features

To recognize the genders of pedestrians, the straightforward solution is first to extract some hand-crafted features and then to apply a classical binary classifier based on the traditional object recognition framework. For example, Cao et al. [11] investigated gender recognition by considering the body structure of humans. This method divided the image of the human body into small patches, represented using the Histogram of Oriented Gradients (HOG) feature, and then adopted the classifier AdaBoost for gender classification. Collins et al. [12] developed a feature descriptor (namely, PixelHOG+LocalHSV) for gender recognition by combining dense HOG features computed from a custom edge map and color features computed based on the hue and saturation values of pixels. Guo et al. [13] employed biologically inspired features derived from Gabor filters for gender recognition. Furthermore, this method considered the viewing angle (i.e., front, back, or side) in pedestrian gender classification to improve performance. Hu et al. [14] proposed a supervised modeling method that integrated the shape appearance and temporal dynamics of both the male and female genders into a so-called mixed conditional random field. Lu et al. [15] extracted human silhouettes from a gait sequence collected from arbitrary walking directions, after which the cluster-based averaged gait images were computed as features. Then, a sparse reconstruction based metric learning method was developed to obtain discriminative information for gender recognition. Isaac et al. [16] investigated gait-based gender recognition by combining several hand-crafted features, including elliptic Fourier descriptors, consolidated vectors of row–column summation, and depth gradient histograms.

2.2. Pedestrian Gender Recognition Based on Deep Learning–Derived Features

Unlike the traditional object recognition framework, deep learning integrates feature extraction and classification in a unified framework, which is capable of automatically learning features from large-scale input data. Consequently, deep learning has been successfully applied in various object recognition/classification tasks, such as handwritten digit recognition [31], face recognition [32,33], person re-identification [34,35,36], vehicle re-identification [37,38,39,40], and epileptic signal classification [41].

Regarding pedestrian gender recognition, Ng et al. [42] proposed a CNN model for pedestrian gender recognition on the MIT dataset. Their CNN model contains two convolutional layers, two subsampling layers, and one fully-connected layer with 25 neuron units. Antipov et al. [21] presented a neural network architecture, denoted as Mini-CNN, for gender recognition on the PETA [10] dataset. Moreover, they fine-tuned the well-known AlexNet, designed by Krizhevsky et al. [43], to effectively improve gender recognition performance. Raza et al. [17] exploited a decompositional neural network [44] to convert the original pedestrian image to the corresponding parsed image, which was further fed into a stacked sparse auto-encoder with a softmax classifier to identify pedestrian gender. To fully explore the advantages of both deep learning–based and hand-crafted features, Cai et al. [18] developed an effective HOG-assisted Deep Feature Learning (HDFL) method, in which the deep learning–based and weighted HOG features are simultaneously extracted for the input pedestrian image and then fused together to obtain a more discriminative feature representation. Ng et al. [45] tackled pedestrian gender recognition by designing CNNs based on combined global and local parts. Their study revealed that the upper half region of the human body is more informative than the middle region or the lower half of the human body for this gender identification task.

Bei et al. [46] developed a two-stream network considering bodily appearance fused with temporal variations in gait sequences to recognize gender. Abbas et al. [47] investigate pedestrian gender classification using a vision transformer–based framework that incorporates locality self-attention and shifted patch tokenization to enhance both local inductive bias and global context modeling. Lee et al. [26] developed a rough body segmentation–based framework for long-distance gender recognition using infrared imagery. Mbongo et al. [48] conducted a comprehensive cross-domain evaluation of six recently developed deep learning architectures across multiple benchmarks for pedestrian gender recognition.

2.3. Decision-Level Fusion

Pedestrian gender recognition has been extensively studied using both handcrafted features and deep convolutional neural networks (CNNs). In recent years, several approaches (e.g., [18,22]) have relied mostly on feature-level fusion, where heterogeneous visual cues such as appearance, shape, or texture descriptors are concatenated to improve discriminative capability. Despite their effectiveness, these approaches predominantly focus on feature representation learning, while the feature-level fusion is often implemented using simple strategies such as feature concatenation or fixed-weight averaging.

To address the limitations of feature-level fusion, several studies have explored decision-level fusion mechanisms that combine the outputs of multiple classifiers. Ensemble learning and probability-based fusion strategies [49,50] have been widely adopted to aggregate predictions from heterogeneous models (or modalities), including averaging-based fusion [51,52], logit-level fusion [53], Bayesian-based fusion [54], attention-based fusion [55], uncertainty-aware fusion [56], and shallow classifier–based fusion [57]. Among them, averaging-based fusion computes an equally weighted average of output probabilities and makes decisions accordingly, implicitly assuming equal reliability across all prediction branches and thereby failing to fully leverage their complementary behaviors. Logit-level fusion operates in the logit space by aggregating unnormalized prediction scores prior to softmax normalization. As a result, it is inherently sensitive to variations in the logit scale, which can undermine its ability to effectively exploit their complementarity. Bayesian decision fusion is based on Bayesian inference. Its core idea is to treat the prediction results of multiple models as independent evidence and obtain the optimal decision through posterior probability inference. Uncertainty-aware fusion dynamically weights predictions according to their estimated uncertainty, enhancing robustness but relying heavily on accurate uncertainty estimation. Attention-based fusion dynamically learns the importance weights of different branches on the basis of the input samples, thereby achieving selective feature or decision fusion. Shallow classifier–based fusion uses the predictions from different base models as feature inputs and then trains a shallow classifier (such as LR, SVM, or MLP) to learn the optimal fusion mapping. However, the shallow classifier is typically trained separately from these base models, resulting in a decoupled optimization process that limits the interactions between feature learning and decision fusion. In contrast, the proposed DFLN can simultaneously learn the feature representations and carry out adaptive decision fusion in an end-to-end manner, so as to more effectively explore the complementary merits of the modality-specific probabilities learned from the color pedestrian image and its edge map, thereby delivering greater accuracy in gender recognition.

3. Proposed Decision Fusion Learning Network

In this paper, we present an adaptive decision fusion framework, called the Decision Fusion Learning Network (DFLN), which operates in the probability space to enhance pedestrian gender recognition. Figure 3 provides a schematic illustration of the proposed DFLN, which mainly consists of two learning-capable modules: the Parallel CNN Prediction Probability Learning Module (PCNNM) and the Decision Fusion Module (DFM), which will be individually introduced in the following two subsections.

3.1. Parallel CNN Prediction Probability Learning Module

Considering that the gender prediction probabilities derived from a color pedestrian image and its corresponding edge map offer two complementary views, i.e., an appearance-centered view emphasizing texture and clothing attributes and a structure-centered view capturing body contours and global shape, a Parallel CNN Prediction Probability Learning Module (PCNNM) is first constructed to independently learn the prediction probabilities from the original color pedestrian image and its edge map, respectively. As shown in Figure 3, the PCNNM is composed of two compact and lightweight CNNs. Each CNN is composed of six convolutional layers (i.e., C1, C2, C4, C5, C7, and C8), two max-pooling layers (i.e., M3 and M6), one average-pooling layer (i.e., A9), one fully connected layer (i.e., F10) with 128 neuron units, and one fully connected layer (i.e., F11) with two neuron units corresponding to gender (i.e., male/female) classes, respectively.

In the input layer, the color pedestrian images (three channels) and their corresponding edge maps (a single channel) with dimensions of

128 \times 48

are fed to the upper and lower parts of the PCNNM, respectively. The edge map of the input pedestrian image is computed as the root mean square of image directional gradients in the horizontal and vertical directions [58], that is:

M (i) = \sqrt{{(I \otimes G_{x})}^{2} (i) + {(I \otimes G_{y})}^{2} (i)},

(1)

where ⊗ represents the convolution operation;

I

means the raw pixel pedestrian image;

G_{x}

and

G_{y}

are the

3 \times 3

horizontal and vertical Prewitt filters, which can be defined as follows:

G_{x} = (\begin{matrix} 1 / 3 & 0 & - 1 / 3 \\ 1 / 3 & 0 & - 1 / 3 \\ 1 / 3 & 0 & - 1 / 3 \end{matrix}), G_{y} = (\begin{matrix} 1 / 3 & 1 / 3 & 1 / 3 \\ 0 & 0 & 0 \\ - 1 / 3 & - 1 / 3 & - 1 / 3 \end{matrix}) .

(2)

Some examples of the extracted edge map corresponding to the pedestrian images in Figure 1 are presented in Figure 2.

In convolutional layers (i.e., C1, C2, C4, C5, C7, and C8), the number of filters is set to 32, as keeping the number of filters low is beneficial to avoid overfitting. Moreover, small filters with dimensions of

3 \times 3

are applied to save the filter parameters. To retain the image details as much as possible, the stride of each of the convolutional layers is set to 1. To retain the edge information and the size of feature maps, a padding operation is applied for the first four convolutional layers (i.e., C1, C2, C4, and C5). Batch normalization (BN) [59] operation is used after all convolution operations to accelerate the network training. In the pooling layers (i.e., M3, M6, and A9), the

2 \times 2

max pooling and

4 \times 4

average pooling operations are applied, respectively, and the stride is uniformly set to 2. In the F10 layer, 128 neuron units fully connected to the A9 layer. Next, dropout [43] is enabled to set the output of each neuron to zero with 0.5 probability to avoid overfitting. The output layer (i.e., F11) has two neuron units fully connected to the F10 layer, and softmax activation units are applied in the F11 layer to form a binary classifier. Each softmax activation unit is associated with a class label, and the output value gives the posterior probability of the corresponding gender class. More details of the recommended configuration of the PCNNM can be referred to Table 1.

3.2. Decision Fusion Module

The output prediction probabilities (posterior class probabilities) from the upper part of the PCNNM (denoted as PCNNM-U) and the lower part of the PCNNM (denoted as PCNNM-L) naturally define a probability space where different prediction streams can be fused at the decision level. For that purpose, on the basis of the PCNNM, a learning-capable Decision Fusion Module (DFM) is designed; this module takes the modality-specific probabilities as input, concatenates them into a joint probability vector, and fuses them via a fully connected layer and softmax operation, yielding an adaptive decision fusion mechanism in the probability space.

As shown in Figure 3, the probabilities of the gender classes derived from the PCNNM are fed into the DFM, in which a Concat layer is used to combine them. Consequently, the output probability vector,

P^{C}

, is composed of four elements and can be formulated as follows:

P^{C} = [p (C_{1} | I), p (C_{2} | I), p (C_{1} | M), p (C_{2} | M)],

(3)

where

p (C_{1} | I)

and

p (C_{2} | I)

denote the output posterior probabilities of the corresponding gender classes (i.e.,

C_{1}

,

C_{2}

) from PCNNM-U with the color pedestrian image

I

as input, while

p (C_{1} | M)

and

p (C_{2} | M)

denote those from PCNNM-L with the edge map

M

as input. Then, the outputs of Concat layer are fused together by performing a fully connected operation and a softmax activation to obtain the final probability vector

P^{F}

, which contains two elements:

\begin{matrix} P^{F} = [p (C_{1} | P^{C}), p (C_{2} | P^{C})], \end{matrix}

(4)

where

\begin{matrix} p (C_{1} | P^{C}) = \frac{exp (W_{1} P^{C} + b_{1})}{\sum_{i = 1}^{2} exp (W_{i} P^{C} + b_{i})}, \end{matrix}

(5)

\begin{matrix} p (C_{2} | P^{C}) = \frac{exp (W_{2} P^{C} + b_{2})}{\sum_{i = 1}^{2} exp (W_{i} P^{C} + b_{i})}, \end{matrix}

(6)

in which

b

denotes biases and

W

represents the learnable fusion weights.

3.3. Complementary Behavior Analysis

Built upon the Decision Fusion Module (DFM), the proposed DFLN can explore the complementary merits of modality-specific prediction probabilities learned by PCNNM-U and PCNNM-L, which individually learn the prediction probabilities from the color pedestrian image and the edge map. This can be observed from both forward propagation and backpropagation processes in the proposed DFLN method, which can be mathematically demonstrated as follows.

First, according to Equations (3)–(6), it can be clearly seen that the final probability vector

P^{F} = [p (C_{1} | P^{C}), p (C_{2} | P^{C})]

will certainly be subject to the output probability vectors

[p (C_{1} | I), p (C_{2} | I)]

and

[p (C_{1} | M), p (C_{2} | M)]

resulting from PCNNM-U and PCNNM-L in the forward propagation process.

Second, in the back propagation process, the proposed DFLN is designed to minimize the objective function

J

, which is solved by employing Stochastic Gradient Decent (SGD) [60]. Let

W_{i j}^{n}

be the weight connecting the jth neural unit in the (n − 1)th layer and the ith neural unit in the nth layer, while

γ_{i}^{n} = \sum_{i} W_{i j}^{n} z_{j}^{n - 1}

, where

z_{j}^{n - 1} = h (γ_{j}^{n - 1})

and h represents an activation function. Here, taking layer F11 in PCNNM-U as an example, the update step of the weight

W_{i j}^{11}

can be formulated as

W_{i j, n e w}^{11} = W_{i j}^{11} - λ \frac{\partial J}{\partial W_{i j}^{11}},

(7)

where

λ > 0

denotes the learning rate, and

\frac{\partial J}{\partial W_{i j}^{11}} = z_{j}^{10} δ_{i}^{11},

(8)

where

z_{j}^{10}

is the jth element of output feature from layer F10 in the PCNNM-U, while

δ_{i}^{11} = (\sum_{m} W_{m i}^{12} δ_{m}^{12}) h^{'} (γ_{i}^{11}),

(9)

δ_{m}^{12} = \frac{\partial J}{\partial γ_{m}^{12}},

(10)

where

γ_{m}^{12} = \sum_{m} W_{m i}^{12} P_{i}^{C}

, in which

P_{i}^{C}

is the ith element in

P^{C}

. Therefore, it can be found that through

P_{i}^{C} \to γ_{m}^{12} \to δ_{m}^{12} \to δ_{i}^{11} \to \frac{\partial J}{\partial W_{i j}^{11}} \to W_{i j, n e w}^{11}

, the value of

W_{i j, n e w}^{11}

in PCNNM-U will be affected by the final probability vector

P^{C}

in the backpropagation process. Note that the same conclusion can be drawn for PCNNM-L.

The above analyses indicate that the PCNNM-U and PCNNM-L, which individually learn the modality-specific prediction probabilities from the pedestrian images and edge maps, are capable of playing complementary roles in the forward propagation and backpropagation processes through the proposed DFM. Consequently, the proposed DFLN method is able to deliver better performance in pedestrian gender recognition.

4. Experimental Results and Analyses

4.1. Dataset and Evaluation Protocol

To validate their performance, the proposed DFLN approach and multiple recently developed pedestrian gender recognition methods are evaluated exclusively on two publicly available benchmark pedestrian attribute datasets, namely, PETA [10] and PA-100K [9], which are widely used in the research community. The study focuses on methodological development and performance evaluation in controlled academic settings, without involving real-world deployment or personal identification. One can see from some examples shown in Figure 1 that these two datasets contain different kinds of pedestrian images, greatly varying in postures, scenes, resolutions, and camera viewpoints (e.g., front, back and side).

PETA consists of 19,000 pedestrian images from multiple datasets, namely, 3DPeS, CAVIAR, CUHK, GRID, i-LIDS, PRID, SARC3D, TownCentre, VIPeR, and MIT. By following the same practice in [18,21], we discarded some unsuitable images, such as those containing severe object occlusion or an unidentified target, as well as those with very low resolution. After those images were eliminated, 14,759 pedestrian images remained in PETA. Tenfold cross-validation is used to report the final performance on the PETA dataset.

PA-100K consists of 100,000 pedestrian images and is currently the largest pedestrian image dataset with gender attribute labels, to the best of our knowledge. For our experiments, the PA-100K dataset was randomly split into training, validation, and test sets with a ratio of 8:1:1, that is, 80,000 pedestrian images for training, 10,000 pedestrian images for validation, and 10,000 pedestrian images for testing. This division was repeated five times to report the final performance on the PA-100K dataset.

The final results are reported on the test sets according to the following evaluation metrics: accuracy, area under ROC curve (AUC), and receiver operating characteristic (ROC) curve.

4.2. Implementation Details

For the supervised training of the proposed DFLN, the weights were initialized randomly from a normal distribution

N (0, 0.01)

, and the bias was set to 0. The weights were updated by a backpropagation algorithm that applied mini-batch Stochastic Gradient Descent (SGD) [60] to the training dataset in randomized order. A mini-batch size of 128 was chosen. The initial learning rate was set as

l = 0.01

and decreased by

l \times 0.1

after every 20,000 iterations on the PA-100K dataset and every 5000 iterations on the PETA dataset; the momentum was set to 0.9. All experiments were completed on a workstation configured with a single NVIDIA GeForce RTX 3090.

4.3. Performance Comparison and Analysis

4.3.1. Comparison with State of the Art

Table 2 shows a performance comparison between the proposed DFLN and two recently developed methods, namely, MAMBA [48] and YinYang-Net [61], on the test sets of PETA [10] and PA-100K [9], respectively. From Table 2, one can see that the proposed DFLN method achieves superior or competitive performance in terms of accuracy and AUC, meaning that the proposed DFLN outperforms other pedestrian gender recognition methods.

To further analyze how much is contributed by the color pedestrian image and by the edge map, respectively, the performance results from PCNNM-U and PCNNM-L individually are also shown in Table 2. We can observe that PCNNM-U and PCNNM-L also achieve relatively good performance. Nonetheless, the proposed DFLN still significantly achieves superior performance in terms of accuracy and AUC. For example, on the PETA [10] dataset, the proposed DFLN realizes an accuracy improvement of

+ 2.8 %

(from

88.7 %

to

91.5 %

) and an AUC improvement of

+ 2 %

(from

0.95

to

0.97

) with respect to PCNNM-U, as well as an accuracy improvement of

+ 4.2 %

(from

87.3 %

to

91.5 %

) and an AUC improvement of

+ 3 %

(from

0.94

to

0.97

) with respect to PCNNM-L. This is because color pedestrian images provides an appearance-centered view that contains more image details, such as texture and clothing, while edge maps provide a structure-centered view that captures rich contour and structure information of the human body. Consequently, the prediction probabilities learned by PCNNM-U and PCNNM-L, respectively, could play complementary roles to each other through our designed adaptive Decision Fusion Module (DFM), so as to effectively improve pedestrian gender recognition performance. The same conclusion can also be obtained from the confusion matrices shown in Figure 4 and Figure 5, in which each confusion matrix has two rows and two columns, representing the counts of true negatives, false positives, false negatives, and true positives. For example, on the PETA [10] dataset, the proposed DFLN correctly predicted 602 instances of female class 0 (i.e., true negatives) and 748 instances of male class 1 (i.e., true positives), while incorrectly predicting 51 instances as male class 1 (i.e., false positives) and 74 instances as female class 0 (i.e., false negatives). Notably, compared to PCNNM-U and PCNNM-L, the proposed DFLN made fewer errors, highlighting its effectiveness in improving the gender recognition process. To evaluate the significance of this difference, a statistical significance test based on a paired t-test was performed to compare the proposed DFLN and PCNNM-U/PCNNM-L. The corresponding results are documented in Table 3. We can see that the p-values for both PCNNM-U vs. DFLN and PCNNM-L vs. DFLN are less than

0.05

. This demonstrates that the performance improvements achieved by DFLN over PCNNM-U and PCNNM-L are statistically significant rather than due to random variation.

In addition, to further demonstrate the superiority of our fusion method, two standard late-fusion baselines, namely, averaging and logistic regression on logits, were used for pedestrian gender recognition. Of the two baselines, the averaging-based fusion method takes an equally weighted average of the output probabilities of the PCNNM-U and PCNNM-L branches and then makes a decision based on the average probability. Logistic regression fusion is a learnable fusion strategy that first concatenates the logits of the two branches into a single feature vector, then trains a binary logistic regression model to optimally combine these features to predict the final gender. The corresponding accuracy and AUC results are reported in Table 4. One can see that the proposed method consistently outperforms the averaging-based and logistic regression-based fusion baselines. Furthermore, a paired t-test was conducted to evaluate the statistical significance of the performance differences between averaging/logistic regression baselines and the proposed DFLN on the PETA [10] and PA-100K [9] datasets. The corresponding results are reported in Table 5. One can see that the p-values for both averaging fusion vs. DFLN and logistic regression vs. DFLN are less than 0.05 in the comparisons of accuracy and AUC. This paired t-test shows that the performance gains achieved by DFLN over the averaging fusion and logistic regression baselines are statistically significant.

4.3.2. Justification of Edge Extraction

Since edge maps are regarded as another modality in addition to the color images used in the proposed DFLN, it is necessary to investigate the justification of the edge extraction method. For that purpose, in addition to the Prewitt operator used to extract the edge maps in this work, another three edge extraction operators, namely, Sobel, Canny, and TEED [62], are introduced for comparison with the Prewitt operator. To make a fair comparison, we first adopted Sobel, Canny, and TEED [62] to extract the edge maps, and we then used them and their corresponding color images to train the PCNNM-L and DFLN models. The corresponding results are reported in Table 6. As can be seen, introducing edge maps in our proposed DFLN framework can consistently enhance performance in pedestrian gender recognition. This further demonstrates that edge maps, which capture rich contour and structural information on the human body, can indeed complement color pedestrian images in our proposed Decision Fusion Learning Network (DFLN). Additionally, from Table 6, one can see that the proposed DFLN trained with edge maps obtained by the Prewitt operator outperforms those trained with edge maps provided by Sobel, Canny, and TEED [62]. Thus, we chose Prewitt as the edge operator in the proposed DFLN.

4.3.3. Effectiveness of Decision Fusion Module

We further comprehensively evaluate the effectiveness of the proposed Decision Fusion Module (DFM) and show its contribution to performance improvement. For that purpose, the performance resulting from different deep learning networks with and without the DFM is investigated. Specifically, in addition to the proposed Parallel CNN Prediction Probability Learning Module (PCNNM), the proposed DFM is incorporated into multiple well-known deep learning networks (namely, VGGNet16 [27], GoogleNet [28], ResNet18/ResNet50 [29], and PiT [30]) to show the universality of the proposed DFM. The corresponding AUCs and parameter counts for different deep learning networks are shown in Table 7. Moreover, the relevant ROC comparisons are illustrated in Figure 6.

The following can be clearly observed: (1) The proposed DFM cooperates well with different classic deep learning networks (i.e., AlexNet [43], VGGNet16 [27], GoogleNet [28], and ResNet18/ResNet50 [29]). Compared to the same classic deep learning networks alone, the proposed DFM provides a performance improvement of 1–5% AUC. This indicates that the proposed DFM is universal and can be applied to different deep learning networks to further improve their performance of pedestrian gender recognition. (2) These deep learning networks with or without the DFM not only contain more parameters than ours but also have consistently lower performance than our proposed DFLN. This implies that these networks may not be the optimal neural network architectures for pedestrian gender recognition.

4.3.4. Inference Time Analysis

Aside from gender recognition performance, inference efficiency is also an important factor for real-time surveillance systems. Accordingly, the inference times of the proposed DFLN, PCNNM-U, and PCNNM-L are documented in Table 8. The following can be observed: (1) PCNNM-U has comparable inference time to PCNNM-L, since they have the same network architecture. (2) The inference time of the proposed DFLN is larger than those of PCNNM-U and PCNNM-L, since it is composed of PCNNM-U, PCNNM-L, and the adaptive Decision Fusion Module (DFM). Therefore, it can be inferred that the inference time of the DFM should be the inference time of the DFLN minus the inference time of PCNNM-U and PCNNM-L, that is,

4.158

ms.

4.3.5. Misclassified Sample Analysis

To sufficiently analyze what kinds of cases can be well recognized by edge maps rather than color pedestrian images, we conduced misclassified sample analysis on the PETA [10] dataset. Figure 7 showcases some examples of pedestrian images that are misclassified by PCNNM-U but correctly classified by both PCNNM-L and the proposed DFLN; Table 9 provides their corresponding gender prediction probabilities. Note that, in our experiments, we set the gender label of female to zero and the gender label of male to one. Therefore, when the prediction probability is less than 0.5, the classifier will identify the corresponding input pedestrian as female. Conversely, when the prediction probability exceeds 0.5, the prediction outcome of the classifier is male.

By analyzing the samples shown in Figure 7 and the prediction probabilities documented in Table 9, we obtain some interesting insights. First, under low illumination conditions, PCNNM-U is more likely to misclassify the gender of pedestrians, while pedestrians’ gender can still be effectively recognized by PCNNM-L under these conditions. This could be due to the fact that image details (e.g., texture and clothing color) under low illumination conditions are unclear (see the first through third pedestrian images in the first row of Figure 7), while the contour and structure information of the human body is well preserved in the edge maps. Second, for pedestrians walking toward the side of the frame (see the fourth through sixth pedestrian images in the first row of Figure 7), PCNNM-U is also prone to misclassify pedestrian gender, because in such cases, fewer gender features, such as hair length and clothing style, are visible in the color pedestrian images, while the edge maps, which can capture the gait features of a pedestrian, are more beneficial for pedestrian gender recognition. Third, even in the case of PCNNM-U making a misclassification, the proposed DFLN is still able to achieve an accurate inference of pedestrian gender with the help of PCNNM-L; see the prediction probabilities derived by PCNNM-U, PCNNM-L, and the DFLN in Table 9 for a comparison. This demonstrates that the Decision Fusion Module (DFM) can indeed adaptively fuse the modality-specific pedestrian probabilities and thereby deliver more accurate pedestrian gender recognition.

4.3.6. Robustness Analysis

To further verify the effectiveness of the proposed DFLN, robustness analysis with respect to class imbalance was conducted on the PETA [10] dataset because this dataset is imbalanced in gender distribution. In our experiments, we removed some unsuitable images from the PETA dataset, leaving a total of 14,759 pedestrian images. According to statistics, among these 14,759 pedestrian images, males accounted for

54.9 %

(8095 images) and females accounted for

45.1 %

(6662 images), which shows a gender imbalance. Thus, to balance the gender proportion, we expanded the proportion of females to be consistent with that of males. Specifically, we augmented the number of female images to 8095 using horizontal image flipping, and we then used the augmented data to re-train PCNNM-U, PCNNM-L, and the DFLN. The corresponding results are shown in Table 10. As can be seen, the performance gain of PCNNM-U trained after gender balancing (termed PCNNM-U#) is

+ 0.9 %

and

+ 0.1

in terms of accuracy and AUC, while the performance gain of PCNNM-L trained after gender balancing (termed PCNNM-L#) is

+ 0.6 %

in terms of accuracy, with no increase in AUC. This shows that, compared with PCNNM-U, PCNNM-L is more robust to gender class imbalance. Regarding the proposed DFLN, its performance gain after gender balancing (termed DFLN#) is

+ 0.7 %

in terms of accuracy, indicating that its robustness to class imbalance is also greater than that of PCNNM-U.

In addition to class imbalance, we analyzed the robustness of pedestrian gender recognition under the condition of occlusion. In order to imitate occlusion, we erased the upper or lower half of the human body separately, as shown in Figure 8. Next, these processed pedestrian images were used to re-train the PCNNM-U, PCNNM-L, and DFLN models in order to analyze their robustness to body occlusion. The corresponding results are documented in Table 11. From Table 11, we obtain some interesting insights. First, the performance of pedestrian gender recognition consistently decreased regardless of whether the upper or lower half of the body was occluded. Second, occluding the upper body caused a more significant decrease in pedestrian gender recognition performance than occluding the lower body. This may be because the upper body contains more gender characteristics, such as hair and facial features. Third, despite the pedestrian’s body being occluded, the proposed DFLN still performs better in gender recognition than the other two branches, i.e., PCNNM-U and PCNNM-L. This further demonstrates that the proposed DFLN incorporating the adaptive decision fusion mechanism can fully leverage the complementary merits of the modality-specific probabilities learned from color pedestrian images and their edge maps, so as to deliver more robust pedestrian gender recognition than the other models with an occluded view.

4.4. Cross-Dataset Evaluation

To evaluate the generalization ability of our proposed method, a cross-dataset evaluation was performed among the proposed DFLN and two recently-developed competing methods, i.e., YinYang-Net [61] and MAMBA [48]. In this cross-dataset evaluation, these compared methods were either trained on the PA-100K [9] dataset and then tested on PETA [10] or vice versa—trained on PETA [10] and then tested on PA-100K [9]. The corresponding results are shown in Table 12.

From these results, we can observe that all the methods suffered a significant performance degradation in the cross-dataset evaluation, compared with the performances documented in Table 2. This is because these models, when trained on a specific dataset, tend to capture dataset-dependent characteristics, including particular distributions of pedestrian appearance, illumination conditions, viewpoints, and backgrounds, while, when evaluated on a different dataset, the models fail to generalize effectively because the target dataset may exhibit distinct visual statistics and data distributions. Such domain discrepancies lead to a distribution shift between the training and testing data, which substantially undermines the discriminative capability of the models and results in notable performance degradation in cross-dataset scenarios. Nevertheless, from Table 2, one can see that the proposed DFLN method had less performance loss than competing methods. For example, when trained on the PA-100K dataset and tested on the PETA dataset, YinYang-Net [61] loses

25.3 %

accuracy and

0.25

AUC, and MAMBA [48] loses

27.0 %

accuracy and

0.28

AUC compared to the results reported in Table 2, while the proposed DFLN method loses

23.9 %

accuracy and

0.23

AUC and still yields the best performance. This cross-dataset evaluation indicates that the proposed DFLN method has better generalization ability than the two competing pedestrian gender recognition methods.

5. Conclusions

In this paper, a learnable adaptive decision fusion method, called the Decision Fusion Learning Network (DFLN), is proposed for pedestrian gender recognition. The superior performance of the DFLN is mainly attributable to an adaptive decision fusion strategy that operates in the probability space and effectively exploits the complementary merits of the modality-specific probabilities learned from color pedestrian images and their edge maps. Specifically, the DFLN is composed of a Parallel CNN Prediction Probability Learning Module (PCNNM) and a Decision Fusion Module (DFM). The PCNNM consists of two parallel convolutional neural networks that independently learn posterior class probability vectors from pedestrian images and its their corresponding edge maps, while the learning-capable DFM takes these modality-specific probabilities as input, concatenates them into a joint probability vector, and adaptively fuses them via a fully connected layer and softmax operation to achieve accurate pedestrian gender recognition. Extensive experiments on two large-scale pedestrian attribute benchmarks, namely, PETA and PA-100K, demonstrate that the DFLN achieves competitive or superior performance compared with multiple state-of-the-art pedestrian gender recognition methods. Moreover, when the proposed DFM was inserted into several well-known neural networks, it was verified to consistently improve their performance for pedestrian gender recognition.

Although the proposed DFLN yields promising performance in pedestrian gender recognition, pedestrian gender recognition remains challenging in scenarios characterized by high inter-class similarity, such as similar clothing styles and body shapes. These factors may obscure discriminative cues and increase gender recognition ambiguity. To mitigate the above limitations, the proposed DFLN could be further refined by incorporating joint feature–probability interaction. For example, instead of operating solely on posterior probabilities, the model could incorporate intermediate semantic embeddings to guide the fusion weights, enabling context-aware modulation when structural or appearance cues become less informative. On the other hand, although the introduction of complementary behavior analysis in this paper provides essential evidence of the feasibility and interpretability of our proposed adaptive decision fusion approach, no solid theoretical justification is established to explain why fusion in probability space should be preferable to alternative fusion spaces for pedestrian gender recognition. Therefore, our future research will focus on how to resolve this limitation.

Author Contributions

Methodology: L.C. and H.Z.; software: L.C., F.C. and F.R.; validation: H.Z. and W.X.; writing—original draft preparation: L.C. and H.Z.; writing—review and editing: W.X.; formal analysis: F.R. and Q.L.; investigation: W.X.; data curation: F.R. and Q.L.; visualization: H.Z.; supervision: Y.S.; funding acquisition: L.C. and W.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Natural Science Foundation of Fujian Province under Grant 2024J01098; in part by the Natural Science Foundation of Xiamen, China, under Grants 3502Z202472011 and 3502Z202571031; and in part by the Scientific Research Funds of Huaqiao University under Grants 24BS109 and 25BS107.

Institutional Review Board Statement

Not applicable. This study used only publicly available datasets and did not involve any direct interaction with human participants or animal subjects.

Informed Consent Statement

Not applicable. All data employed in this study are publicly accessible and were obtained in full compliance with the ethical guidelines established by the respective dataset creators.

Data Availability Statement

The datasets analyzed in this study were two datasets derived by Hydraplus-net: “Attentive deep features for pedestrian analysis” [9] and “Pedestrian attribute recognition at far distance” [10].

Acknowledgments

We are extremely grateful for the valuable suggestions provided by the editors and reviewers.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Yücesoy, E. Automatic age and gender recognition using ensemble learning. Appl. Sci. 2024, 14, 6868. [Google Scholar] [CrossRef]
Cabric, F.; Bjarnadóttir, M.; Ling, M.; Rafnsdóttir, G.; Isenberg, P. Eleven years of gender data visualization: A step towards more inclusive gender representation. IEEE Trans. Vis. Comput. Graph. 2024, 30, 316–326. [Google Scholar] [CrossRef] [PubMed]
Rezgui, Z.; Strisciuglio, N.; Veldhuis, R. Gender privacy angular constraints for face recognition. IEEE Trans. Biom. Behav. Identity Sci. 2024, 6, 352–363. [Google Scholar] [CrossRef]
Lv, Z.; Poiesi, F.; Dong, Q.; Lloret, J.; Song, H. Deep learning for intelligent human–computer interaction. Appl. Sci. 2022, 12, 11457. [Google Scholar] [CrossRef]
Nikpour, B.; Sinodinos, D.; Armanfard, N. Deep reinforcement learning in human activity recognition: A survey and outlook. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 4267–4278. [Google Scholar] [CrossRef]
Ma, C.; Ding, Y.; Zhang, H.; Ma, J.; Zhang, Z.; Wang, Y.; Tao, F. Fine-Grained Offline Ranking for Short Video Advertisements CTR via Visual-and Audio-Based Features. IEEE Trans. Consum. Electron. 2025, 71, 9544–9556. [Google Scholar] [CrossRef]
Yücesoy, E. Gender Recognition based on the stacking of different acoustic features. Appl. Sci. 2024, 14, 6564. [Google Scholar] [CrossRef]
Sun, Y.; Zhang, M.; Sun, Z.; Tan, T. Demographic analysis from biometric data: Achievements, challenges, and new frontiers. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 332–351. [Google Scholar] [CrossRef]
Liu, X.; Zhao, H.; Tian, M.; Sheng, L.; Shao, J.; Yi, S.; Yan, J.; Wang, X. Hydraplus-net: Attentive deep features for pedestrian analysis. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 350–359. [Google Scholar]
Deng, Y.; Luo, P.; Loy, C.L.; Tang, X. Pedestrian attribute recognition at far distance. In Proceedings of the ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; pp. 789–792. [Google Scholar]
Cao, L.; Dikmen, M.; Fu, Y.; Huang, T. Gender recognition from body. In Proceedings of the ACM International Conference on Multimedia, Vancouver, BC, Canada, 26–31 October 2008; pp. 725–728. [Google Scholar]
Collins, M.; Zhang, J.; Miller, P.; Wang, H. Full body image feature representations for gender profiling. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Kyoto, Japan, 29 September–2 October 2009; pp. 1235–1242. [Google Scholar]
Guo, G.; Mu, G.; Fu, Y. Gender from body: A biologically-inspired approach with manifold learning. In Proceedings of the Asian Conference on Computer Vision, Xi’an, China, 23–27 September 2009; pp. 236–245. [Google Scholar]
Hu, M.; Wang, Y.; Zhang, Z.; Zhang, D. Gait-based gender classification using mixed conditional random field. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 2011, 41, 1429–1439. [Google Scholar]
Lu, J.; Wang, G.; Moulin, P. Human identity and gender recognition from gait sequences with arbitrary walking directions. IEEE Trans. Inf. Forensics Secur. 2014, 9, 51–61. [Google Scholar] [CrossRef]
Isaac, E.R.; Elias, S.; Rajagopalan, S.; Easwarakumar, K. Multiview gait-based gender classification through pose-based voting. Pattern Recognit. Lett. 2019, 126, 41–50. [Google Scholar] [CrossRef]
Raza, M.; Sharif, M.; Yasmin, M.; Khan, M.A.; Saba, T.; Fernandes, S.L. Appearance based pedestrians: Gender recognition by employing stacked auto encoders in deep learning. Future Gener. Comput. Syst. 2018, 88, 28–39. [Google Scholar] [CrossRef]
Cai, L.; Zhu, J.; Zeng, H.; Chen, J.; Cai, C.; Ma, K.K. HOG-Assisted Deep Feature Learning for Pedestrian Gender Recognition. J. Frankl. Inst. 2018, 355, 1991–2008. [Google Scholar] [CrossRef]
Kong, L.; Liu, K.; Hu, X.; Zhang, N.; Qi, L.; Li, X.; Zhou, X. Gender classification based on spatio-frequency feature fusion of OCT fingerprint images in the IoT environment. IEEE Internet Things J. 2024, 11, 25731–25743. [Google Scholar] [CrossRef]
Li, X.; Makihara, Y.; Xu, C.; Yagi, Y. GaitAGE: Gait Age and Gender Estimation Based on an Age-and Gender-Specific 3D Human Model. IEEE Trans. Biom. Behav. Identity Sci. 2024, 7, 47–60. [Google Scholar] [CrossRef]
Antipov, G.; Berrani, S.; Ruchaud, N.; Dugelay, J.L. Learned vs. hand-crafted features for pedestrian gender recognition. In Proceedings of the ACM International Conference on Multimedia, Brisbane, Australia, 26–30 October 2015; pp. 1263–1266. [Google Scholar]
Jiang, S.; Ji, Q.; Shi, H.; Chen, C.; Xu, Y. Spatial correlation guided cross scale feature fusion for age and gender estimation. Sci. Rep. 2025, 15, 20873. [Google Scholar] [CrossRef]
Somani, A.; Sekh, A.; Prasad, D. Explain with Confidence: Fusing Saliency Maps for Faithful and Interpretable Weakly-Supervised Model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–20 October 2025; pp. 290–299. [Google Scholar]
Bekhouch, A.; Barhoumi, W.; Doghmane, N. Few Skeleton Features For Gate-Based Gender Recognition From Landmark Frames. In Proceedings of the IEEE 6th International Conference on Image Processing, Applications and Systems, Lyon, France, 9–11 January 2025; pp. 1–6. [Google Scholar]
Zheng, C.; Wu, W.; Chen, C.; Yang, T.; Zhu, S.; Shen, J.; Kehtarnavaz, N.; Shah, M. Deep learning-based human pose estimation: A survey. ACM Comput. Surv. 2023, 56, 1–37. [Google Scholar] [CrossRef]
Lee, D.; Jeong, M.; Jeong, S.; Jung, S.; Park, K. Estimation of fractal dimension and segmentation of body regions for deep learning-based gender recognition. Fractal Fract. 2024, 8, 551. [Google Scholar] [CrossRef]
Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
Szegedy, C.; Liu, W.; Jia, Y. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Heo, B.; Yun, S.; Han, D.; Chun, S.; Choe, J.; Oh, S. Rethinking spatial dimensions of vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 11936–11945. [Google Scholar]
LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Comput. 1989, 1, 541–551. [Google Scholar] [CrossRef]
Yin, X.; Liu, X. Multi-Task Convolutional Neural Network for Pose-Invariant Face Recognition. IEEE Trans. Image Process. 2017, 27, 964–975. [Google Scholar] [CrossRef] [PubMed]
He, R.; Wu, X.; Sun, Z.; Tan, T. Wasserstein cnn: Learning invariant features for nir-vis face recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 1761–1773. [Google Scholar] [CrossRef] [PubMed]
Zhu, J.; Zeng, H.; Liao, S.; Lei, Z.; Cai, C.; Zheng, L. Deep Hybrid Similarity Learning for Person Re-identification. IEEE Trans. Circuits Syst. Video Technol. 2018, 28, 3183–3193. [Google Scholar] [CrossRef]
Cho, Y.J.; Yoon, K.J. Pamm: Pose-aware multi-shot matching for improving person re-identification. IEEE Trans. Image Process. 2018, 27, 3739–3752. [Google Scholar] [CrossRef]
Zhu, J.; Zeng, H.; Huang, J.; Zhu, X.; Lei, Z.; Cai, C.; Zheng, L. Body Symmetry and Part-Locality-Guided Direct Nonparametric Deep Feature Enhancement for Person Reidentification. IEEE Internet Things J. 2020, 7, 2053–2065. [Google Scholar] [CrossRef]
Hou, J.; Zeng, H.; Cai, L.; Zhu, J.; Chen, J.; Ma, K.K. Multi-label Learning with Multi-label Smoothing Regularization for Vehicle Re-Identification. Neurocomputing 2019, 345, 15–22. [Google Scholar] [CrossRef]
Hou, J.; Zeng, H.; Zhu, J.; Hou, J.; Chen, J.; Ma, K.K. Deep Quadruplet Appearance Learning for Vehicle Re-Identification. IEEE Trans. Veh. Technol. 2019, 68, 8512–8522. [Google Scholar] [CrossRef]
Zhu, J.; Zeng, H.; Liao, S.; Lei, Z.; Cai, C.; Zheng, L. Vehicle Re-identification Using Quadruple Directional Deep Leaning Features. IEEE Trans. Intell. Transp. Syst. 2020, 21, 410–420. [Google Scholar] [CrossRef]
Lin, X.; Zeng, H.; Hou, J.; Cao, J.; Zhu, J.; Chen, J. Joint Pyramid Feature Representation Network for Vehicle Re-identification. Mob. Netw. Appl. 2020, 25, 1781–1792. [Google Scholar] [CrossRef]
Cao, J.; Zhu, J.; Hu, W.; Kummert, A. Epileptic Signal Classification with Deep EEG Features by Stacked CNNs. IEEE Trans. Cogn. Dev. Syst. 2019, 12, 709–722. [Google Scholar] [CrossRef]
Ng, C.B.; Tay, Y.H.; Goi, B.M. A convolutional neural network for pedestrian gender recognition. In International Symposium on Neural Networks; Springer: Berlin/Heidelberg, Germany, 2013; pp. 558–564. [Google Scholar]
Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105. [Google Scholar]
Luo, P.; Wang, X.; Tang, X. Pedestrian parsing via deep decompositional network. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 2648–2655. [Google Scholar]
Ng, C.B.; Tay, Y.H.; Goi, B.M. Pedestrian gender classification using combined global and local parts-based convolutional neural networks. Pattern Anal. Appl. 2019, 22, 1469–1480. [Google Scholar] [CrossRef]
Bei, S.; Deng, J.; Zhen, Z.; Shaojing, S. Gender Recognition via Fused Silhouette Features Based on Visual Sensors. IEEE Sens. J. 2019, 19, 9496–9503. [Google Scholar] [CrossRef]
Abbas, F.; Yasmin, M.; Fayyaz, M.; Asim, U. ViT-PGC: Vision transformer for pedestrian gender classification on small-size dataset. Pattern Anal. Appl. 2023, 26, 1805–1819. [Google Scholar] [CrossRef]
Mbongo, N.; Hambarde, K.; Proença, H. From One Domain to Another: The Pitfalls of Gender Recognition in Unseen Environments. Sensors 2025, 25, 4161. [Google Scholar] [CrossRef] [PubMed]
Ma, J.; Cheng, H.; Chen, H.; Zhang, Y.; Li, Y.; Shen, Y.; Wang, P.; Wang, W. Envelope rotation forest: A novel ensemble learning method for classification. Neurocomputing 2025, 618, 129059. [Google Scholar] [CrossRef]
Wang, Y.; Lei, Y.; Li, N.; Feng, K.; Wang, Z.; Tan, Y.; Li, H. Machinery multimodal uncertainty-aware RUL prediction: A stochastic modeling framework for uncertainty quantification and informed fusion. IEEE Internet Things J. 2025, 12, 31643–31653. [Google Scholar] [CrossRef]
Yin, S.; Jiang, L. Multi-method integration with confidence-based weighting for zero-shot image classification. arXiv 2024, arXiv:2405.02155. [Google Scholar]
Wang, C.; Xie, H. FCD-Net: Feature Decorrelation and Confidence-Driven Dynamic Fusion for Robust Pedestrian Recognition in Autonomous Driving. IEEE Trans. Intell. Transp. Syst. 2025, 26, 19468–19480. [Google Scholar] [CrossRef]
Fan, C.; Lu, Z.; Wei, W.; Tian, J.; Qu, X.; Chen, D.; Cheng, Y. On giant’s shoulders: Effortless weak to strong by dynamic logits fusion. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024; pp. 29986–30014. [Google Scholar]
Yang, W.; Chen, B.; Yu, L. Bayesian-wavelet-based multisource decision fusion. IEEE Trans. Instrum. Meas. 2021, 70, 2511810. [Google Scholar] [CrossRef]
Cai, J.; Zhang, M.; Yang, H.; He, Y.; Yang, Y.; Shi, C.; Zhao, X.; Xun, Y. A novel graph-attention based multimodal fusion network for joint classification of hyperspectral image and LiDAR data. Expert Syst. Appl. 2024, 249, 123587. [Google Scholar] [CrossRef]
Chen, Q.; Li, S.; Liu, Y.; Pan, S.; Webb, G.; Zhang, S. Uncertainty-aware graph neural networks: A multihop evidence fusion approach. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 18463–18477. [Google Scholar] [CrossRef]
Shaffi, N.; Subramanian, K.; Vimbi, V.; Hajamohideen, F.; Abdesselam, A.; Mahmud, M. Performance evaluation of deep, shallow and ensemble machine learning methods for the automated classification of Alzheimer’s disease. Int. J. Neural Syst. 2024, 34, 2450029. [Google Scholar] [CrossRef]
Xue, W.; Zhang, L.; Mou, X.; Bovik, A.C. Gradient Magnitude Similarity Deviation: A Highly Efficient Perceptual Image Quality Index. IEEE Trans. Image Process. 2014, 23, 684–695. [Google Scholar] [CrossRef]
Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
Bottou, L. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade; Springer: Berlin/Heidelberg, Germany, 2012; pp. 421–436. [Google Scholar]
Roxo, T.; Proença, H. YinYang-Net: Complementing face and body information for wild gender recognition. IEEE Access 2022, 10, 28122–28132. [Google Scholar] [CrossRef]
Soria, X.; Li, Y.; Rouhani, M.; Sappa, A. Tiny and efficient model for the edge detection generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 1364–1373. [Google Scholar]

Figure 1. Examples of pedestrian images selected from pedestrian attribute datasets: PA-100K [9] and PETA [10].

Figure 2. Examples of the extracted edge maps corresponding to the pedestrian images shown in Figure 1.

Figure 3. Schematic illustration of the proposed DFLN for pedestrian gender recognition. It is mainly composed of two modules, namely, the Parallel-CNN Prediction Probability Learning Module (PCNNM) and the Decision Fusion Module (DFM).

Figure 4. Comparison of confusion matrices among PCNNM-U, PCNNM-L and DFLN on the PETA [10] dataset.

Figure 5. Comparison of confusion matrices among PCNNM-U, PCNNM-L and DFLN on the PA-100K [9] dataset.

Figure 6. ROC comparison of different deep learning networks with and without DFM on the PA-100K [9] dataset. (a) VGGNet16-DFM vs. VGGNet16 [27]; (b) GoogleNet-DFM vs. GoogleNet [28]; (c) ResNet18-DFM vs. ResNet18 [29]; (d) ResNet50-DFM vs. ResNet50 [29]; (e) PiT-DFM vs. PiT [30]; (f) DFLN vs. PCNNM-U.

Figure 7. Examples of pedestrian images by misclassified PCNNM-U but correctly classified by PCNNM-L and the proposed DFLN. The first row shows the color pedestrian images, and the second row showcases the corresponding edge maps.

Figure 8. Examples of occlusion of the upper half (b) and the lower half (c) of the human body.

Table 1. The parameter configuration of the proposed DFLN.

Layer	Type	Filter Size	Number of Filters	Stride	Output Size
-	Input	-	-	-	(3∣1) × 128 × 48
C1	Convolution	3 × 3	32	1	(32∣32) × 128 × 48
	BN	-	-	-	(32∣32) × 128 × 48
	ReLU	-	-	-	(32∣32) × 128 × 48
C2	Convolution	3 × 3	32	1	(32∣32) × 128 × 48
	BN	-	-	-	(32∣32) × 128 × 48
	ReLU	-	-	-	(32∣32) × 128 × 48
M3	Maxpool	2 × 2	-	2	(32∣32) × 64 × 24
C4	Convolution	3 × 3	32	1	(32∣32) × 64 × 24
	BN	-	-	-	(32∣32) × 64 × 24
	ReLU	-	-	-	(32∣32) × 64 × 24
C5	Convolution	3 × 3	32	1	(32∣32) × 64 × 24
	BN	-	-	-	(32∣32) × 64 × 24
	ReLU	-	-	-	(32∣32) × 64 × 24
M6	Maxpool	2 × 2	-	2	(32∣32) × 32 × 12
C7	Convolution	3 × 3	32	1	(32∣32) × 30 × 10
	BN	-	-	-	(32∣32) × 30 × 10
	ReLU	-	-	-	(32∣32) × 30 × 10
C8	Convolution	3 × 3	32	1	(32∣32) × 28 × 8
	BN	-	-	-	(32∣32) × 28 × 8
	ReLU	-	-	-	(32∣32) × 28 × 8
A9	AVEpool	4 × 4	-	2	(32∣32) × 13 × 3
F10	FC	1 × 1	128	-	(128∣128) × 1 × 1
	ReLU	-	-	-	(128∣128) × 1 × 1
	Dropout	-	-	-	(128∣128) × 1 × 1
F11	FC	1 × 1	2	-	(2∣2) × 1 × 1
	Softmax	-	-	-	(2∣2) × 1 × 1
Concat	Concat	-	-	-	4 × 1 × 1
F12	FC	1 × 1	2	-	2 × 1 × 1
	Softmax	-	-	-	2 × 1 × 1

Table 2. Performance comparison of various pedestrian gender recognition methods on the PETA [10] and PA-100K [9] datasets. Best results are marked in bold.

Method	Dataset	PETA [10]		PA-100K [9]
Method	Year	Accuracy (%) ↑	AUC ↑	Accuracy (%) ↑	AUC ↑
YinYang-Net [61]	2022	90.9	0.96	90.6	0.96
MAMBA [48]	2025	90.1	0.96	79.8	0.87
PCNNM-U	-	$88.7 \pm 0.81$	$0.95 \pm 0.0067$	$88.4 \pm 0.49$	$0.94 \pm 0.0048$
PCNNM-L	-	$87.3 \pm 0.78$	$0.94 \pm 0.0050$	$87.1 \pm 0.41$	$0.92 \pm 0.0035$
$Proposed DFLN$	-	$91.5 \pm 0.43$	$0.97 \pm 0.0029$	$91.2 \pm 0.39$	$0.95 \pm 0.0030$

Table 3. Paired t-test comparisons of PCNNM-U vs. DFLN and PCNNM-L vs. DFLN on the PETA [10] and PA-100K [9] datasets.

p-Values	PETA [10]		PA-100K [9]		Significance
p-Values	Accuracy	AUC	Accuracy	AUC	Significance
PCNNM-U vs. DFLN	0.000584	0.007993	0.000465	0.001185	Yes
PCNNM-L vs. DFLN	0.000120	0.000798	0.006457	0.012338	Yes

Table 4. Performance comparison of different late-fusion methods on the PETA [10] and PA-100K [9] datasets. Best results are marked in bold.

Method	PETA [10]		PA-100K [9]
Method	Accuracy (%) ↑	AUC ↑	Accuracy (%) ↑	AUC ↑
Averaging Fusion	$89.8 \pm 0.25$	$0.95 \pm 0.0021$	$87.5 \pm 0.19$	$0.93 \pm 0.0011$
Logistic Regression	$90.3 \pm 0.32$	$0.96 \pm 0.0019$	$87.7 \pm 0.30$	$0.94 \pm 0.0024$
$Proposed DFLN$	$91.5 \pm 0.43$	$0.97 \pm 0.0029$	$91.2 \pm 0.39$	$0.95 \pm 0.0030$

Table 5. Paired t-test comparison of averaging fusion vs. DFLN and logistic regression vs. DFLN on the PETA [10] and PA-100K [9] datasets.

p-Values	PETA [10]		PA-100K [9]		Significance
p-Values	Accuracy	AUC	Accuracy	AUC	Significance
Averaging fusion vs. DFLN	0.006066	0.003891	0.001263	0.000499	Yes
Logistic regression vs. DFLN	0.015693	0.005978	0.002319	0.000893	Yes

Table 6. Performance comparison of different edge extraction methods on the PETA [10] and PA-100K [9] datasets. Best results are marked in bold.

Method	Dataset	PETA [10]		PA-100K [9]
Method	Edge Operator	Accuracy (%) ↑	AUC ↑	Accuracy (%) ↑	AUC ↑
PCNNM-L	Sobel	$86.8 \pm 0.75$	$0.94 \pm 0.0049$	$78.7 \pm 0.20$	$0.86 \pm 0.0021$
	Canny	$80.9 \pm 0.97$	$0.88 \pm 0.0067$	$75.1 \pm 0.31$	$0.82 \pm 0.0036$
	TEED [62]	$87.5 \pm 0.90$	$0.94 \pm 0.0059$	$84.7 \pm 0.32$	$0.92 \pm 0.0035$
	Prewitt	$87.3 \pm 0.78$	$0.94 \pm 0.0050$	$87.1 \pm 0.56$	$0.92 \pm 0.0045$
$Proposed DFLN$	Sobel	$90.5 \pm 0.57$	$0.96 \pm 0.0033$	$80.9 \pm 0.56$	$0.88 \pm 0.0039$
	Canny	$89.2 \pm 0.86$	$0.95 \pm 0.0053$	$79.6 \pm 0.59$	$0.87 \pm 0.0065$
	TEED [62]	$91.0 \pm 0.93$	$0.97 \pm 0.0062$	$86.3 \pm 0.38$	$0.93 \pm 0.0028$
	Prewitt	$91.5 \pm 0.43$	$0.97 \pm 0.0029$	$91.2 \pm 0.39$	$0.95 \pm 0.0030$

Table 7. AUCs and parameter counts of different deep learning networks with and without DFM on the PA-100K dataset. Best results are marked in bold.

Methods	AUC	Parameters (M)
VGGNet16 [27]	$0.88 \pm 0.0058$	128.05
VGGNet16-DFM	$0.91 \pm 0.0042$	256.10
GoogleNet [28]	$0.89 \pm 0.0049$	5.34
GoogleNet-DFM	$0.92 \pm 0.0036$	10.68
ResNet18 [29]	$0.92 \pm 0.0045$	10.66
ResNet18-DFM	$0.93 \pm 0.0037$	21.32
ResNet50 [29]	$0.92 \pm 0.0040$	22.42
ResNet50-DFM	$0.94 \pm 0.0029$	44.85
PiT [43]	$0.89 \pm 0.0053$	4.59
PiT-DFM	$0.94 \pm 0.0033$	9.18
PCNNM-U	$0.92 \pm 0.0048$	2.49
Proposed DFLN	$0.95 \pm 0.0030$	4.97

Table 8. Average inference time (in milliseconds) for a pedestrian image from the PETA [10] dataset.

Methods	Inference Time (ms/Image)
PCNNM-U	8.195
PCNNM-L	7.525
DFLN	19.878

Table 9. Prediction probability derived by PCNNM-U, PCNNM-L, and the DFLN on the examples of pedestrian images shown in Figure 7.

	1st	2nd	3rd	4th	5th	6th
Method	(Label: 0)	(Label: 0)	(Label: 1)	(Label: 1)	(Label: 1)	(Label: 0)
PCNNM-U	0.557	0.568	0.158	0.314	0.240	0.837
PCNNM-L	0.268	0.017	0.998	0.622	0.826	0.032
DFLN	0.306	0.172	0.767	0.517	0.810	0.344

Table 10. Performance comparison before and after class balancing on the PETA [10] dataset.

Methods	Accuracy (%) ↑	AUC ↑
PCNNM-U	$88.7 \pm 0.81$	$0.95 \pm 0.0067$
PCNNM-U#	$89.6 \pm 0.85$	$0.96 \pm 0.0071$
PCNNM-L	$87.3 \pm 0.78$	$0.94 \pm 0.0050$
PCNNM-L#	$88.0 \pm 0.66$	$0.94 \pm 0.0043$
Proposed DFLN	$91.5 \pm 0.43$	$0.97 \pm 0.0029$
Proposed DFLN#	$92.2 \pm 0.50$	$0.97 \pm 0.0021$

# indicates models that were trained on the PETA dataset after class balancing.

Table 11. Performance comparison on the PETA [10] dataset before and after the introduction of occlusion.

Methods	Accuracy (%) ↑	AUC ↑
PCNNM-U	$88.7 \pm 0.81$	$0.95 \pm 0.0067$
PCNNM-L	$87.3 \pm 0.78$	$0.94 \pm 0.0050$
Proposed DFLN	$91.5 \pm 0.43$	$0.97 \pm 0.0029$
PCNNM-U ^$	$80.3 \pm 1.66$	$0.87 \pm 0.0125$
PCNNM-L ^$	$78.1 \pm 1.02$	$0.84 \pm 0.0087$
Proposed DFLN ^$	$81.9 \pm 0.91$	$0.89 \pm 0.0064$
PCNNM−U *	$87.9 \pm 0.84$	$0.94 \pm 0.0076$
PCNNM−L *	$86.3 \pm 0.79$	$0.93 \pm 0.0056$
Proposed DFLN *	$90.6 \pm 0.68$	$0.95 \pm 0.0051$

$ indicates that the models were trained on pedestrian images with the upper body occluded. * indicates that the models were trained on pedestrian images with the lower body occluded.

Table 12. Performance comparison of different pedestrian gender recognition methods in a cross-dataset evaluation. Best results are marked in bold.

Method	PA-100K (Train) → PETA (Test)		PETA (Train) → PA-100K (Test)
Method	Accuracy (%) ↑	AUC ↑	Accuracy (%) ↑	AUC ↑
YinYang-Net [61]	65.6	0.71	61.7	0.60
MAMBA [48]	63.1	0.68	67.0	0.73
PCNNM-U	$66.8 \pm 2.86$	$0.73 \pm 0.0646$	$65.2 \pm 1.97$	$0.68 \pm 0.0521$
PCNNM-L	$65.9 \pm 2.64$	$0.72 \pm 0.0463$	$64.1 \pm 1.95$	$0.67 \pm 0.0403$
$Proposed DFLN$	$67.6 \pm 1.47$	$0.74 \pm 0.0296$	$69.0 \pm 0.82$	$0.75 \pm 0.0081$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Cai, L.; Zheng, H.; Ruan, F.; Chen, F.; Xiang, W.; Lin, Q.; Shi, Y. Adaptive Decision Fusion in Probability Space for Pedestrian Gender Recognition. Appl. Sci. 2026, 16, 3640. https://doi.org/10.3390/app16083640

AMA Style

Cai L, Zheng H, Ruan F, Chen F, Xiang W, Lin Q, Shi Y. Adaptive Decision Fusion in Probability Space for Pedestrian Gender Recognition. Applied Sciences. 2026; 16(8):3640. https://doi.org/10.3390/app16083640

Chicago/Turabian Style

Cai, Lei, Huijie Zheng, Fang Ruan, Feng Chen, Wenjie Xiang, Qi Lin, and Yifan Shi. 2026. "Adaptive Decision Fusion in Probability Space for Pedestrian Gender Recognition" Applied Sciences 16, no. 8: 3640. https://doi.org/10.3390/app16083640

APA Style

Cai, L., Zheng, H., Ruan, F., Chen, F., Xiang, W., Lin, Q., & Shi, Y. (2026). Adaptive Decision Fusion in Probability Space for Pedestrian Gender Recognition. Applied Sciences, 16(8), 3640. https://doi.org/10.3390/app16083640

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Adaptive Decision Fusion in Probability Space for Pedestrian Gender Recognition

Abstract

1. Introduction

2. Related Work

2.1. Pedestrian Gender Recognition Based on Hand-Crafted Features

2.2. Pedestrian Gender Recognition Based on Deep Learning–Derived Features

2.3. Decision-Level Fusion

3. Proposed Decision Fusion Learning Network

3.1. Parallel CNN Prediction Probability Learning Module

3.2. Decision Fusion Module

3.3. Complementary Behavior Analysis

4. Experimental Results and Analyses

4.1. Dataset and Evaluation Protocol

4.2. Implementation Details

4.3. Performance Comparison and Analysis

4.3.1. Comparison with State of the Art

4.3.2. Justification of Edge Extraction

4.3.3. Effectiveness of Decision Fusion Module

4.3.4. Inference Time Analysis

4.3.5. Misclassified Sample Analysis

4.3.6. Robustness Analysis

4.4. Cross-Dataset Evaluation

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI