1. Introduction
Since gender is often used as a soft biometric cue, pedestrian gender recognition, which aims to classify pedestrian images into predefined gender categories, has become a widely studied problem in smart-city-related computer vision applications [1,2,3]. It is expected to have broad application prospects [4,5,6,7,8], such as the following: (1) Human–computer interaction: intelligent machines (e.g., autonomous vehicles) can leverage gender information to provide more appropriate interaction and communication. (2) Intelligent video applications: restricting pedestrians’ access to areas (e.g., dormitories) that allow only a specific gender is often required for public safety. (3) Multimedia retrieval systems: gender is an important label for indexing, annotating, and searching massive collections of visual materials (images/videos). (4) Demographic collection: automated shopping statistics and behavior analysis of males and females are very useful in various marketing and business scenarios (e.g., advertisement). In this work, we follow the common setting in existing pedestrian datasets and focus on binary gender labels.
In practical scenarios, pedestrian images are usually captured by different kinds of cameras (e.g., high/low resolution) under various environments (e.g., long distance, indoor/outdoor) without pedestrians’ cooperation. Some examples of pedestrian images selected from pedestrian attribute datasets, such as PA-100K [9] and PETA [10], are shown in Figure 1. It can be easily observed that crucial challenges, such as illumination changes, viewpoint variations, diverse appearances, and arbitrary postures, are frequently encountered in pedestrian images. Therefore, it remains difficult to develop an effective method for accurately recognizing the gender of a pedestrian from images captured in practical scenarios. To address this problem, several pedestrian attribute datasets, such as the abovementioned PA-100K [9] and PETA [10], have been constructed to facilitate the learning and evaluation of pedestrian gender recognition methods. Promoted by these datasets, multiple gender recognition methods have been developed by exploring various features (e.g., HOG [11], PixelHOG [12], biologically inspired features [13], gait features [14,15], and elliptic Fourier descriptors [16]) and a wide variety of methodologies (e.g., AdaBoost [11], SVM [12,13], sparse reconstruction [15], and deep learning [17,18,19,20]). The related works are organized and highlighted in Section 2.
For pedestrian gender recognition, different image modalities of the same pedestrian can provide complementary information. Intuitively, a color pedestrian image (see Figure 1) provides an appearance-centered view that emphasizes detailed information such as texture and clothing, while an edge map (see Figure 2) provides a structure-centered view that captures rich contour and structural information on the human body. These two views can be regarded as complementary to one another. However, existing deep learning–based approaches mostly focus on learning discriminative feature representations from a single image modality [17,21], typically color images. Furthermore, although several works [18,22] have begun to explore different fusion methods for pedestrian gender recognition, these approaches predominantly focus on feature representation learning, and fusion is often implemented using simple strategies such as feature concatenation. Moreover, in recent years, the research community has increasingly explored more advanced structural modalities such as saliency maps [23], skeleton representations [24], pose estimation [25], and semantic segmentation [26]. These modalities indeed offer richer high-level semantic priors. However, edge maps were chosen for the present study because they are more methodologically aligned with its scope. First, edge maps provide a structure-centered representation that is relatively insensitive to illumination, color variation, and background clutter, which are common challenges in pedestrian imagery. Second, unlike pose or segmentation models, which require additional supervision and complex pre-trained detectors, edge extraction does not introduce task-specific bias or additional annotation dependencies.
With these considerations, we present in this paper an adaptive decision fusion framework for pedestrian gender recognition, called the Decision Fusion Learning Network (DFLN). The proposed DFLN starts with a Parallel CNN Prediction Probability Learning Module (PCNNM), which consists of two parallel convolutional neural networks that independently learn posterior class probability vectors from color pedestrian images and their corresponding edge maps. In addition to the PCNNM, a learnable Decision Fusion Module (DFM) is designed to take the modality-specific prediction probabilities as input, concatenate them into a joint probability vector, and adaptively fuse them via a fully connected layer and a softmax operation to achieve accurate pedestrian gender recognition. The proposed DFLN can simultaneously learn the feature representations and carry out adaptive decision fusion in an end-to-end manner, in contrast to existing multimodal fusion strategies, where the decision fusion and the predictions of different branches are optimized independently or combined through pre-defined fusion rules. Furthermore, a backpropagation-based derivation is formalized to demonstrate how the predictions from PCNNM-U and PCNNM-L interact and thus to indicate their complementary behavior, which provides essential support for the feasibility and interpretability of our proposed adaptive decision fusion framework.
In summary, the main contributions of this article are as follows:
We present an adaptive decision fusion framework for pedestrian gender recognition. By operating directly on the prediction probabilities learned in parallel from color pedestrian images and their corresponding edge maps, the proposed framework can fully exploit the complementary merits of these modality-specific probabilities for accurate pedestrian gender recognition.
We mathematically analyze the complementary behavior of the proposed decision fusion mechanism (see Section 3), which demonstrates the feasibility and interpretability of our proposed DFLN for pedestrian gender recognition.
Extensive experimental results on two large-scale pedestrian attribute benchmarks, namely, PETA and PA-100K, show that the proposed DFLN achieves competitive or superior performance compared with multiple state-of-the-art pedestrian gender recognition methods. Moreover, by inserting the proposed DFM into several well-known neural networks (e.g., VGGNet [27], GoogleNet [28], ResNet [29], and PiT [30]), we achieve consistent performance improvements for pedestrian gender recognition.
The remainder of this paper is organized as follows. Section 2 introduces related works. Section 3 describes the proposed DFLN in detail. Section 4 presents the experimental results and analyses. Finally, Section 5 concludes the paper.
3. Proposed Decision Fusion Learning Network
In this paper, we present an adaptive decision fusion framework, called the Decision Fusion Learning Network (DFLN), which operates in the probability space to enhance pedestrian gender recognition. Figure 3 provides a schematic illustration of the proposed DFLN, which mainly consists of two learning-capable modules: the Parallel CNN Prediction Probability Learning Module (PCNNM) and the Decision Fusion Module (DFM), which are introduced individually in the following two subsections.
3.1. Parallel CNN Prediction Probability Learning Module
Considering that the gender prediction probabilities derived from a color pedestrian image and its corresponding edge map offer two complementary views, i.e., an appearance-centered view emphasizing texture and clothing attributes and a structure-centered view capturing body contours and global shape, a Parallel CNN Prediction Probability Learning Module (PCNNM) is first constructed to independently learn the prediction probabilities from the original color pedestrian image and its edge map. As shown in Figure 3, the PCNNM is composed of two compact and lightweight CNNs. Each CNN comprises six convolutional layers (i.e., C1, C2, C4, C5, C7, and C8), two max-pooling layers (i.e., M3 and M6), one average-pooling layer (i.e., A9), one fully connected layer (i.e., F10) with 128 neuron units, and one fully connected layer (i.e., F11) with two neuron units corresponding to the gender (i.e., male/female) classes.
In the input layer, the color pedestrian images (three channels) and their corresponding edge maps (a single channel) are fed to the upper and lower parts of the PCNNM, respectively. The edge map $E$ of the input pedestrian image is computed as the root mean square of the image directional gradients in the horizontal and vertical directions [58], that is:

$$E = \sqrt{\left( I \otimes G_h \right)^2 + \left( I \otimes G_v \right)^2}$$

where $\otimes$ represents the convolution operation; $I$ denotes the raw pixel pedestrian image; and $G_h$ and $G_v$ are the $3 \times 3$ horizontal and vertical Prewitt filters, which can be defined as follows:

$$G_h = \begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix}, \qquad G_v = \begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix}$$

Some examples of the extracted edge maps corresponding to the pedestrian images in Figure 1 are presented in Figure 2.
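As a concrete illustration, the edge extraction step can be sketched in a few lines of NumPy/SciPy. The function name, the grayscale conversion, and the gradient-magnitude form of the root-mean-square definition are our own choices for this sketch, not the paper's implementation:

```python
import numpy as np
from scipy.ndimage import convolve

# 3x3 Prewitt filters: horizontal (G_H) and vertical (G_V) directions.
G_H = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=float)
G_V = G_H.T

def edge_map(image: np.ndarray) -> np.ndarray:
    """Single-channel edge map from the horizontal/vertical Prewitt
    gradient responses, in gradient-magnitude form."""
    # Average color channels to grayscale if a 3-channel image is given.
    gray = image.mean(axis=2) if image.ndim == 3 else image.astype(float)
    d_h = convolve(gray, G_H, mode="nearest")
    d_v = convolve(gray, G_V, mode="nearest")
    return np.sqrt(d_h ** 2 + d_v ** 2)
```

As expected, the response vanishes on flat regions and peaks along intensity boundaries such as the pedestrian silhouette.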
In the convolutional layers (i.e., C1, C2, C4, C5, C7, and C8), the number of filters is set to 32, as keeping the number of filters low helps avoid overfitting. Moreover, small filters are applied to save filter parameters. To retain the image details as much as possible, the stride of each convolutional layer is set to 1. To retain the edge information and the size of the feature maps, a padding operation is applied in the first four convolutional layers (i.e., C1, C2, C4, and C5). A batch normalization (BN) [59] operation is used after all convolution operations to accelerate network training. In the pooling layers (i.e., M3, M6, and A9), max-pooling and average-pooling operations are applied, respectively, and the stride is uniformly set to 2. In the F10 layer, 128 neuron units are fully connected to the A9 layer. Next, dropout [43] is enabled to set the output of each neuron to zero with a probability of 0.5 to avoid overfitting. The output layer (i.e., F11) has two neuron units fully connected to the F10 layer, and softmax activation units are applied in the F11 layer to form a binary classifier. Each softmax activation unit is associated with a class label, and its output value gives the posterior probability of the corresponding gender class. More details of the recommended configuration of the PCNNM can be found in Table 1.
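To verify the layer configuration, the spatial dimensions flowing through one PCNNM branch can be traced with a small helper. The 3×3 convolution kernels, 2×2 pooling windows, and the example input size below are assumptions for illustration; the recommended configuration is given in Table 1:

```python
# Layer spec for one PCNNM branch: (name, kernel, stride, padding).
# Padding is applied only in C1, C2, C4, and C5, as described above.
PCNNM_BRANCH = [
    ("C1", 3, 1, 1), ("C2", 3, 1, 1), ("M3", 2, 2, 0),
    ("C4", 3, 1, 1), ("C5", 3, 1, 1), ("M6", 2, 2, 0),
    ("C7", 3, 1, 0), ("C8", 3, 1, 0), ("A9", 2, 2, 0),
]

def out_size(size: int, k: int, s: int, p: int) -> int:
    """Standard output-size formula for a convolution or pooling layer."""
    return (size + 2 * p - k) // s + 1

def trace_shapes(h: int, w: int):
    """Return the spatial shape of the input and after each layer."""
    shapes = [(h, w)]
    for _name, k, s, p in PCNNM_BRANCH:
        h, w = out_size(h, k, s, p), out_size(w, k, s, p)
        shapes.append((h, w))
    return shapes
```

For example, under these assumptions a 64×48 input shrinks to 6×4 at the A9 output, which is then flattened into the 128-unit F10 layer.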
3.2. Decision Fusion Module
The output prediction probabilities (posterior class probabilities) from the upper part of the PCNNM (denoted as PCNNM-U) and the lower part of the PCNNM (denoted as PCNNM-L) naturally define a probability space where different prediction streams can be fused at the decision level. For that purpose, on the basis of the PCNNM, a learning-capable Decision Fusion Module (DFM) is designed; this module takes the modality-specific probabilities as input, concatenates them into a joint probability vector, and fuses them via a fully connected layer and softmax operation, yielding an adaptive decision fusion mechanism in the probability space.
As shown in Figure 3, the probabilities of the gender classes derived from the PCNNM are fed into the DFM, in which a Concat layer is used to combine them. Consequently, the output probability vector $\mathbf{p}_c$ is composed of four elements and can be formulated as follows:

$$\mathbf{p}_c = \left[ p_u^1, \; p_u^2, \; p_l^1, \; p_l^2 \right]^\top$$

where $p_u^1$ and $p_u^2$ denote the output posterior probabilities of the corresponding gender classes (i.e., $c_1$, $c_2$) from PCNNM-U with the color pedestrian image $I$ as input, while $p_l^1$ and $p_l^2$ denote those from PCNNM-L with the edge map $E$ as input. Then, the outputs of the Concat layer are fused together by performing a fully connected operation and a softmax activation to obtain the final probability vector $\mathbf{p}_f$, which contains two elements:

$$\mathbf{p}_f = \mathrm{softmax}\left( \mathbf{z} \right)$$

where

$$\mathbf{z} = \mathbf{W} \mathbf{p}_c + \mathbf{b}$$

in which $\mathbf{b}$ denotes the biases and $\mathbf{W}$ represents the learnable fusion weights.
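This fusion step can be sketched in a few lines of NumPy; the function names are ours, and $\mathbf{W}$ is assumed to be a 2×4 matrix mapping the joint probability vector to the two gender classes:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-d vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def dfm_fuse(p_u: np.ndarray, p_l: np.ndarray,
             W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Concatenate the two branch posteriors into a joint 4-d vector
    and fuse them with a fully connected layer followed by softmax."""
    p_c = np.concatenate([p_u, p_l])  # joint probability vector, shape (4,)
    z = W @ p_c + b                   # fully connected layer: W is (2, 4)
    return softmax(z)                 # final probability vector p_f, shape (2,)
```

For instance, with averaging weights `W = 0.5 * [[1, 0, 1, 0], [0, 1, 0, 1]]` and zero bias, the module reduces to a fixed mean-of-posteriors rule; during training, `W` and `b` are instead learned so that the fusion adapts to the reliability of each branch.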
3.3. Complementary Behavior Analysis
Building on the Decision Fusion Module (DFM), the proposed DFLN can exploit the complementary merits of the modality-specific prediction probabilities learned by PCNNM-U and PCNNM-L from the color pedestrian image and the edge map, respectively. This can be observed in both the forward propagation and backpropagation processes of the proposed DFLN, as mathematically demonstrated below.
First, according to Equations (3)–(6), it can be clearly seen that the final probability vector $\mathbf{p}_f$ is subject to the output probability vectors $\mathbf{p}_u$ and $\mathbf{p}_l$ resulting from PCNNM-U and PCNNM-L in the forward propagation process.
Second, in the backpropagation process, the proposed DFLN is designed to minimize the objective function $\mathcal{L}$, which is solved by employing Stochastic Gradient Descent (SGD) [60]. Let $w_{ij}^{(n)}$ be the weight connecting the $j$th neural unit in the $(n-1)$th layer and the $i$th neural unit in the $n$th layer, while $a_i^{(n)} = h\left(z_i^{(n)}\right)$, where $z_i^{(n)} = \sum_j w_{ij}^{(n)} a_j^{(n-1)} + b_i^{(n)}$ and $h$ represents an activation function. Here, taking layer F11 in PCNNM-U as an example, the update step of the weight $w_{ij}^{(\mathrm{F11})}$ can be formulated as

$$w_{ij}^{(\mathrm{F11})} \leftarrow w_{ij}^{(\mathrm{F11})} - \eta \frac{\partial \mathcal{L}}{\partial w_{ij}^{(\mathrm{F11})}}$$

where $\eta$ denotes the learning rate, and

$$\frac{\partial \mathcal{L}}{\partial w_{ij}^{(\mathrm{F11})}} = \delta_i^{(\mathrm{F11})} \, a_j^{(\mathrm{F10})}$$

where $a_j^{(\mathrm{F10})}$ is the $j$th element of the output feature from layer F10 in PCNNM-U, while

$$\delta_i^{(\mathrm{F11})} = \frac{\partial \mathcal{L}}{\partial z_i^{(\mathrm{F11})}} = \sum_{k} \frac{\partial \mathcal{L}}{\partial p_f^{k}} \frac{\partial p_f^{k}}{\partial p_u^{i}} \frac{\partial p_u^{i}}{\partial z_i^{(\mathrm{F11})}}$$

where $p_f^{k}$ is the $k$th element in $\mathbf{p}_f$ and $p_u^{i}$ is the $i$th posterior probability output by PCNNM-U. Therefore, it can be found that, through $\delta_i^{(\mathrm{F11})}$, the value of $w_{ij}^{(\mathrm{F11})}$ in PCNNM-U will be affected by the final probability vector $\mathbf{p}_f$ in the backpropagation process. Note that the same conclusion can be drawn for PCNNM-L.
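This backpropagation path can also be checked numerically: under a cross-entropy objective on the fused output, the finite-difference gradient of the loss with respect to either branch's output probabilities is nonzero, confirming that the error signal from the fused decision reaches both branches. The weight values in the usage below are arbitrary illustrative choices:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-d vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def fused_loss(p_u, p_l, W, b, y):
    """Cross-entropy of the fused prediction against the true class y."""
    p_f = softmax(W @ np.concatenate([p_u, p_l]) + b)
    return -np.log(p_f[y])

def grad_wrt_branch(p_u, p_l, W, b, y, eps=1e-6):
    """Central finite-difference gradient of the fused loss with respect
    to one branch's output probabilities (here PCNNM-U's)."""
    g = np.zeros_like(p_u)
    for i in range(p_u.size):
        d = np.zeros_like(p_u)
        d[i] = eps
        g[i] = (fused_loss(p_u + d, p_l, W, b, y)
                - fused_loss(p_u - d, p_l, W, b, y)) / (2 * eps)
    return g
```

With generic fusion weights, every component of this gradient is nonzero, mirroring how the error propagates through the fused probability vector into each branch in the chain-rule derivation above.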
The above analyses indicate that the PCNNM-U and PCNNM-L, which individually learn the modality-specific prediction probabilities from the pedestrian images and edge maps, are capable of playing complementary roles in the forward propagation and backpropagation processes through the proposed DFM. Consequently, the proposed DFLN method is able to deliver better performance in pedestrian gender recognition.
5. Conclusions
In this paper, a learnable adaptive decision fusion method, called the Decision Fusion Learning Network (DFLN), is proposed for pedestrian gender recognition. The superior performance of the DFLN is mainly attributable to an adaptive decision fusion strategy that operates in the probability space and effectively exploits the complementary merits of the modality-specific probabilities learned from color pedestrian images and their edge maps. Specifically, the DFLN is composed of a Parallel CNN Prediction Probability Learning Module (PCNNM) and a Decision Fusion Module (DFM). The PCNNM consists of two parallel convolutional neural networks that independently learn posterior class probability vectors from pedestrian images and their corresponding edge maps, while the learning-capable DFM takes these modality-specific probabilities as input, concatenates them into a joint probability vector, and adaptively fuses them via a fully connected layer and softmax operation to achieve accurate pedestrian gender recognition. Extensive experiments on two large-scale pedestrian attribute benchmarks, namely, PETA and PA-100K, demonstrate that the DFLN achieves competitive or superior performance compared with multiple state-of-the-art pedestrian gender recognition methods. Moreover, when the proposed DFM was inserted into several well-known neural networks, it was verified to consistently improve their performance for pedestrian gender recognition.
Although the proposed DFLN yields promising performance, pedestrian gender recognition remains challenging in scenarios characterized by high inter-class similarity, such as similar clothing styles and body shapes. These factors may obscure discriminative cues and increase recognition ambiguity. To mitigate these limitations, the proposed DFLN could be further refined by incorporating joint feature–probability interaction. For example, instead of operating solely on posterior probabilities, the model could incorporate intermediate semantic embeddings to guide the fusion weights, enabling context-aware modulation when structural or appearance cues become less informative. On the other hand, although the complementary behavior analysis in this paper provides essential evidence of the feasibility and interpretability of our proposed adaptive decision fusion approach, no solid theoretical justification has been established to explain why fusion in the probability space should be preferable to alternative fusion spaces for pedestrian gender recognition. Our future research will therefore focus on resolving this limitation.