In order to address the challenges of the newly established cross-domain PAR task, we further propose a new baseline method, namely local domain discriminator-based cross-domain pedestrian attribute recognition (LDCD_PAR). According to the relationship between human body parts and attribute positions, the local domain discriminator is designed to align the global and local features of the source and target domains synchronously, ensuring that the distance between source-domain and target-domain features is narrowed at a more fine-grained level. The network architecture of our LDCD_PAR is depicted in
Figure 4, which consists of three main components: (1) A feature extractor, (2) An attribute classifier, and (3) A local domain discriminator. Firstly, the feature extractor extracts the pedestrian features from the input images. Then, the features are passed through the attribute classifier and the local domain discriminator to obtain attribute predictions and domain predictions, respectively. LDCD_PAR aims to minimize the distribution discrepancy across domains through adversarial training between the feature extractor and the local domain discriminator.
4.1. Feature Extractor and Attribute Classifier
In LDCD_PAR, we adopt a CNN as the feature extractor to extract image features. During the training stage, a labeled source image $x^s$ from the source domain dataset $\mathcal{D}^s$ and an unlabeled target image $x^t$ from the target domain dataset $\mathcal{D}^t$ are input into the feature extractor E to extract the source domain feature $f^s$ and the target domain feature $f^t$, respectively. The attribute classifier C is implemented as a fully-connected layer whose output dimensionality matches the total number of attributes in the dataset. The source domain feature $f^s$ is input into the attribute classifier C to obtain the output logits $z^s$. Then, a sigmoid function $\sigma(\cdot)$ is adopted to get the predicted probability for each attribute, indicated as $p^s = \sigma(z^s)$. Together with the source domain attribute labels $y^s$, the classification loss can be calculated as:

$$ \mathcal{L}_{cls} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{l=1}^{m} w_l \left( y_{i,l}\log p_{i,l} + \left(1 - y_{i,l}\right)\log\left(1 - p_{i,l}\right) \right) $$
where n is the number of training samples, m is the number of attributes, and $w_l$ is the weight to alleviate the distribution imbalance between attributes. $p_{i,l}$ represents the predicted probability of the l-th attribute in the i-th image, and $y_{i,l}$ represents the ground-truth label of the l-th attribute in the i-th image. By minimizing the classification loss on the source domain data, the parameters of the feature extractor E and the attribute classifier C are jointly optimized, enabling the classifier C to accurately recognize the attributes.
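The classification loss above can be sketched in NumPy as follows. This is a minimal illustration: the exact form of the per-attribute weights $w_l$ in our implementation may differ, and uniform weights are used here only for the toy example.

```python
import numpy as np

def weighted_bce_loss(probs, labels, weights):
    """Weighted binary cross-entropy over n samples and m attributes.

    probs   : (n, m) predicted probabilities p_{i,l} after the sigmoid
    labels  : (n, m) ground-truth attribute labels y_{i,l} in {0, 1}
    weights : (m,)   per-attribute weights w_l against class imbalance
    """
    eps = 1e-7  # avoid log(0)
    p = np.clip(probs, eps, 1.0 - eps)
    per_term = weights * (labels * np.log(p) + (1 - labels) * np.log(1 - p))
    # sum over attributes, average over samples, negate
    return -per_term.sum(axis=1).mean()

# Toy example: n = 2 samples, m = 3 attributes, uniform weights.
probs = np.array([[0.9, 0.2, 0.7],
                  [0.1, 0.8, 0.6]])
labels = np.array([[1, 0, 1],
                   [0, 1, 1]])
weights = np.ones(3)
loss = weighted_bce_loss(probs, labels, weights)
```

Note that the loss approaches zero as the predicted probabilities approach the ground-truth labels, which is the behavior the optimization of E and C relies on.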
4.2. Local Domain Discriminator
In the classical UDA method, the features of the source domain and the target domain, together with their domain labels $d_i$, are input into a domain discriminator D. This domain discriminator then outputs the predicted domain probability $\hat{d}_i = D(f_i)$, which means that the i-th sample belongs to the source domain if the probability is close to 1, or it belongs to the target domain if the probability is close to 0. With this probability, the domain discrimination loss can be calculated as:

$$ \mathcal{L}_{d} = -\frac{1}{n_s + n_t}\left( \sum_{i=1}^{n_s} \log D\!\left(f_i^s\right) + \sum_{j=1}^{n_t} \log\left(1 - D\!\left(f_j^t\right)\right) \right) $$

where $n_s$ is the number of samples in the source domain, and $n_t$ is the number of samples in the target domain. The feature extractor E tries to produce two sufficiently similar feature distributions to confuse the domain discriminator D, which requires optimizing its parameters to maximize the loss of the domain discriminator D. The adversarial training process can be formulated as:

$$ \min_{E,\,C}\;\max_{D}\;\; \mathcal{L}_{cls} - \lambda\,\mathcal{L}_{d} $$

where $\lambda$ is a hyper-parameter that balances the classification loss and the domain discrimination loss.
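As a sketch, the domain discrimination loss can be written in NumPy as follows. This is a minimal illustration; in practice, the min-max game between E and D is typically implemented with a gradient reversal layer, which is not shown here.

```python
import numpy as np

def domain_loss(d_src, d_tgt):
    """Binary cross-entropy domain discrimination loss.

    d_src : (n_s,) discriminator outputs D(f^s) for source features (domain label 1)
    d_tgt : (n_t,) discriminator outputs D(f^t) for target features (domain label 0)
    """
    eps = 1e-7
    d_src = np.clip(d_src, eps, 1.0 - eps)
    d_tgt = np.clip(d_tgt, eps, 1.0 - eps)
    n = len(d_src) + len(d_tgt)
    return -(np.log(d_src).sum() + np.log(1.0 - d_tgt).sum()) / n

# A discriminator that separates the two domains well yields a low loss ...
loss_sep = domain_loss(np.array([0.9, 0.8]), np.array([0.2, 0.3]))
# ... while a fully confused discriminator (all outputs 0.5) yields log(2).
loss_confused = domain_loss(np.full(4, 0.5), np.full(4, 0.5))
```

The feature extractor E succeeds when it drives the discriminator toward the confused regime, i.e., toward the higher loss value.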
However, this kind of domain discriminator can only make the image features of the source and target domains domain-invariant at the global level. In the cross-domain PAR task, the data distribution is much more complicated. For example, as shown in Figure 5, images of the attribute “backpack” in different datasets are significantly distinct from one another in terms of background, lighting, viewing angle and resolution, leading to clear distribution differences. Moreover, pedestrian attributes are often spatially distributed across the entire image: attributes such as “hat” are typically located near the top of the image, while attributes such as “shoes” are typically found near the bottom. Thus, relying solely on global similarity is insufficient to fully exploit fine-grained information, and it is crucial to consider local similarities among image features in the cross-domain PAR task. To achieve this, in addition to the global domain discriminator, we propose a local domain discriminator, which focuses on specific local regions of the image that are directly related to pedestrian attributes, so as to achieve a more fine-grained alignment of features between the source and target domains.
As shown in the lower part of Figure 4, the idea of the local domain discriminator is to first segment the global feature $f_i$ of the image to obtain K local region features. The global feature and the K local features are then input into the domain discriminator D to obtain K + 1 domain prediction results, from which the global domain discrimination loss $\mathcal{L}_d^g$ and the local domain discrimination loss $\mathcal{L}_d^l$ are calculated, respectively. The global domain discrimination loss $\mathcal{L}_d^g$ can be calculated as Equation (6). Let k denote the index of a local region in the feature map, and let K be the total number of local regions. The local domain discrimination loss $\mathcal{L}_d^l$ can then be calculated as:

$$ \mathcal{L}_{d}^{l} = -\frac{1}{K\left(n_s + n_t\right)} \sum_{k=1}^{K} \left( \sum_{i=1}^{n_s} \log D\!\left(f_{i,k}^s\right) + \sum_{j=1}^{n_t} \log\left(1 - D\!\left(f_{j,k}^t\right)\right) \right) $$

where $f_{i,k}^s$ represents the k-th local feature of the i-th image in the source domain, and $f_{j,k}^t$ represents the k-th local feature of the j-th image in the target domain. Based on $\mathcal{L}_d^g$ and $\mathcal{L}_d^l$, the total domain discrimination loss $\mathcal{L}_D$ can be calculated as the average of the global and local domain discrimination losses:

$$ \mathcal{L}_{D} = \frac{1}{2}\left( \mathcal{L}_d^g + \mathcal{L}_d^l \right) $$
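The local loss and the averaged total loss can be sketched together in NumPy as follows, assuming the discriminator outputs for each local region of a batch are already available (a shared discriminator D across regions is one possible design; the global loss value below is a placeholder for illustration):

```python
import numpy as np

def local_domain_loss(d_src_parts, d_tgt_parts):
    """Local domain discrimination loss averaged over K parts.

    d_src_parts : (K, n_s) outputs D(f_{i,k}^s) for the source local features
    d_tgt_parts : (K, n_t) outputs D(f_{j,k}^t) for the target local features
    """
    eps = 1e-7
    s = np.clip(d_src_parts, eps, 1.0 - eps)
    t = np.clip(d_tgt_parts, eps, 1.0 - eps)
    n = s.shape[1] + t.shape[1]
    per_part = -(np.log(s).sum(axis=1) + np.log(1.0 - t).sum(axis=1)) / n
    return per_part.mean()  # average over the K local regions

# Toy example with K = 2 parts and 2 samples per domain.
l_local = local_domain_loss(np.array([[0.9, 0.8], [0.7, 0.6]]),
                            np.array([[0.2, 0.3], [0.4, 0.1]]))
# Total loss: the average of the global and local terms.
l_global = 0.2271  # placeholder for the global loss of the same batch
l_total = 0.5 * (l_global + l_local)
```

With K = 1 and a single "part" covering the whole feature map, the local loss reduces exactly to the global formulation.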
Conceptually, the global discriminator encourages holistic (image-level) domain alignment by matching overall feature distributions, whereas the proposed local discriminator promotes part-aware (region-level) alignment by matching the feature distributions of semantically corresponding body regions, which is more suitable for attributes that consistently appear in specific spatial locations. In this case, it is important to decide how to acquire local features from an image. A classical strategy is the PCB [
35], which divides feature maps evenly into horizontal stripes for pooling local features, as illustrated in
Figure 6a. However, this strategy only works well when the pedestrian covers the entire image. In many cases, pedestrians occupy only a limited portion of the image, so the same body part may fall into different local features in different images. For example, as shown in Figure 6a, the head of the third pedestrian falls into the second local feature, whereas it should normally be in the first one.
To avoid this issue, we adopt a more reasonable strategy of Part Aligned Pooling (PAP) [
36] to split the feature maps into local ones, in order to ensure that the same pedestrian parts are contained within the same local features. The PAP strategy integrates human posture analysis. By analyzing human posture information and image content, the whole feature map is divided into several horizontal regions of different sizes, and each region corresponds to a specific part of the pedestrian, as shown in
Figure 6b. Specifically, an image $x$ is first input into the feature extractor E to obtain its global feature map $F$, and a human keypoint model trained on COCO [37] is then applied to the feature map to detect 17 predefined keypoints of the pedestrian in the image. After that, the detected human keypoints are adopted by the PAP to locate human parts and obtain K local feature masks closely related to human parts, such as “HEAD”, “UPPER_TORSO”, “LOWER_TORSO”, “UPPER_LEG”, “LOWER_LEG” and “SHOES”. Based on these local feature masks, the global feature map can be split into K local features $\{f_k\}_{k=1}^{K}$. These local features are then input into the local domain discriminator to obtain the local domain discrimination losses.
From an implementation perspective, PAP can be realized as a deterministic mask-generation and masked-pooling operator driven by human keypoints. Given an input image $x$, the backbone extracts a feature map $F \in \mathbb{R}^{C \times H \times W}$. In parallel, a human keypoint detector pre-trained on COCO is applied to the image to obtain 17 keypoints (nose, eyes, ears, shoulders, elbows, wrists, hips, knees and ankles). To stabilize the part boundaries, we first fuse symmetric keypoints using a simple function that averages the y-coordinates of a left/right keypoint pair (e.g., left/right shoulder), producing fused y-coordinates for the shoulders, hips, knees and ankles. We then derive K part intervals along the vertical axis and generate a binary mask $M_k$ for each part ($k = 1, \dots, K$) by setting pixels within the corresponding fused-coordinate range to 1 and the others to 0. Figure 7 summarizes this program-level workflow. Finally, each local feature is obtained by masked average pooling over the shared feature map.
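A minimal sketch of this mask-generation and masked-pooling pipeline is given below (NumPy; the boundary rows and the three-part split are illustrative placeholders rather than the exact intervals produced by the keypoint model):

```python
import numpy as np

def pap_masks(H, boundaries):
    """Build K binary row masks from K + 1 vertical boundaries (in rows).
    In PAP, the boundaries come from fused keypoint y-coordinates."""
    K = len(boundaries) - 1
    masks = np.zeros((K, H))
    for k in range(K):
        masks[k, boundaries[k]:boundaries[k + 1]] = 1.0
    return masks

def masked_avg_pool(feature_map, row_mask):
    """Average-pool a (C, H, W) feature map over the rows where row_mask == 1."""
    C, H, W = feature_map.shape
    weights = row_mask[None, :, None]  # broadcast over channels and width
    return (feature_map * weights).sum(axis=(1, 2)) / (row_mask.sum() * W)

# Toy example: 2-channel, 6-row feature map; hypothetical fused boundaries
# place the head in rows 0-1, the torso in rows 2-3 and the legs in rows 4-5.
fmap = np.arange(48, dtype=float).reshape(2, 6, 4)
masks = pap_masks(6, [0, 2, 4, 6])
head_feat = masked_avg_pool(fmap, masks[0])
```

Unlike the uniform stripes of PCB, the boundaries here move with the detected keypoints, so the same body part lands in the same local feature across images.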
By maximizing the domain discrimination loss of each local feature, we can obtain more fine-grained domain-invariant features. This enables the classifier to more effectively identify pedestrian attributes of the images from the target domain.
4.4. Time Efficiency and Computational Complexity
From a computational perspective, LDCD_PAR was designed to introduce only a lightweight overhead on top of the baseline PAR framework. In all our experiments we reuse exactly the same CNN backbone as StrongBaseline (ResNet-50), so the dominant cost of both training and inference is the convolutional feature extraction, which remains unchanged. The additional components of LDCD_PAR consist only of the part-based pooling (PAP) module and the local domain discriminator head.
Let d denote the dimension of the global feature vector, which is likewise the dimension of each of the K local part features produced by PAP, and let the domain discriminator be an L-layer multilayer perceptron. The extra floating-point operations introduced by the local discriminator then scale on the order of $\mathcal{O}\big((K+1)\,L\,d^2\big)$, which is negligible compared with the convolutional backbone, whose complexity scales with the spatial resolution and the number of channels of the feature maps. Moreover, the global feature and the K local features are all obtained from a single shared feature map, and the resulting feature vectors can be processed by the discriminator in parallel. Consequently, the increase in wall-clock training time introduced by LDCD_PAR is expected to be modest, even in our PAP_6P configuration. Empirically, in our implementation, adding the local domain discriminator branch increases the training time by approximately 20% compared with StrongBaseline under the same backbone, input resolution, batch size, and training schedule. Since the added computation mainly comes from the PAP operator and a small MLP discriminator head on top of the shared backbone, this overhead is largely a fixed extra cost, and the relative percentage increase in training time is expected to be smaller when using a more complex backbone than the StrongBaseline/ResNet-50 setting. It is also worth emphasizing that the domain discriminator is only used during training for adversarial domain alignment. At test time, when the trained model is applied to the target domain (including UAV-based aerial imagery), only the feature extractor and the attribute classifier are evaluated, while the domain discriminator branch is discarded. Therefore, the inference-time computational complexity and latency of LDCD_PAR are essentially identical to those of StrongBaseline.
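A back-of-the-envelope check of this claim, under assumed sizes (d = 2048 for ResNet-50 features, L = 3, K = 6; the ResNet-50 cost figure is a rough approximation):

```python
# Estimate the cost of the extra discriminator head relative to the backbone.
# Assumed sizes: d = 2048 (ResNet-50 feature dim), L = 3 MLP layers, K = 6 parts.
d, L, K = 2048, 3, 6
disc_macs = (K + 1) * L * d * d    # multiply-accumulates for K + 1 feature vectors
backbone_macs = 4.1e9              # approximate ResNet-50 cost at 224x224 input
ratio = disc_macs / backbone_macs  # fraction of the backbone cost
```

Under these assumptions, the discriminator head amounts to roughly 2% of the backbone's per-image cost, consistent with the overhead being dominated by fixed, parallelizable components.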
This design makes LDCD_PAR amenable to deployment in scenarios with limited computational resources, such as remote sensor and UAV-based platforms.