1. Introduction
Person Re-ID is a challenging task due to variations in human appearance, apparel, and posture, as well as potential occlusions. Its primary objective is to recognize individuals across different images or video frames by leveraging their visual attributes. Although the field has witnessed significant advances, there are still considerable opportunities to bolster the robustness and generalizability of these systems. Models such as the Part-based Convolutional Baseline (PCB) [1] have explored part-based feature extraction but remain limited. These limitations mainly arise from the part-localization strategy and the challenges associated with detailed part-level feature representation. The PCB model partitions CNN feature maps spatially to derive distinct parts, which can lead to issues with within-part consistency [1]; the method assumes that pixels within a well-located part should be similar to each other and dissimilar from pixels in other parts. When a pixel-wise feature vector in one part is more similar to another part, the partitioning is inappropriate. The refined part pooling technique proposed alongside PCB reassigns these outliers to the closest parts, thereby improving within-part consistency; this process requires no part labels, and the model is trained in a weakly supervised manner. This strategy may still struggle with inaccuracies in part localization, as it might not always capture the discriminative characteristics necessary for distinguishing individuals based on standard body structures, such as the shape of the arms or legs.
Similarly, while the Harmonious Attention Network (HA-CNN) [2] employs spatial attention maps to focus on specific areas within an image, it does not explicitly address human body parts. Although the network can highlight informative image regions, it may not effectively isolate and analyze distinct body parts, which are crucial for Person Re-ID in scenarios with high variability in human posture and appearance. The Pose-Driven Deep Convolutional (PDC) model [3] makes advances in handling pose-based variations by isolating different body parts and learning distinctive features. However, the model still depends on additional pose data, which introduces an extra layer of complexity. These limitations underscore the need for a framework that tightly combines attention mechanisms with part-specific analysis.
It is important to note that the novelty of this work does not lie in introducing entirely new individual components such as part attention, human parsing, correlation analysis, or hybrid loss functions, as these have been explored in prior studies. Instead, the contribution of this work lies in the coordinated DPBF framework, which integrates parsing-guided SPD, cross-part spatial modulation, correlation-aware feature integration, and hybrid supervision into a unified pipeline for occluded Person Re-ID.
This paper introduces a coordinated parsing-guided architecture for Person Re-ID that combines attention-guided strategies and part-based feature representations through a local–global integration framework. Unlike heuristic part-based approaches, all components of the proposed architecture are optimized jointly through backpropagation. The end-to-end system employs parsing-guided semantic attention maps for localized body-region discrimination, a module named Salient Part Discrimination (SPD). SPD builds upon part-based feature extraction [1], which provides a coarse segmentation based on expected body parts. It then refines this segmentation using weakly supervised human parsing labels generated by a pretrained Self-Correction for Human Parsing (SCHP) model [4], without requiring manual part annotations, while adaptively emphasizing informative body regions to improve identity-preserving discrimination under occlusion.
Despite these advantages, training such models presents challenges, particularly due to the variability in pose and occlusion, which can negatively affect the generation of accurate human parsing labels. Moreover, different poses or partial occlusions can lead to incorrect segmentation, reducing the accuracy of the parsing, which affects the overall results.
Therefore, the SPD integrates an attention mechanism with a global feature extractor, producing a spatial feature map to refine the segmentation. This architecture generates attention weights tailored for Person Re-ID and renders a heatmap for each body part, enabling the model to identify and emphasize pivotal image regions during feature learning. Consequently, the SPD enhances part representations without relying solely on existing parsing label annotations.
Following part-level feature extraction, the integration of local and global features extracted from human body images is essential. Local or foreground features, which capture fine-grained details, such as clothing texture, complement global features that provide an overall representation of appearance. Nevertheless, relying solely on local or global features may reduce robustness against viewpoint, illumination, and occlusion.
Despite recent progress in feature fusion and aggregation techniques [5], challenges related to feature overlap and the balanced integration of feature representations remain. To address these limitations, this study introduces the Adaptive Feature Integration and Contextual Fusion (AFICF) module. Beginning with a correlation-aware refinement process using statistical techniques such as Pearson correlation coefficients (PCC) and Principal Component Analysis (PCA), AFICF evaluates feature complementarity and redundancy prior to adaptive feature integration. The proposed framework demonstrates competitive performance in localizing discriminative body regions and improving ranking performance under occlusion and appearance variations across multiple Person Re-ID benchmarks.
3. Proposed Method
The proposed DPBF framework consists of two coordinated functional modules: the SPD and AFICF modules. The SPD module generates enhanced part-level representations by emphasizing discriminative body regions through attention-guided mechanisms, while AFICF performs the adaptive integration of part-specific and global contextual features to reduce redundancy and improve feature complementarity. Although SPD and AFICF are inspired by existing concepts such as part-based attention, feature integration, and statistical correlation analysis [1,5], they are coordinated within DPBF rather than applied independently. Specifically, the SPD module generates discriminative representations that directly guide the context-aware aggregation process in AFICF, enabling more effective handling of occlusion challenges and redundant descriptors for robust Person Re-ID.
3.1. Salient Part Discrimination (SPD)
The SPD module is developed to generate discriminative and identity-aware feature embeddings for each person instance, building upon PCB [1]. This is realized through a Part Attention Mechanism (PAM) that computes part-specific attention maps for $K$ predefined body parts, as illustrated in Figure 1. The overall attention-guided design is inspired by the attention mechanisms proposed in [22,23]. This mechanism is executed in two primary stages:
3.1.1. Generating Attention Maps for Each Body Part
A Grad-CAM-based strategy [24] is employed to generate attention maps for each of the $K$ predefined body parts. These parts are defined using human parsing annotations, allowing the SPD to initialize attention maps aligned with semantically consistent body regions. Figure 1 illustrates this process and the final attention map generation procedure. By dividing the human body into $K$ parts, the objective is to capture unique features from each region, amplifying the model’s feature separability as proposed in [1]. However, relying solely on predefined parts has its challenges. Because these regions are static, they might not accommodate variations resulting from different poses or occlusions. Body-part segmentation errors may occur under extreme pose variations or occlusion, leading to the loss or misalignment of certain components, which impairs the model’s ability to extract reliable identity-aware features, particularly for anatomical regions such as the legs and feet. To enhance the robustness of this process, a two-step sequential strategy was followed, although some of these limitations may still persist:
Step 1: The SPD module serves as the primary parsing-guided discrimination stage within DPBF, focusing on semantic body-region localization to improve identity-aware feature extraction under occlusion and appearance variations. The input feature map $F$ is first transformed into $F'$ using a 1 × 1 CNN layer. A lightweight neural network generates attention weights $\mathbf{w}$, which are reshaped into $K$ spatial attention maps $\{A_1, \dots, A_K\}$ representing the predefined body parts, consistent with the methodology proposed in [25]. Each attention map $A_k$ highlights the most salient regions of the corresponding body-part representation $P_k$, for $k \in \{1, \dots, K\}$. The effectiveness of SPD is demonstrated qualitatively in Figure 1, where attention maps highlight informative and discriminative regions, allowing the model to focus on identity-relevant visual attributes [26].
Subsequently, a sequence of CNN layers followed by a global average pooling (GAP) layer and a fully connected (FC) layer with $K$ output nodes forms the attention-generation pipeline. The GAP computes the average response of each feature channel across the spatial dimensions, producing a compact feature vector of length $C$. To generate spatially localized attention maps, a softmax function is applied over the spatial dimensions $H$ and $W$, following the attention-guided strategy in [25] and the spatial attention principle used in Spatial Transformer Networks (STNs) [27]. STNs introduce a learnable module that enables CNNs to focus on informative image regions while improving transformation invariance.
Step 2: Human parsing labelling mechanisms based on SCHP are implemented to refine the segmentation. This phase concentrates on areas subject to variations due to poses and occlusions. Given that SCHP usually requires more computational resources, it focuses on refining particular regions rather than re-evaluating the whole image. SCHP, derived from an external pose estimation model and improved through self-correction [4], allows for a nuanced understanding of the human form, concentrating on areas rich in information and generating parsing labels. Generating human parsing labels remains challenging due to human pose variations and occlusions.
This necessitates robust methods to improve the parsing accuracy for effective part-discriminative feature extraction. To address these challenges, SPD incorporates an attention-guided adjustment process based on SCHP. This component employs an attention mechanism and a global feature extractor to create a distinct attention map for each body part. These maps are initially informed by labels derived from SCHP. When human parsing labels are available, they aid SPD in generating precise part-specific attention maps, denoted as $A_k$. These labels typically correspond to discrete body parts, including the head, torso, arms, and legs; the exact set depends on the dataset, annotations, and framework strategy.
Given an input image $I$ and its corresponding human parsing label $M$ obtained from the pose estimation model, the SPD computes a part-specific attention map $A_k$ for each body part:
$$A_k = f_k(I, M),$$
where $f_k$ is a learned function that generates attention maps based on the input image $I$ and the human parsing labels $M$. To generate the initial labels, the SCHP framework is integrated within the SPD attention mechanism. This process can be represented as follows:
$$M_0 = \mathcal{H}(I),$$
where $\mathcal{H}$ denotes the human parsing model, $I$ is the input image, and $M_0$ is the initial human parsing label. The correction network $\mathcal{G}$ learns to correct the initial predictions by minimizing the errors between the predicted labels $\hat{M}$ and ground truth labels $M^{\text{gt}}$. This process can be formulated as follows:
$$\hat{M} = \mathcal{G}(M_0), \qquad \min_{\mathcal{G}} \; \mathcal{L}_{\text{corr}}\big(\hat{M}, M^{\text{gt}}\big),$$
where $\mathcal{L}_{\text{corr}}$ denotes the correction loss used to optimize the correction network, and $\hat{M}$ represents the corrected labels produced by the correction network $\mathcal{G}$. After training, the refined labels are updated as follows:
$$M \leftarrow \hat{M}.$$
These refined labels are subsequently used to retrain the human parsing model iteratively until convergence.
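The spatial softmax described in Step 1 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `spatial_softmax`, the `(K, H, W)` layout, and the random input scores are assumptions made for illustration only.

```python
import numpy as np


def spatial_softmax(logits):
    """Normalize each part's score map over its H x W positions.

    logits: array of shape (K, H, W), one raw score map per body part
            (hypothetical layout chosen for this sketch).
    Returns an array of the same shape whose entries are non-negative
    and sum to 1 over the spatial positions of each part.
    """
    k, h, w = logits.shape
    flat = logits.reshape(k, h * w)
    # Subtract the per-part maximum for numerical stability.
    flat = flat - flat.max(axis=1, keepdims=True)
    exp = np.exp(flat)
    return (exp / exp.sum(axis=1, keepdims=True)).reshape(k, h, w)


# Toy input: K = 3 parts on a 4 x 2 spatial grid.
rng = np.random.default_rng(0)
scores = rng.normal(size=(3, 4, 2))
attention = spatial_softmax(scores)
```

Normalizing over the spatial positions, rather than over the part axis, makes each map a distribution over locations, so every part competes only with itself for where to attend.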
3.1.2. Cross-Part Spatial Feature Modulation Map (CP-SFMM)
The CP-SFMM module focuses on enhancing identity discrimination. This is achieved through element-wise multiplication (E-WM) with the original feature map to strengthen the distinctiveness of body-part representations in various contexts, especially under occlusion. This technique extends the attention-guided strategy introduced in [2] to spatially modulated features, ensuring that attention is paid to the most relevant areas for accurate identification. The process involves two main steps:
Informative regions are first enhanced through E-WM, followed by the aggregation of the refined feature maps into a unified representation. The attention maps $A_k$, generated in the previous step, are employed to compute part-specific feature maps by E-WM with the original feature map $F$ [28]:
$$P_k = A_k \odot F, \qquad k = 1, \dots, K.$$
The E-WM operation calculates the product of the corresponding elements in the attention maps and the input feature map, as illustrated in Figure 2. The resulting part-specific feature maps have the same dimensions as the input feature map $F$ (i.e., $H \times W \times C$).
After computing the part-specific feature maps for each body part, the maps $P_k$ are concatenated along the channel dimension, with dimensions 32 × 32 × 512 ( = 128 × 3), to obtain a combined feature representation that captures the spatial information of all the body parts, as illustrated in Figure 2 and Figure 3. This combined feature representation can then be processed by the feature fusion and aggregation module, represented as follows:
$$P_{\text{cat}} = \mathrm{Concat}(P_1, \dots, P_K).$$
In summary, the CP-SFMM step merges the part-specific feature maps generated by the SPD into a single feature map. This unified feature map contains discriminative information from all predefined body parts, which is crucial for creating a robust representation of the person’s appearance for identification tasks.
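The two CP-SFMM steps (element-wise modulation, then channel-wise concatenation) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions: the helper names `part_features` and `concat_parts`, the `(H, W, C)` and `(K, H, W)` layouts, and the toy inputs are hypothetical and not taken from the paper.

```python
import numpy as np


def part_features(feature_map, attention_maps):
    """Modulate the feature map with each part's attention map.

    feature_map:    (H, W, C) backbone features.
    attention_maps: (K, H, W), one spatial map per body part.
    Returns (K, H, W, C): each part map keeps the input dimensions.
    """
    # Broadcast each (H, W) attention map across the C channels.
    return attention_maps[..., np.newaxis] * feature_map[np.newaxis, ...]


def concat_parts(parts):
    """Concatenate K part maps of shape (H, W, C) along the channel axis."""
    return np.concatenate(list(parts), axis=-1)


# Toy example: 2 x 2 grid, 3 channels, 2 parts.
F = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)
A = np.stack([np.ones((2, 2)), np.zeros((2, 2))])  # part 0 keeps all, part 1 keeps none
P = part_features(F, A)
combined = concat_parts(P)  # shape (2, 2, 6)
```

With an all-ones map the first part reproduces the input features, and with an all-zeros map the second part is fully suppressed, which makes the effect of the modulation easy to inspect.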
It should be noted that the SPD builds upon existing part-based attention and parsing-guided techniques. Compared with conventional parsing-guided Re-ID methods that primarily use parsing labels for static body-region supervision, the SPD integrates parsing-guided attention refinement with cross-part spatial feature modulation to strengthen discriminative feature learning under occlusion and pose variations. Therefore, the contribution of SPD in this work lies in its coordinated role within the overall DPBF pipeline rather than as a standalone novel attention mechanism.
For notation consistency throughout the manuscript, the semantic body-part feature representations generated by SPD are denoted as $P_k$, where $k$ corresponds to a predefined body region. In the experimental analysis, visualizations, and part-wise performance evaluations, the corresponding labels (e.g., p0–p8) are used to identify individual body-part embeddings. Unless otherwise stated, both notations refer to the same semantic body-part feature representation at different stages of analysis and presentation.
3.2. Adaptive Feature Integration and Contextual Fusion (AFICF)
Multiple feature representations are extracted from human body images, including foreground local features and global features. Relying solely on global or local features may reduce robustness against viewpoint, illumination, and occlusion variations. Previous studies have emphasized improving interaction and aggregation in networks such as [5,29,30,31]. These strategies fuse and aggregate local and global features to enhance discriminative representation learning, often using feature fusion and aggregation mechanisms. However, other recent studies indicate that feature fusion and aggregation may introduce redundancy when integrated descriptors are highly correlated, negatively impacting the model’s efficiency. In addition, maintaining the balance between complementary local and global representations remains challenging, as overemphasizing one can lead to underutilization of the other [5,29,30,31,32].
Two key strategies are therefore considered: employing advanced correlation-aware filtering and coordinated fusion mechanisms. To address these limitations, AFICF adaptively calibrates interactions between local–global body-part cues and holistic contextual embeddings prior to fusion, thereby reducing redundant feature incorporation and improving contextual feature consistency under occlusion conditions. Rather than directly concatenating local and global representations, AFICF evaluates inter-feature correlations to preserve informative identity cues and reduce unnecessary redundancy during aggregation. Pearson Correlation Coefficient (PCC) analysis and Principal Component Analysis (PCA) are incorporated as a statistical adjustment process within the integration stage to support feature decorrelation and compact representation learning. These operations are applied as statistical feature refinement procedures rather than standalone trainable layers or independent post-processing operations. This strategy was motivated by preliminary observations that the direct fusion of local and global representations may introduce redundant information and an inconsistent semantic scaling, which can negatively affect discriminative feature learning.
The PCC is employed to measure the linear correlation between two variables. The correlation between each pair of features, $u$ and $v$, is estimated using this formula:
$$r_{uv} = \frac{\sum_{i=1}^{n} (u_i - \bar{u})(v_i - \bar{v})}{\sqrt{\sum_{i=1}^{n} (u_i - \bar{u})^2}\,\sqrt{\sum_{i=1}^{n} (v_i - \bar{v})^2}},$$
where $r_{uv}$ denotes the correlation coefficient between feature representations $u$ and $v$, while $u_i$ and $v_i$ represent the corresponding feature samples, and $\bar{u}$ and $\bar{v}$ are the means of the $u$ and $v$ samples. A high absolute value of $r_{uv}$ (close to 1 or −1) indicates a strong linear relationship between the two features, suggesting redundancy. Features with a high correlation can be considered redundant and are candidates for elimination because they provide similar information. The AFICF retains one feature from each group of highly correlated features and eliminates the rest to reduce redundancy, as shown in Figure 4.
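The retain-one-per-group rule described above can be sketched as a greedy filter over the Pearson correlation matrix. The sketch is illustrative only: the function name `drop_correlated`, the threshold of 0.9, the greedy left-to-right order, and the synthetic data are assumptions, since the text does not specify how correlated groups are formed or which threshold is used.

```python
import numpy as np


def drop_correlated(features, threshold=0.9):
    """Greedy PCC-based redundancy filter (threshold is an assumed value).

    features: (n_samples, n_features) matrix with no constant columns.
    A feature is kept only if its absolute Pearson correlation with every
    previously kept feature is below `threshold`.
    Returns the list of kept column indices.
    """
    corr = np.corrcoef(features, rowvar=False)  # (n_features, n_features)
    kept = []
    for j in range(features.shape[1]):
        if all(abs(corr[j, i]) < threshold for i in kept):
            kept.append(j)
    return kept


# Synthetic data: columns 1 and 3 are linear functions of column 0.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, 2 * a + 1, b, -a])
kept = drop_correlated(X)
```

Using the absolute value means that a perfectly anti-correlated feature (here `-a`) is treated as redundant as well, matching the "close to 1 or −1" criterion in the text.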
Following correlation refinement, PCA is applied to generate a compact feature representation while preserving the dominant discriminative variance of the integrated feature space. PCA is utilized here as a dimensionality-reduction mechanism to suppress residual redundancy and improve feature compactness before the final feature fusion. The covariance relationship between feature dimensions can be represented as follows:
$$\Sigma = \frac{1}{n-1} \sum_{i=1}^{n} (z_i - \bar{z})(z_i - \bar{z})^{\top},$$
where $z_i$ denotes the $i$-th refined feature vector, $\bar{z}$ is the mean feature vector, and $n$ is the number of samples.
The principal components associated with the highest eigenvalues are retained to construct the final compact feature representation used during adaptive feature integration.
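A minimal sketch of this PCA step, using an eigendecomposition of the sample covariance matrix, is given below. The function name `pca_reduce`, the number of retained components, and the synthetic input are assumptions for illustration; the text does not state how many components DPBF retains.

```python
import numpy as np


def pca_reduce(features, n_components):
    """Project features onto the top principal components.

    features: (n_samples, n_features).
    Returns (projected, variances): the (n_samples, n_components)
    projection and the eigenvalues of the retained components.
    """
    centered = features - features.mean(axis=0)
    cov = np.cov(centered, rowvar=False)        # sample covariance (ddof=1)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    return centered @ components, eigvals[order]


# Synthetic features with very different variances per dimension.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
Z, variances = pca_reduce(X, 2)
```

`np.linalg.eigh` is used because the covariance matrix is symmetric; the variance of each projected column equals the corresponding retained eigenvalue.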
3.3. Hybrid Loss Function (HLF)
The DPBF framework employs a Hybrid Loss Function (HLF) that combines cross-entropy and triplet loss. Unlike the conventional implementation of these loss functions, DPBF employs HLF not merely as a training objective but also as a component supporting feature integration between local and global representations. Studies such as [8,32,33] have demonstrated the effectiveness of such hybrid losses in enhancing spatial feature distribution and improving Re-ID accuracy. These findings are consistent with DPBF’s performance, which leverages HLF to reduce feature-level discrepancies and optimize identity-aware representations for complex occluded scenarios. First, the cross-entropy loss is typically used for identity classification tasks. It works well with holistic embeddings because it encourages the model to learn discriminative features for identifying different individuals. The cross-entropy loss for a classification task is typically represented as follows:
$$\mathcal{L}_{\text{CE}} = -\sum_{c=1}^{N} y_c \log p_c,$$
where $N$ is the number of classes, $y_c$ is the true label (1 for the correct class and 0 otherwise), and $p_c$ is the predicted probability for class $c$. Second, the triplet loss is applied to the part-based embeddings to reduce the distance between the anchor and the positive embeddings while increasing the separation from the negative embeddings [26]. This is especially useful in scenarios where the person might be partially occluded or when only parts of the person are visible. The triplet loss is formulated as follows:
$$\mathcal{L}_{\text{tri}} = \max\big(d(a, p) - d(a, n) + m,\; 0\big),$$
where $d(a, p)$ is the distance between the anchor sample and the positive sample (same person), $d(a, n)$ is the distance between the anchor sample and the negative sample (different person), and $m$ is a predefined constant that enforces a minimum separation between positive and negative pairs. The HLF combines these two objectives through a weighted sum:
$$\mathcal{L}_{\text{HLF}} = \alpha\, \mathcal{L}_{\text{CE}} + \beta\, \mathcal{L}_{\text{tri}},$$
where α and β denote weighting coefficients controlling the relative contribution of the cross-entropy and triplet loss components during optimization. In this work, both components were equally weighted to maintain balanced optimization between global identity discrimination and part-level feature separability.
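The weighted combination of the two objectives can be illustrated for a single sample and a single triplet. This is a sketch under stated assumptions, not the authors' training code: the function names, the margin of 0.3, the use of Euclidean distance inside the triplet term, and the value 1.0 for both weights are illustrative choices (the text states only that the two terms are equally weighted).

```python
import numpy as np


def cross_entropy(probs, label):
    """Cross-entropy for one sample with a one-hot target.

    With a one-hot label, -sum_c y_c * log(p_c) reduces to -log(p_label).
    """
    return float(-np.log(probs[label]))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge-style triplet loss with Euclidean distance (assumed metric)."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return float(max(d_ap - d_an + margin, 0.0))


def hybrid_loss(ce, tri, alpha=1.0, beta=1.0):
    """Weighted sum of the two terms; equal weights assumed here."""
    return alpha * ce + beta * tri


probs = np.array([0.7, 0.2, 0.1])
anchor = np.array([0.0, 0.0])
positive = np.array([0.0, 1.0])
easy_negative = np.array([3.0, 4.0])   # already far from the anchor
hard_negative = np.array([0.0, 1.1])   # almost as close as the positive

ce = cross_entropy(probs, label=0)
tri_easy = triplet_loss(anchor, positive, easy_negative)
tri_hard = triplet_loss(anchor, positive, hard_negative)
total = hybrid_loss(ce, tri_hard)
```

The easy negative already satisfies the margin and contributes no loss, whereas the hard negative violates it, so only hard triplets drive the part-level term.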
The integration of triplet loss and cross-entropy in the HLF is essential to enhance the interaction between local and global features during training. Specifically, the cross-entropy term is aimed at the optimization of global-identity-level features by encouraging the network to distinguish between individuals according to overall embeddings. At the same time, the triplet loss puts an emphasis on part-level discrimination so that semantically consistent body parts are clustered together for the same identity while remaining separated across different identities. The twofold aim ensures that global and local features are jointly optimized in a complementary manner. Consequently, when these representations are transferred within AFICF, they are already well-aligned and discriminative, thus enabling the adaptive integration mechanism to more effectively eliminate redundancy and capture complementary information. The effectiveness of this strategy is further demonstrated in the experimental analysis.
AFICF does not claim novelty in the individual use of correlation analysis or dimensionality-reduction techniques. Instead, its contribution lies in integrating correlation-aware feature refinement within a coordinated local and global feature integration strategy designed to improve feature complementarity and robustness under occlusion conditions within the DPBF framework.
Algorithm 1 summarizes the end-to-end training and inference workflow of the proposed DPBF framework.
| Algorithm 1. End-to-End Training and Inference Procedure of DPBF |
| Input: |
| Training images $\{I_n\}$, identity labels $\{y_n\}$, parsing masks $\{M_n\}$, and the number of predefined semantic body-part representations $K$. |
| Output: |
| Optimized DPBF model and final Person Re-ID ranking results. |
| Training Stage |
| 1. Extract backbone feature maps from each input image using the CNN backbone. |
| 2. Generate parsing-guided semantic body-part representations using the Salient Part Discrimination (SPD) module. |
| 3. Construct local body-part embeddings and global contextual embeddings. |
| 4. Refine local-global feature interactions using the Adaptive Feature Integration and Contextual Fusion (AFICF) module. |
| 5. Suppress redundant feature components using Pearson Correlation Coefficient (PCC)-based refinement. |
| 6. Generate compact discriminative feature embeddings using Principal Component Analysis (PCA)-based refinement. |
| 7. Fuse the refined local and global feature representations. |
| 8. Optimize the DPBF framework using the Hybrid Loss Function (HLF). |
| Inference Stage |
| 1. Extract normalized feature embeddings from query and gallery images using the trained DPBF model. |
| 2. Compute Euclidean distances between query and gallery embeddings. |
| 3. Rank gallery samples according to ascending distance values. |
| 4. Return the final Person Re-ID retrieval results. |
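The inference stage of Algorithm 1 (normalize, compute Euclidean distances, rank in ascending order) can be sketched as follows. The helper names `l2_normalize` and `rank_gallery` and the two-dimensional toy embeddings are assumptions for illustration; the sketch also assumes L2 normalization and that no embedding is the zero vector.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit Euclidean length (rows must be non-zero)."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def rank_gallery(query, gallery):
    """Rank gallery embeddings for each query by ascending distance.

    query:   (Q, D) embeddings; gallery: (G, D) embeddings.
    Returns (ranking, dists): ranking[q] lists gallery indices from the
    closest to the farthest, dists is the (Q, G) distance matrix.
    """
    q = l2_normalize(query)
    g = l2_normalize(gallery)
    dists = np.linalg.norm(q[:, np.newaxis, :] - g[np.newaxis, :, :], axis=2)
    return np.argsort(dists, axis=1), dists


query = np.array([[1.0, 0.0]])
gallery = np.array([[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]])
ranking, dists = rank_gallery(query, gallery)
```

In this toy case the second gallery entry points almost in the same direction as the query and is ranked first, while the entry pointing in the opposite direction is ranked last.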
6. Conclusions
The proposed DPBF framework demonstrates an improved capability to model variations in human appearance, enhancing robustness against challenges such as occlusions, pose variations, and illumination changes. Table 1 and Table 2 demonstrate the effectiveness of DPBF across both holistic and occlusion-oriented benchmarks, while Figure 5 further illustrates the robustness and relative contribution of individual body-part embeddings under challenging visibility conditions. By moving beyond traditional methods that rely on global feature representation, DPBF employs the coordinated integration of SPD and AFICF, enabling complementary interaction between local body-region representations and global contextual information. Its AFICF module merges part-specific features with a global context, enhancing feature complementarity and reducing the limitations commonly observed in conventional feature-fusion methods. Although the DPBF model demonstrates strong performance in Person Re-ID, several areas for further improvement have been identified, including performance degradation under extreme occlusion and low-resolution conditions, as illustrated in Figure 5 and Figure 6b and quantitatively confirmed by the embedding degradation analysis reported in Table 8. Additional reproducibility, computational-efficiency, and occlusion-stability analyses further demonstrated the robustness of the proposed DPBF framework, showing stable performance across independent training runs, competitive computational complexity and evaluation efficiency, and effective mitigation of severe occlusion effects through adaptive local–global feature integration.
Future work will investigate more stable parsing strategies and adaptive feature handling mechanisms to improve discrimination under severe occlusion, low-resolution conditions, and visually similar appearances while maintaining computational efficiency. In particular, the aim is to investigate adaptive part-weighting mechanisms and occlusion-aware feature refinement to enhance the model’s ability to extract meaningful information from lower-body regions, even when they are partially obscured or truncated due to camera limitations.