Article

Occluded Person Re-Identification Method Based on Pedestrian Background Decoupling Transformer

1
School of Foreign Languages, National University of Defense Technology, Nanjing 210012, China
2
School of Computer Science, Engineering Research Center of Digital Forensics Ministry of Education, Nanjing University of Information Science and Technology, Nanjing 210044, China
3
School of Internet of Things Engineering, Wuxi University, Wuxi 214105, China
*
Authors to whom correspondence should be addressed.
Mathematics 2026, 14(10), 1725; https://doi.org/10.3390/math14101725
Submission received: 27 March 2026 / Revised: 29 April 2026 / Accepted: 30 April 2026 / Published: 17 May 2026

Abstract

As urbanization picks up pace and the public demand for security keeps climbing, video surveillance systems have emerged as a vital tool for maintaining social stability and safeguarding public safety. Person Re-Identification (Re-ID), as one of the core technologies in intelligent monitoring, mainly aims to accurately match pedestrian identities across cameras without overlapping fields of view. However, in practical applications, occlusion remains a primary challenge that severely degrades Re-ID performance. Especially in high-density crowds, pedestrians are often partially or completely obscured by other objects or individuals, resulting in incomplete image information and impaired feature representation, which significantly reduces recognition accuracy and reliability. Aiming at the problems of excessive reliance on external pose estimation models and asymmetric information matching in occluded Re-ID, this paper proposes a transformer-based pedestrian background decoupling network. The algorithm achieves foreground–background separation and multi-scale feature matching through the synergy of three modules. Meanwhile, a two-stage training strategy is adopted: the first stage optimizes the decoupling module to ensure clean feature separation, while the second stage jointly fine-tunes the correlation module to enhance matching accuracy. Extensive experimental results show that the proposed algorithm outperforms existing methods.

1. Introduction

Occluded Person Re-Identification (Occluded Person Re-ID) is a key task in the field of computer vision, aiming to accurately match partially occluded pedestrians across camera views [1]. However, person Re-ID in real-world scenarios is frequently challenged by partial occlusions, which significantly degrade the performance of traditional methods by corrupting discriminative pedestrian features. In response, recent efforts have introduced pose-guided algorithms [2] to steer feature learning. Despite this, such approaches exhibit limited efficacy under heavy occlusion and introduce a detrimental reliance on external pose estimation models. This dependency not only increases computational complexity but also reduces robustness in dynamic environments, ultimately undermining their practical utility. In-depth analysis reveals that the main reasons for performance degradation can be attributed to two points: first, the difficulty in effectively extracting pedestrian foreground information; second, the significant semantic difference between occluded images and complete images, which leads to information asymmetry.
To overcome these limitations, this paper proposes a novel transformer-based framework, the Pedestrian Background Decoupling Transformer, which employs an explicit decoupling strategy to improve occlusion-invariant feature learning. Unlike pose-dependent pipelines, our method avoids external pose cues, which simplifies the overall processing pipeline while improving robustness to occlusion on public benchmarks. Recent transformer-based Re-ID methods have shown strong capability in global context modeling, but their strategies for handling occlusion differ substantially. TransReID improves representation learning mainly through transformer-based global modeling and patch-level operations, yet it does not explicitly decouple pedestrian foreground from background clutter. PAT enhances occluded Re-ID through part-aware modeling, but its emphasis is on part prototype learning rather than explicit foreground–background separation and pairwise correlation refinement. PFD further improves occluded Re-ID by introducing pose guidance, but it still depends on external structural priors. In contrast, our method focuses on explicit foreground–background decoupling, cross-image correlation modeling, and progressive fine-grained similarity learning without relying on external pose estimators.
The main contributions are summarized as follows.
(1)
We propose a pedestrian background decoupler that explicitly separates foreground-dominant pedestrian features from background clutter by attention-guided feature decomposition, thereby improving feature purity under occlusion.
(2)
We design a Siamese residual correlation module that projects paired features into a shared comparison space and performs adaptive correlation modeling for occluded-to-holistic matching.
(3)
We introduce a progressive fine-grained correlation learning module to aggregate multi-scale correspondence cues and refine similarity estimation under information asymmetry.
(4)
Extensive experiments on occluded, partial, and holistic person Re-ID benchmarks demonstrate the effectiveness of the proposed framework.

2. Related Work

2.1. Person Re-Identification Methods in Non-Occluded Scenes

In non-occluded scenarios, person Re-ID research primarily focuses on learning discriminative features for accurate matching. A predominant approach is representation learning, which leverages deep models to extract either global features that capture the pedestrian’s holistic appearance, or local features that focus on specific body parts and fine details. A notable example is the work of Sun et al. [3], whose PCB and RPP methods enhance robustness and accuracy through multi-scale local feature extraction and refinement, establishing them as classic benchmarks in the field. Metric learning-based methods aim to enhance the model’s discriminative capability by optimizing the feature space, such that features of the same identity are pulled closer while those of different identities are pushed apart. In parallel, local feature learning methods decompose pedestrian images into multiple regions to extract fine-grained representations, which are more robust to pose changes, viewpoint shifts, and background clutter, particularly in non-occluded settings. Additionally, attention mechanisms simulate human visual attention to emphasize critical regions and suppress irrelevant noise, further improving feature discriminability and robustness.
Very recently, several advanced studies have further promoted occluded person re-identification. Lin et al. [4] proposed a multi-level relation-aware Transformer to capture hierarchical pedestrian dependencies under occlusion. Li et al. [5] introduced occlusion attribute supervision to improve feature robustness against obstacle interference. In addition, Li et al. [6] jointly optimized pedestrian detection and re-identification for heavily occluded person search scenarios. These methods demonstrate the growing importance of Transformer-based feature reasoning and occlusion-aware representation learning.

2.2. Person Re-Identification Methods in Occluded Scenes

In occluded scenes, pedestrian images often suffer from severe missing regions and background noise, which shifts the research focus toward handling incomplete feature representations. Pose-guided approaches alleviate these issues by incorporating human structural priors. For instance, Miao et al. [7] introduced a feature alignment method that leverages pose estimation to isolate body parts from occlusions, significantly improving matching accuracy. Similarly, Gao et al. [8] developed a visibility-aware matching network that employs pose-guided attention to learn discriminative features while predicting part visibility, effectively addressing fine-grained feature loss. Wang et al. [9] proposed the HOReID model, which integrates human body topology with graph convolutional networks and utilizes pose guidance for robust feature alignment in complex occluded scenarios. Subsequently, He et al. [10] introduced TransReID, the first Transformer-based model for person Re-ID, which enhances occlusion robustness through a patch-shuffle strategy that simulates occlusion by regrouping patch features. In addition to global context modeling, local feature extraction has also been explored in Transformer-based Re-ID. To exploit complementary advantages, researchers have developed hybrid architectures that combine CNNs and Transformers. For example, Li et al. [11] proposed the Part-Aware Transformer (PAT), which incorporates a Transformer module into a CNN backbone and performs feature decoupling through a part prototype classifier, thereby improving feature representation in occluded scenes.
As research has progressed, incorporating external auxiliary information has become an important strategy for enhancing Transformer-based Re-ID models. For example, Wang et al. [12] proposed the Pose-guided Feature Disentangling (PFD) method, which integrates pose estimation to separate occluded and non-occluded regions, enabling more discriminative feature extraction and alignment. This approach not only improves recognition accuracy but also mitigates feature loss under severe occlusion. In parallel, feature recovery methods have been developed to reconstruct missing regions by leveraging visible parts and generative models, thereby producing more complete feature representations. Meanwhile, attention-based methods remain prominent because of their ability to suppress occluded regions and emphasize informative visible cues. In summary, existing Transformer-based occluded person Re-ID methods mainly improve performance through global modeling, part-aware learning, or pose-guided feature enhancement. However, explicit foreground–background decoupling combined with pairwise correlation refinement without external pose estimators remains insufficiently explored, which motivates the present study.

3. Framework and Method

To address the issues of pedestrian foreground extraction and information asymmetry in occluded person Re-ID, we propose a Transformer-based pedestrian-background decoupling network. The framework consists of three key components: a Pedestrian Background Decoupler for precise feature separation, a Siamese Residual Network for robust correlation computation, and a Progressive Fine-grained Correlation Learning module for detailed matching. Together, these components enable accurate pedestrian feature extraction and effective cross-image matching in challenging occlusion scenarios.

3.1. Network Design

The proposed architecture, illustrated in Figure 1, integrates the global modeling capability of the Transformer with task-specific designs for Re-ID. Input images are partitioned into fixed-size patches and projected into embedding vectors via a convolutional layer. To preserve spatial relationships and incorporate contextual cues, we introduce learnable positional encodings augmented with camera and viewpoint embeddings, which capture cross-camera and cross-viewpoint feature variations. The core stack comprises 12 Transformer blocks, enhanced with three dedicated components: a Pedestrian Background Decoupler, a Siamese Residual Correlation module, and a Progressive Fine-grained Correlation Learning network, which together improve robustness and matching accuracy.

3.2. Pedestrian Background Decoupler

As illustrated in Figure 2, the Pedestrian Background Decoupler leverages an attention mechanism to decouple the input features into two independent components: a pedestrian foreground feature and a background feature.
To alleviate severe background interference under occlusion, the Pedestrian Background Decoupler employs an attention-guided decomposition strategy that assigns higher responses to pedestrian regions while suppressing noisy background activations. This operation enables the network to preserve identity-discriminative foreground cues. The decoupler then extracts global and local feature representations from each component to suppress interference from occlusions or background noise, thereby guiding the Re-ID task and localizing the primary pedestrian information. Note that the attention-based decoupling is designed to suppress background interference rather than to guarantee perfect foreground–background separation in all cases. In highly cluttered scenes, especially when background textures or colors are similar to pedestrian regions, residual background responses may remain. Therefore, the decoupled foreground feature is further processed by the subsequent correlation modules to reduce the influence of such residual noise during cross-image matching. The decoupler is embedded after the 3rd, 5th, and 7th Transformer layers. Its input is the feature map $X \in \mathbb{R}^{N \times C \times H \times W}$ from the preceding Transformer layer, where $N$ is the batch size, $C$ is the number of channels, and $H$ and $W$ are the height and width of the feature map, respectively. To let the network focus on the more informative channels and to improve the robustness of subsequent attention map generation, the module first applies channel attention enhancement (an SE attention module) to $X$: the input features are adaptively average-pooled, and the pooled result is passed through two convolutional layers and nonlinear activation functions to generate the channel attention weights $w$. The calculation formula is as follows,
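$$w = \sigma\left(W_2 \cdot \delta\left(W_1 \cdot \mathrm{GAP}(X)\right)\right)$$
where $W_1$ and $W_2$ denote learnable weight matrices, $\delta$ denotes the ReLU activation function, $\sigma$ denotes the Sigmoid activation function, and $\mathrm{GAP}(\cdot)$ denotes the average pooling step. The input features are then channel-wise weighted as,
$$X = X \odot w$$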
where ⊙ denotes element-wise multiplication, thereby ensuring each channel is adaptively enhanced or suppressed according to the weights w. Afterward, a set of convolutions is designed as attention heads to extract local information through convolution, obtaining the probability map A that indicates the likelihood of each pixel belonging to the pedestrian foreground,
$$A = \sigma\left(\mathrm{IN}\left(\mathrm{Conv}_{1\times 1}\left(\mathrm{Conv}_{3\times 3}\left(\mathrm{Conv}_{1\times 1}(X)\right)\right)\right)\right)$$
where $\mathrm{Conv}_{1\times 1}$ denotes 1 × 1 convolution, used for dimensionality reduction; $\mathrm{Conv}_{3\times 3}$ denotes 3 × 3 depthwise separable convolution (group convolution, where groups = C/4); $\mathrm{IN}$ denotes Instance Normalization; and $\sigma$ denotes the Sigmoid activation function, which restricts the output to the range $[0, 1]$ to form the foreground probability map. Accordingly, the pedestrian foreground feature $f_g$ and background feature $f_p$ can be expressed as,
$$f_g = X \cdot A$$
$$f_p = X \cdot (1 - A)$$
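The channel reweighting and foreground–background split above can be sketched in a few lines of pure Python for a single feature map of shape C × H × W. This is an illustration under stated assumptions, not the implementation used in this paper: the two convolutions of the SE branch are written as small matrices `w1` and `w2` acting on the pooled vector, the attention map `a` is supplied directly instead of being produced by the convolution and Instance Normalization stack, and `a` is assumed to be a single-channel H × W map broadcast over all channels. All function names and numeric values are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def channel_attention(x, w1, w2):
    """x: C x H x W nested lists; w1: R x C; w2: C x R. Returns C channel weights."""
    # Global average pooling: one scalar per channel.
    gap = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # First projection followed by ReLU (delta in the text).
    hidden = [max(0.0, sum(w1[r][c] * gap[c] for c in range(len(gap))))
              for r in range(len(w1))]
    # Second projection followed by Sigmoid (sigma in the text).
    return [sigmoid(sum(w2[c][r] * hidden[r] for r in range(len(hidden))))
            for c in range(len(w2))]


def reweight(x, w):
    """Channel-wise scaling, i.e. X (element-wise) w."""
    return [[[v * w[c] for v in row] for row in ch] for c, ch in enumerate(x)]


def split_foreground_background(x, a):
    """a: H x W foreground probability map in [0, 1] (assumed shared by all channels)."""
    f_g = [[[v * a[i][j] for j, v in enumerate(row)]
            for i, row in enumerate(ch)] for ch in x]
    f_p = [[[v * (1.0 - a[i][j]) for j, v in enumerate(row)]
            for i, row in enumerate(ch)] for ch in x]
    return f_g, f_p


# Toy input: 2 channels, 2 x 2 spatial grid (hypothetical values).
x = [[[1.0, 2.0], [3.0, 4.0]],
     [[0.5, 0.5], [0.5, 0.5]]]
w1 = [[1.0, 0.0]]          # hypothetical reduction to 1 hidden unit
w2 = [[1.0], [-1.0]]       # hypothetical expansion back to 2 channels
a = [[1.0, 0.0], [0.5, 0.5]]

w = channel_attention(x, w1, w2)
x_weighted = reweight(x, w)
f_g, f_p = split_foreground_background(x_weighted, a)
```

Because the two masks `a` and `1 - a` sum to one at every position, the foreground and background features in this sketch always add back up to the reweighted input, which is what makes the split a decomposition rather than a filter.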
This paper employs a contrastive learning strategy to train the attention module, enabling the deep model to precisely capture semantic distinctions across different object categories while uncovering semantic consistency within the same category. Contrastive loss constrains the attention map A to effectively differentiate between pedestrian foreground and background regions, ensuring that features in the pedestrian foreground region are more discriminative, while the influence of the background region is suppressed, i.e.,
$$L_{attn} = \left\| A \cdot f_p \right\|_2^2 + \left\| (1 - A) \cdot f_g \right\|_2^2$$
where the first term, $\left\| A \cdot f_p \right\|_2^2$, enforces the features in the background region to be suppressed towards zero, thereby ensuring that the attention map accurately localizes the background. The second term, $\left\| (1 - A) \cdot f_g \right\|_2^2$, ensures that the features in the pedestrian foreground region are fully preserved, which prevents the model from misclassifying pedestrian parts as background. A local consistency loss function is designed to guide the attention module in capturing semantically consistent pedestrian representations across horizontally divided image patches. Through contrastive learning applied to these patches, the module learns to comprehensively identify the complete pedestrian foreground. The formulation is as follows:
$$L_{part} = \sum_{p=1}^{K} \sum_{q=1}^{K} \left[ -\log \frac{e^{\mathrm{sim}\left(f_g^p, f_g^q\right)}}{e^{\mathrm{sim}\left(f_g^p, f_g^q\right)} + e^{\mathrm{sim}\left(f_g^p, f_p^q\right)}} - \log \frac{e^{\mathrm{sim}\left(f_p^p, f_p^q\right)}}{e^{\mathrm{sim}\left(f_p^p, f_p^q\right)} + e^{\mathrm{sim}\left(f_p^p, f_g^q\right)}} \right]$$
where $K$ is the number of image patches; $f_g^p$ and $f_g^q$ are the foreground features of the $p$-th and $q$-th patches, respectively; $f_p^p$ and $f_p^q$ are the background features of the $p$-th and $q$-th patches, respectively; and $\mathrm{sim}(\cdot)$ computes cosine similarity. The first term promotes consistency in foreground features across patches, while the second term enforces consistency in background features. A negative contrastive loss is applied to amplify the distinction between foreground and background regions.
Intuitively, $L_{attn}$ and $L_{part}$ play different but complementary roles. $L_{attn}$ regularizes the attention map so that the regions assigned to the foreground retain pedestrian-related responses, while the regions assigned to the background are suppressed. In this way, the decoupler learns where to focus. By contrast, $L_{part}$ operates on horizontally divided patches and encourages semantic consistency across foreground patches while enlarging the discrepancy between foreground and background patches. Therefore, $L_{part}$ teaches the model how to maintain consistent pedestrian semantics across local regions. Their combination improves both localization quality and feature consistency under occlusion.
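As a rough illustration of how the two losses behave, the sketch below evaluates both on toy inputs in pure Python. It assumes the attention separation loss is the sum of two squared L2 norms and that the local consistency loss is a sum of negative log-ratios over all patch pairs; features are flattened to plain lists, and the function names are hypothetical. It is meant to convey the behavior of the terms, not to reproduce the training code.

```python
import math


def attn_loss(x, a):
    """x: flat list of feature values; a: matching foreground probabilities in [0, 1]."""
    f_g = [v * p for v, p in zip(x, a)]            # foreground feature
    f_p = [v * (1.0 - p) for v, p in zip(x, a)]    # background feature
    term_bg = sum((p * b) ** 2 for p, b in zip(a, f_p))
    term_fg = sum(((1.0 - p) * g) ** 2 for p, g in zip(a, f_g))
    return term_bg + term_fg


def cosine(u, v):
    dot = sum(s * t for s, t in zip(u, v))
    norm_u = math.sqrt(sum(s * s for s in u))
    norm_v = math.sqrt(sum(t * t for t in v))
    return dot / (norm_u * norm_v + 1e-12)


def part_loss(fg, bg):
    """fg, bg: lists of K per-patch feature vectors (foreground and background)."""
    total = 0.0
    for p in range(len(fg)):
        for q in range(len(fg)):
            pos_g = math.exp(cosine(fg[p], fg[q]))
            neg_g = math.exp(cosine(fg[p], bg[q]))
            pos_b = math.exp(cosine(bg[p], bg[q]))
            neg_b = math.exp(cosine(bg[p], fg[q]))
            total += -math.log(pos_g / (pos_g + neg_g))
            total += -math.log(pos_b / (pos_b + neg_b))
    return total


# A hard (binary) mask incurs no attention penalty; a soft mask does.
hard = attn_loss([1.0, 2.0], [1.0, 0.0])
soft = attn_loss([2.0], [0.5])

# Well-separated foreground/background patches score lower than identical ones.
separated = part_loss([[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]])
mixed = part_loss([[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
```

Under these definitions, both terms of the attention loss penalize positions where the mask is neither close to 0 nor close to 1, so the loss is smallest for near-binary attention maps. The patch loss is smallest when foreground patches resemble one another and differ from background patches.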
After the Pedestrian Background Decoupler is trained, the standard practice in the final feature extraction stage is to retain only the foreground features, outputting the final attention-weighted feature map:
$$f_{attn} = X \odot A$$
This feature map will be used for subsequent person Re-ID tasks to ensure that the model mainly focuses on the main part of the pedestrian and reduces interference from background and occluded areas.

3.3. Siamese Residual Network Correlation Calculation

To extract high-level semantic information of the correlation between two input feature maps, a Siamese network is used as the backbone structure. The Siamese Residual Network takes paired pedestrian features as input and projects them into a shared feature space, where residual correlation learning is performed to enhance identity consistency across different views and occlusion patterns. The Siamese network processes the query image and gallery image features $f_l^i$ and $f_l^j \in \mathbb{R}^{C \times H \times W}$ (where $l \in \{1, 2, 3\}$ indexes the feature maps from different contrastive attention layers) simultaneously by sharing weights, ensuring symmetry and consistency in the feature extraction process. The core advantage of this design is that it makes the features of the two images comparable in the same feature space, avoiding correlation calculation bias caused by feature extraction differences.
First, the input features are passed through a bottleneck layer to reduce the features to a low-dimensional space, and a single-layer residual structure is incorporated to dynamically balance the original features and the residual component through a learnable weight $\alpha$. Subsequently, the reduced-dimensional features are normalized to alleviate the similarity bias problem caused by feature scale differences. The calculation formulas are as follows,
$$h_i = \mathrm{LayerNorm}\left(W_i f_l^i + b_i\right)$$
$$h_i = h_i + \alpha \cdot R\left(h_i\right)$$
where $R(\cdot)$ denotes the residual mapping function (comprising a linear layer followed by a nonlinear activation), and $\alpha$ is a learnable weight parameter.
To adapt to the diversity of feature differences in different scenarios and help the model establish more accurate correspondences between features at different scales, an adaptive similarity measurement factor σ is introduced,
$$\sigma = \mathrm{Softplus}\left(W_\sigma \left[\mathrm{GAP}\left(f_i\right), \mathrm{GAP}\left(f_j\right)\right]\right)$$
where $\mathrm{GAP}(\cdot)$ represents Global Average Pooling, $W_\sigma$ denotes the parameter matrix, and Softplus ensures the non-negativity of $\sigma$. A fixed $\sigma$ value often fails to generalize across diverse scenarios when feature distributions shift. Consequently, our adaptive $\sigma$ dynamically adjusts the similarity “temperature” based on each input feature pair, preventing the similarity distribution from becoming too peaked (saturating) or too flat (insensitive).
Finally, cross-scale correlation aggregation is performed. The inner product after L2 normalization is equivalent to cosine similarity, which eliminates amplitude differences and focuses only on directional similarity, thus better capturing semantic similarity. The outputs from multiple layers, denoted as $\{S_l\}_{l=1}^{3}$, are then integrated to form a comprehensive multi-dimensional cross-scale correlation tensor $S_D$. Specifically:
$$S_l = \mathrm{Softmax}\left(\frac{h_i^{\top} h_j}{\sigma}\right)$$
$$S_D = \mathrm{Stack}\left(\{S_l\}_{l=1}^{3}\right)$$
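The effect of the adaptive temperature can be seen in a small pure-Python sketch of the correlation step. Several details here are assumptions made for illustration only: features are given as lists of per-position vectors, $W_\sigma$ is reduced to a single weight vector that yields a scalar $\sigma$, the Softmax is taken over gallery positions for each query position, and the same toy features stand in for all three layers. Function names and numeric values are hypothetical.

```python
import math


def softplus(v):
    """Smooth, strictly positive function used to keep sigma > 0."""
    return math.log1p(math.exp(v))


def l2_normalize(vec):
    norm = math.sqrt(sum(c * c for c in vec)) or 1.0
    return [c / norm for c in vec]


def softmax(row):
    peak = max(row)                      # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_sigma(f_i, f_j, w_sigma):
    """Pool both feature sets over positions, concatenate, project to a scalar."""
    dim = len(f_i[0])
    gap_i = [sum(v[k] for v in f_i) / len(f_i) for k in range(dim)]
    gap_j = [sum(v[k] for v in f_j) / len(f_j) for k in range(dim)]
    pooled = gap_i + gap_j               # list concatenation
    return softplus(sum(w * p for w, p in zip(w_sigma, pooled)))


def correlation(h_i, h_j, sigma):
    """Cosine similarity between every query/gallery position, scaled by 1 / sigma."""
    q = [l2_normalize(v) for v in h_i]
    g = [l2_normalize(v) for v in h_j]
    return [softmax([sum(a * b for a, b in zip(qv, gv)) / sigma for gv in g])
            for qv in q]


h_query = [[1.0, 0.0], [0.0, 2.0]]       # two positions, two channels (toy values)
h_gallery = [[3.0, 0.0], [0.0, 1.0]]
sigma = adaptive_sigma(h_query, h_gallery, w_sigma=[0.1, 0.1, 0.1, 0.1])

# Stack one correlation map per layer; identical toy inputs are reused for brevity.
s_d = [correlation(h_query, h_gallery, sigma) for _ in range(3)]

sharp = correlation(h_query, h_gallery, 0.1)    # small sigma: peaked distribution
flat = correlation(h_query, h_gallery, 10.0)    # large sigma: nearly uniform
```

Comparing `sharp` and `flat` shows why a fixed temperature is fragile: the same pair of features produces either a near one-hot or a near-uniform correlation map depending only on $\sigma$.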

3.4. Progressive Fine-Grained Correlation Learning Network

To prevent the model from underestimating the similarity between non-occluded and occluded images due to information asymmetry, this paper proposes a Progressive Fine-grained Correlation Learning (PFCL) network.
First, the multi-dimensional cross-scale correlation tensor $S_D$ is processed through repeated applications of a composite spatial attention block to obtain a result with more channels. A single application passes $S_D$ through a 1 × 1 convolution for channel adjustment, a spatial attention module for feature enhancement, and a normalization layer. This process is defined as follows,
$$f(x) = \mathrm{Norm}\left(\mathrm{Attention}\left(\mathrm{Conv}_{1\times 1}(x)\right)\right)$$
Subsequently, the feature map passes through a channel attention layer. This layer dynamically recalibrates channel-wise weights to emphasize task-relevant, discriminative features while suppressing irrelevant noise. This adaptive process enhances the model’s generalization capability across diverse scenarios,
$$f_h = \sigma\left(W_2 \cdot \mathrm{ReLU}\left(W_1 \cdot f^{N}(x) + b_1\right) + b_2\right)$$
where $N$ is the number of composite spatial attention blocks applied, and $W_1$, $W_2$ and $b_1$, $b_2$ are the weight and bias parameters of the two linear layers, respectively. The channel-reweighted correlation tensor is thus obtained. Higher-level semantic information is then captured through a subsequent distance measurement,
$$f_s = \sigma\left(W_2 \cdot \mathrm{ReLU}\left(W_1 \cdot f^{N}(x) + b_1\right) + b_2\right)$$
To enable the correlation module to accurately capture the correspondences between the compared images, a Siamese Residual Network (SRN) is employed to perform pixel-level correlation computation at each layer. Subsequently, the Progressive Fine-grained Correlation Learning network leverages multi-scale features to capture neighborhood information across varying receptive fields, thereby extracting higher-level semantic features $f_s$. This design also helps alleviate the possible influence of residual background noise after foreground–background decoupling. Even if the attention map cannot perfectly remove all background responses, the SRN and PFCL modules compare query and gallery images through multi-level correspondence modeling and similarity refinement, thereby reducing the impact of noisy local regions on the final matching score. An image similarity loss function is applied to $f_s$:
$$L_{sim} = -\left[ l_{i,j} \log f_s^{i,j} + \left(1 - l_{i,j}\right) \log\left(1 - f_s^{i,j}\right) \right]$$
where $l_{i,j}$ is a binary indicator variable that equals 1 if the $i$-th and $j$-th images belong to the same identity, and 0 otherwise. The model predicts the similarity score $f_s^{i,j}$ between images $i$ and $j$, output through a sigmoid function, with a value range of $[0, 1]$. $L_{sim}$ is the standard binary cross-entropy loss.
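Since the similarity loss is the standard binary cross-entropy, its behavior is easy to check numerically. The sketch below is a minimal pure-Python version; averaging over pairs and clamping scores away from 0 and 1 are assumptions added here for numerical convenience, and the function name is hypothetical.

```python
import math


def similarity_loss(scores, labels, eps=1e-7):
    """Mean binary cross-entropy over image pairs.

    scores: predicted similarities in [0, 1]; labels: 1 for same identity, else 0.
    """
    total = 0.0
    for s, l in zip(scores, labels):
        s = min(max(s, eps), 1.0 - eps)   # avoid log(0)
        total += -(l * math.log(s) + (1 - l) * math.log(1.0 - s))
    return total / len(scores)


# A confident correct prediction is cheap; a confident wrong one is expensive.
good = similarity_loss([0.9, 0.1], [1, 0])
bad = similarity_loss([0.1, 0.9], [1, 0])
undecided = similarity_loss([0.5], [1])
```

The asymmetry between `good` and `bad` is what discourages the model from assigning a low score to an occluded image of the same identity.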

3.5. Training and Inference

The training procedure is divided into two stages for stability and functional decoupling.
Stage 1 focuses on learning reliable foreground-aware representations. In this stage, the Pedestrian Background Decoupler is optimized using identity loss, triplet loss, attention separation loss, and local consistency loss. The purpose of this stage is to enable the network to suppress background interference and preserve semantically consistent pedestrian regions before pairwise matching is introduced.
Stage 2 focuses on cross-image matching. After Stage 1, the decoupler is fixed, and the Siamese Residual Network together with the Progressive Fine-grained Correlation Learning module are trained using the image similarity loss. Fixing the decoupler at this stage stabilizes the foreground representation and allows the subsequent modules to concentrate on learning reliable pairwise correspondences. The loss function in Stage 1 is defined as:
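$$L_{stage1} = L_{id}\left(P\left(X_{cls}\right)\right) + \frac{1}{G} \sum_{i=1}^{G} L_{id}\left(P\left(X_p^i\right)\right) + L_{tri}\left(X_{cls}\right) + \frac{1}{G} \sum_{i=1}^{G} L_{tri}\left(X_p^i\right) + \lambda_1 L_{attn} + \lambda_2 L_{part}$$
where $X_{cls}$ and $X_p^i$ denote the global feature and the $i$-th local feature produced by the Pedestrian Background Decoupler, $G$ is the number of local features, $P(\cdot)$ denotes the fully connected layer, $L_{id}$ and $L_{tri}$ are the identity loss and triplet loss, and $\lambda_1$ and $\lambda_2$ are the corresponding loss weights.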
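The way the Stage 1 terms combine can be written as a short helper. This is only a sketch of the bookkeeping: the individual loss values are assumed to have been computed elsewhere and are passed in as plain numbers, and the weights used in the example are placeholders, not the values used in the experiments.

```python
def stage1_loss(id_global, id_locals, tri_global, tri_locals,
                l_attn, l_part, lam1, lam2):
    """Combine the Stage 1 terms: global losses plus the mean of the local ones."""
    id_term = id_global + sum(id_locals) / len(id_locals)
    tri_term = tri_global + sum(tri_locals) / len(tri_locals)
    return id_term + tri_term + lam1 * l_attn + lam2 * l_part


# Hypothetical loss values for G = 2 local features; lam1 and lam2 are placeholders.
total = stage1_loss(
    id_global=1.0, id_locals=[2.0, 4.0],
    tri_global=0.5, tri_locals=[1.0, 3.0],
    l_attn=0.2, l_part=0.4, lam1=0.5, lam2=0.25,
)
```

Averaging the local terms keeps their contribution on the same scale as the global term regardless of how many local features are used.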
In Stage 2, the multi-level feature maps extracted by the pre-trained Pedestrian Background Decoupler are used to train the correlation modules. The corresponding loss is given by
$$L_{stage2} = L_{sim}$$
During inference, query and gallery images are first transformed into foreground-aware feature maps by the trained decoupler, and these features are then passed through the SRN and PFCL modules to produce the final similarity score.

4. Experiments

4.1. Experimental Parameter Configuration

In this paper, we adopt the Transformer architecture as the foundational backbone network for our model. Initially, the encoder’s weights are pre-trained on the large-scale ImageNet-21K dataset, which provides a rich and diverse set of visual features to kick-start the learning process.
During the experimental phase of this study, all input images are uniformly resized to a dimension of 256 × 128 pixels to ensure consistency in data processing. Subsequently, these resized images are partitioned into non-overlapping patches, each measuring 16 × 16 pixels. As a result, a total of N = 128 patches are generated from each image.
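The patch count follows directly from the image and patch sizes: a 256 × 128 image divided into 16 × 16 patches gives a 16 × 8 grid. A small helper (the function name is hypothetical) makes the arithmetic explicit:

```python
def patch_grid(height, width, patch):
    """Return (rows, cols, total) for non-overlapping square patches."""
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    rows, cols = height // patch, width // patch
    return rows, cols, rows * cols


print(patch_grid(256, 128, 16))  # (16, 8, 128)
```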
In terms of batch processing, we set the batch size to 32. Moreover, for each identity in the dataset, we include 4 images to enrich the training data and enhance the model’s ability to learn distinct features associated with different individuals.
To bolster the model’s generalization capacity, enabling it to perform well on unseen data, we subject the training images to a series of augmentation techniques. These include random horizontal flipping, which creates a mirrored version of the image; padding, which adds extra pixels around the image edges; random cropping, which extracts a random portion of the image; and random erasing, which randomly removes a part of the image.
For optimization, we employ the Stochastic Gradient Descent (SGD) optimizer with a weight decay of $10^{-4}$, which helps prevent overfitting by penalizing large weights. The learning rate is initially set to 0.008 and is then decayed following a cosine schedule, which adjusts the learning rate smoothly during training and facilitates convergence.
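For readers unfamiliar with cosine decay, the following sketch shows one common form of the schedule starting from the stated initial rate of 0.008. The total number of steps, the final rate of zero, and the absence of a warm-up phase are assumptions made for illustration; the text does not specify them.

```python
import math


def cosine_lr(step, total_steps, base_lr=0.008, min_lr=0.0):
    """Cosine-annealed learning rate; min_lr and total_steps are assumed values."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at a few points of a hypothetical 100-step schedule.
schedule = [cosine_lr(s, 100) for s in (0, 25, 50, 75, 100)]
```

The rate falls slowly at first, fastest around the midpoint, and slowly again near the end, which avoids the abrupt drops of a step schedule.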
To ensure statistical reliability, all experiments were repeated 3 times with different random seeds. Results are reported as mean ± standard deviation (std) over the three runs.
In terms of computational cost, our model contains 34.2 M parameters and requires 9.7 GFLOPs for a 256 × 128 input. For reference, under the same setting the baseline TransReID has 86.5 M parameters and 12.3 GFLOPs, while PFD has 94.3 M parameters and 15.2 GFLOPs. Our method therefore uses fewer parameters and fewer FLOPs than both TransReID and PFD.

4.2. Experimental Results and Analysis

4.2.1. On Occluded Datasets

This paper presents a set of comparative experiments conducted on the Occluded-DukeMTMC dataset, with the outcomes detailed in Table 1. The algorithms compared in this study include four categories: firstly, traditional algorithms tailored for non-occluded scenarios; secondly, algorithms that specifically leverage Convolutional Neural Networks (CNNs) to tackle occlusion challenges; thirdly, hybrid algorithms that integrate CNN and Transformer architectures; and fourthly, algorithms based on the Vision Transformer (ViT). Notably, the proposed method attained a Rank-1 accuracy of 72.3% and a mean Average Precision (mAP) of 65.5% on the Occluded-DukeMTMC dataset. The experimental findings demonstrate that, in comparison with existing methods and those relying solely on CNNs for occlusion handling, the proposed method exhibits superior performance.
The experiments on the Occluded-Market1501 and Occluded-REID datasets are presented in Table 2. The proposed method achieves the best results on both datasets. Specifically, on the Occluded-Market1501 dataset, it attains 83.7% Rank-1 accuracy and 69.1% mean Average Precision (mAP). On the Occluded-REID dataset, it achieves 87.9% Rank-1 accuracy and 84.1% mAP.
Compared with existing methods, the proposed approach surpasses other mainstream techniques on both datasets. For instance, on the Occluded-Market1501 dataset, the Transformer-based method TransReID, which previously held the best result, reaches 78.2% Rank-1 accuracy and 64.7% mAP. The proposed method improves on this by 5.5 percentage points in Rank-1 accuracy and 4.4 percentage points in mAP.
On the Occluded-REID dataset, the strong SPT method achieves 86.8% Rank-1 accuracy and 81.3% mAP, which the proposed method raises to 87.9% Rank-1 accuracy and 84.1% mAP, gains of 1.1 and 2.8 percentage points, respectively. These results suggest that existing methods still exhibit notable limitations when addressing occlusion.

4.2.2. On Partial Datasets

Table 3 reports the results on the Partial-REID and Partial-iLIDS datasets, which reveal substantial performance differences among methods in partially occluded scenarios.
Take the traditional method of PCB as an illustrative case. Owing to the inherent constraints of its local feature extraction mechanism, it struggles to comprehensively and precisely represent pedestrian feature information in the intricate context of partial occlusion. Although enhanced CNN-based methods like VPM, DSR, and STNReID have, to a certain extent, mitigated this issue, they still harbor certain shortcomings.
In contrast, methods that employ attention mechanisms and adopt local–global feature fusion strategies, such as PGFA, FPR, HOReID, PVPM, and PFT, exhibit more pronounced advantages in enhancing recognition efficacy.
The proposed method achieves the best results among the compared methods. On the Partial-REID dataset, it attains 86.6% Rank-1 (R-1) accuracy and 94.3% Rank-3 (R-3) accuracy. On the Partial-iLIDS dataset, it achieves 85.5% R-1 and 90.7% R-3. These results indicate that the proposed method outperforms all other compared methods, supporting its use for recognition tasks under complex occlusion conditions.

4.2.3. On Holistic Datasets

For the holistic datasets, the compared methods can be broadly categorized into three groups according to their design philosophies. The first group includes traditional local alignment methods, such as PCB and PGFA. The second group consists of approaches that focus on optimizing global features and similarity metrics, such as TransReID and FRT. The third group contains methods that further enhance local information, such as VPM and HOReID.
A closer examination of the results shows that traditional local alignment methods exhibit relatively limited performance in occluded scenarios, mainly because their local feature extraction is insufficient for capturing discriminative pedestrian cues when parts of the image are obscured. By contrast, global feature learning methods improve overall matching performance, but they still struggle with asymmetric occlusion because crucial local details may be overlooked.
In comparison, the proposed method combines pedestrian background decoupling, Siamese residual correlation modeling, and progressive fine-grained correlation learning. Through foreground–background separation, the method reduces interference from irrelevant regions. The Siamese residual correlation module captures more stable cross-image relationships, while the progressive fine-grained learning module further refines multi-scale correspondences for matching.
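The idea of foreground–background separation can be illustrated with a soft mask over token features. The sketch below is a simplified assumption and not the authors' implementation: it treats the foreground probability of each token as a sigmoid score and pools the tokens with weights A and 1 − A. The names `decouple`, `tokens`, and `logits` are hypothetical, and the channel attention enhancement step is omitted.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def decouple(
    tokens: Sequence[Sequence[float]],
    logits: Sequence[float],
) -> Tuple[List[float], List[float], List[float]]:
    """Pool token features into foreground and background descriptors.

    tokens: N token vectors of equal length (assumed intermediate features).
    logits: N raw scores from a hypothetical attention head.
    """
    probs = [sigmoid(z) for z in logits]  # foreground probability per token
    dim = len(tokens[0])
    fg = [0.0] * dim
    bg = [0.0] * dim
    for x, p in zip(tokens, probs):
        for c in range(dim):
            fg[c] += p * x[c]          # weighted by A
            bg[c] += (1.0 - p) * x[c]  # weighted by 1 - A
    fg_mass = sum(probs)
    bg_mass = sum(1.0 - p for p in probs)
    fg = [v / fg_mass for v in fg]
    bg = [v / bg_mass for v in bg]
    return fg, bg, probs


# Toy input: two "pedestrian" tokens and two "background" tokens.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
logits = [4.0, 4.0, -4.0, -4.0]
fg, bg, probs = decouple(tokens, logits)
print("foreground:", fg)
print("background:", bg)
```

Tokens with a high foreground probability dominate `fg`, while the remaining tokens are absorbed into `bg`, so only the former would be passed on to matching. A learned decoupler would produce the scores from the features themselves rather than receive them as input.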
As shown in Table 4, on the Market-1501 dataset the proposed method achieves 95.5% Rank-1 accuracy and 89.9% mean Average Precision (mAP). On the DukeMTMC dataset, it achieves 91.9% Rank-1 accuracy and 84.8% mAP. These results demonstrate the effectiveness and robustness of the proposed framework on standard public Re-ID benchmarks.
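Whereas Rank-1 considers only the top-ranked gallery image, mAP rewards placing every correct match early in the ranking. A minimal sketch of the computation is given below. The function names and the toy match lists are assumptions for illustration, and each list is assumed to be already sorted by ascending distance to the query.

```python
from typing import List, Sequence


def average_precision(ranked_matches: Sequence[bool]) -> float:
    """AP for one query; ranked_matches[i] is True if the i-th ranked image is correct."""
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / hits if hits else 0.0


def mean_average_precision(all_ranked_matches: Sequence[Sequence[bool]]) -> float:
    aps: List[float] = [average_precision(m) for m in all_ranked_matches]
    return sum(aps) / len(aps)


# Toy rankings (hypothetical) for two queries.
ranked = [
    [True, False, True, False],   # AP = (1/1 + 2/3) / 2
    [False, True, False, False],  # AP = 1/2
]
map_score = mean_average_precision(ranked)
print(f"mAP = {map_score:.4f}")
```

The first query has a correct match at rank 1, yet its AP is below 1 because the second correct image only appears at rank 3. This is why a method can have a high Rank-1 score and still a noticeably lower mAP.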

4.3. Ablation Studies

4.3.1. Effectiveness Analysis of Each Module and Loss Function

To investigate the impact of each module and loss function on task performance, ablation experiments were conducted on the Occluded-DukeMTMC dataset. Comparative analysis was performed by removing or adding the following key components: (1) Pedestrian Background Decoupler (PBD); (2) Siamese Residual Network (SRN); (3) Correlation Learning module (CL); (4) attention separation loss $L_{attn}$; (5) local consistency loss $L_{part}$; and (6) image similarity loss $L_{sim}$.
The ablation results reveal the functional role of each module rather than merely reporting incremental gains. The Pedestrian Background Decoupler provides the largest single improvement because it directly addresses the most fundamental challenge in occluded person Re-ID, namely the contamination of pedestrian representations by background clutter and occluders. Once the foreground representation becomes cleaner, the correlation learning component further improves performance by modeling local correspondences between query and gallery images, which is particularly important when visible regions are incomplete or spatially misaligned. The Siamese Residual Network contributes additional gains by projecting paired features into a shared and more stable comparison space, thereby reducing feature inconsistency across image pairs.
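The size of these gains can be read directly from Table 5. The short sketch below computes the improvement of each configuration over the first row, using the Rank-1 and mAP values reported in that table. Treating index 1 as the baseline and index 8 as the full model is an assumption based on their scores, and the variable names are illustrative.

```python
# (Rank-1 %, mAP %) per ablation index, as reported in Table 5.
ablation = {
    1: (58.2, 48.3),
    2: (66.3, 57.1),
    3: (67.1, 58.6),
    4: (69.6, 60.3),
    5: (71.3, 61.0),
    6: (68.6, 58.3),
    7: (70.7, 63.1),
    8: (72.3, 65.5),
}

# Assumption: index 1 is the baseline without the proposed components.
baseline_r1, baseline_map = ablation[1]

gains = {
    idx: (round(r1 - baseline_r1, 1), round(m_ap - baseline_map, 1))
    for idx, (r1, m_ap) in ablation.items()
}

for idx, (d_r1, d_map) in gains.items():
    print(f"index {idx}: Rank-1 {d_r1:+.1f}, mAP {d_map:+.1f}")
```

Under this reading, the last configuration improves on the first row by 14.1 percentage points in Rank-1 accuracy and 17.2 percentage points in mAP.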
The three loss terms also play complementary supervisory roles. Specifically, $L_{attn}$ improves foreground–background separation, $L_{part}$ enhances semantic consistency across local foreground regions, and $L_{sim}$ directly optimizes pairwise similarity estimation. Therefore, the best performance is achieved only when feature purification, pairwise correlation construction, and similarity supervision are jointly integrated. The detailed ablation results are summarized in Table 5.
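One way to picture how the three loss terms interact with the two-stage training strategy is a staged weighted sum. The sketch below is only an assumption about how such an objective might be assembled: the weights, the function name `total_loss`, and the exact assignment of losses to stages are hypothetical, and any identity or metric losses used by the full model are left out.

```python
def total_loss(
    l_attn: float,
    l_part: float,
    l_sim: float,
    stage: int,
    w_attn: float = 1.0,  # hypothetical weight
    w_part: float = 1.0,  # hypothetical weight
    w_sim: float = 1.0,   # hypothetical weight
) -> float:
    """Combine the three loss terms for a given training stage.

    Stage 1 (assumed): only the decoupler losses are active.
    Stage 2 (assumed): the similarity loss is added for joint fine-tuning.
    """
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    loss = w_attn * l_attn + w_part * l_part
    if stage == 2:
        loss += w_sim * l_sim
    return loss


# Toy loss values (hypothetical) for one mini-batch.
stage1 = total_loss(0.40, 0.25, 0.90, stage=1)
stage2 = total_loss(0.40, 0.25, 0.90, stage=2)
print(f"stage 1: {stage1:.2f}, stage 2: {stage2:.2f}")
```

Setting a weight to zero reproduces the removal of the corresponding loss term, which is how an ablation over the loss functions could be run without changing the training loop.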

4.3.2. Parameter Sensitivity Analysis

This experiment investigates the impact of varying the number of image patches on model performance. As shown in Figure 3, dividing the image into four patches yields the best Re-ID performance, achieving 69.6% Rank-1 accuracy and 60.3% mAP. This suggests that an appropriate patch division facilitates more effective local feature learning, allowing the model to capture stronger semantic relationships across different body regions and thereby improve recognition accuracy.
When the number of patches is too small (e.g., 0 or 2), the model fails to capture sufficient fine-grained local features. For example, with only two patches corresponding roughly to the upper and lower body, the appearance of different pedestrians in these regions can vary substantially, creating semantic gaps that hinder the model from learning consistent alignment relationships across samples. Moreover, insufficient local contrast forces the model to rely more heavily on global features, which reduces its adaptability to occlusion and pose variation.
Conversely, an excessively large number of patches may lead to over-fragmentation of local features. In such cases, individual patches may contain too little discriminative information, which weakens their ability to represent meaningful body parts. In addition, some body regions may contribute limited discriminative cues, thereby impairing the model’s ability to capture global semantics. As a result, overall performance may stagnate or even degrade.
By contrast, the four-patch setting achieves a better balance between global and local information, enabling the model to extract fine-grained features without excessively fragmenting the representation. Based on the current experimental setting, we therefore adopt the four-patch configuration as an empirical trade-off between local detail preservation and feature fragmentation.
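The patch division studied here can be sketched as splitting a feature map into equal horizontal stripes and pooling each one. This is an illustrative assumption rather than the paper's exact partition scheme: the function name `stripe_pool`, the use of average pooling, and the toy 8-row feature map are hypothetical.

```python
from typing import List, Sequence


def stripe_pool(
    feature_rows: Sequence[Sequence[float]],
    num_patches: int,
) -> List[List[float]]:
    """Average-pool a feature map (H row vectors) into horizontal stripes."""
    height = len(feature_rows)
    if not 1 <= num_patches <= height:
        raise ValueError("num_patches must be between 1 and the number of rows")
    dim = len(feature_rows[0])
    pooled = []
    for p in range(num_patches):
        # Integer arithmetic keeps stripes contiguous and covers every row once.
        start = p * height // num_patches
        end = (p + 1) * height // num_patches
        stripe = feature_rows[start:end]
        pooled.append(
            [sum(row[c] for row in stripe) / len(stripe) for c in range(dim)]
        )
    return pooled


# Toy feature map: 8 rows, 2 channels.
feature_rows = [[float(i), 2.0 * i] for i in range(8)]
pooled = stripe_pool(feature_rows, num_patches=4)
for index, vector in enumerate(pooled):
    print(f"patch {index}: {vector}")
```

With `num_patches=1` the function reduces to global average pooling, and with `num_patches=8` every row becomes its own patch. The four-patch setting sits between these extremes, which mirrors the trade-off between global context and fragmentation discussed above.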

5. Conclusions

In real-world scenarios, occlusion frequently leads to critical issues such as loss of key features and uneven information distribution, significantly degrading the performance of person Re-ID systems. Traditional Re-ID methods often fail to effectively separate pedestrians from background interference under occluded conditions, resulting in a notable decline in matching accuracy. Moreover, these methods commonly rely on external pose estimation models, which not only introduce additional computational overhead but are also constrained by the accuracy of the pose estimators themselves. Consequently, developing an efficient Re-ID approach that handles occlusion effectively without depending on external modules has become a crucial challenge in the field. Motivated by this, we conduct the following research in this paper:
  1. This paper investigates Transformer-based Re-ID, examining its unique strengths in global feature extraction and long-range dependency modeling. The approach compensates for the limitations of traditional convolutional neural networks in local feature representation under occluded scenarios, thereby offering a promising solution for occluded person Re-ID.
  2. To reduce the dependency of traditional methods on external pose estimation models, this paper designs a Transformer-based algorithm with pedestrian foreground–background decoupling. By incorporating a Pedestrian Background Decoupler (PBD), the model separates foreground pedestrian regions from background interference on its own, thereby avoiding errors introduced by external modules and improving adaptability in complex occluded scenarios.

Author Contributions

Conceptualization, C.Y. and Q.L.; methodology, X.L. and Y.C. (Yuheng Chen); software, Y.L.; validation, Y.C. (Yuheng Chen), Y.W. and Y.C. (Yi Cao); formal analysis, X.L.; investigation, X.L., Y.W. and Y.L.; writing—original draft preparation, X.L.; writing—review and editing, C.Y. and Q.L.; supervision, C.Y. and Q.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under grants U22B2062 and U23B2023.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to the protection of the ongoing study.

Acknowledgments

The authors thank all anonymous reviewers for their constructive comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, H.; Guo, J.; Deng, C.; Fan, Y.; Gu, F. Can video surveillance systems promote the perception of safety? Evidence from surveys on residents in Beijing, China. Sustainability 2019, 11, 1595.
  2. Rami, H.; Giraldo, J.H.; Winckler, N.; Lathuilière, S. Source-guided similarity preservation for online person re-identification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 1–6 January 2024; pp. 1711–1720.
  3. Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q.; Wang, S. Beyond Part Models: Person Retrieval with Refined Part Pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 480–496.
  4. Lin, G.; Bao, Z.; Huang, Z.; Li, Z.; Zheng, W.S.; Chen, Y. A Multi-Level Relation-Aware Transformer Model for Occluded Person Re-Identification. Neural Netw. 2024, 177, 106382.
  5. Ren, T.; Lian, Q.; Chen, J. Boosting Occluded Person Re-Identification by Leveraging Occlusion Attributes. Inf. Sci. 2025, 701, 121866.
  6. Li, Y.; Shuai, S.; Zhou, Y.; Deng, B.; Zhang, D. Joint Detection and Re-Identification for Occluded Person Search. Sci. Rep. 2025, 15, 22470.
  7. Miao, J.; Wu, Y.; Liu, P.; Ding, Y.; Yang, Y. Pose-guided feature alignment for occluded person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 542–551.
  8. Gao, S.; Wang, J.; Lu, H.; Liu, Z. Pose-Guided Visible Part Matching for Occluded Person ReID. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11744–11752.
  9. Wang, G.; Yang, S.; Liu, H.; Wang, Z.; Yang, Y.; Wang, S.; Yu, G.; Zhou, E.; Sun, J. High-order information matters: Learning relation and topology for occluded person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6449–6458.
  10. He, S.; Luo, H.; Wang, P.; Wang, F.; Li, H.; Jiang, W. TransReID: Transformer-based Object Re-Identification. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Virtual Conference, 11–17 October 2021; pp. 14993–15002.
  11. Li, Y.; He, J.; Zhang, T.; Liu, X.; Zhang, Y.; Wu, F. Diverse part discovery: Occluded person re-identification with part-aware transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual Conference, 19–25 June 2021; pp. 2898–2907.
  12. Wang, T.; Liu, H.; Song, P.; Guo, T.; Shi, W. Pose-guided feature disentangling for occluded person re-identification based on transformer. Proc. AAAI Conf. Artif. Intell. 2022, 36, 2540–2549.
  13. Tan, L.; Dai, P.; Ji, R.; Wu, Y. Dynamic prototype mask for occluded person re-identification. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 531–540.
  14. Suh, Y.; Wang, J.; Tang, S.; Mei, T.; Lee, K.M. Part-aligned bilinear representations for person re-identification. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 402–419.
  15. Li, W.; Zhu, X.; Gong, S. Harmonious attention network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2285–2294.
  16. Huang, H.; Li, D.; Zhang, Z.; Chen, X.; Huang, K. Adversarially occluded samples for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5098–5107.
  17. Ge, Y.; Li, Z.; Zhao, H.; Yin, G.; Yi, S.; Wang, X.; Li, H. FD-GAN: Pose-guided feature distilling GAN for robust person re-identification. Adv. Neural Inf. Process. Syst. 2018, 31.
  18. Zhu, K.; Guo, H.; Liu, Z.; Tang, M.; Wang, J. Identity-guided human semantic parsing for person re-identification. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part III 16; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 346–363.
  19. Jia, M.; Cheng, X.; Zhai, Y.; Lu, S.; Ma, S.; Tian, Y.; Zhang, J. Matching on sets: Conquer occluded person re-identification without alignment. Proc. AAAI Conf. Artif. Intell. 2021, 35, 1673–1681.
  20. Wang, P.; Ding, C.; Shao, Z.; Hong, Z.; Zhang, S.; Tao, D. Quality-aware part models for occluded person re-identification. IEEE Trans. Multimed. 2022, 25, 3154–3165.
  21. Liu, Z.; Mu, X.; Lu, Y.; Zhang, T.; Tian, Y. Learning transformer-based attention region with multiple scales for occluded person re-identification. Comput. Vis. Image Underst. 2023, 229, 103652.
  22. Jia, M.; Cheng, X.; Lu, S.; Zhang, J. Learning disentangled representation implicitly via transformer for occluded person re-identification. IEEE Trans. Multimed. 2022, 25, 1294–1305.
  23. Wang, Z.; Zhu, F.; Tang, S.; Zhao, R.; He, L.; Song, J. Feature erasing and diffusion network for occluded person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4754–4763.
  24. Luo, H.; Jiang, W.; Gu, Y.; Liu, F.; Liao, X.; Lai, S.; Gu, J. A strong baseline and batch normalization neck for deep person re-identification. IEEE Trans. Multimed. 2019, 22, 2597–2609.
  25. Zhou, K.; Yang, Y.; Cavallaro, A.; Xiang, T. Omni-scale feature learning for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3702–3712.
  26. Tan, L.; Xia, J.; Liu, W.; Dai, P.; Wu, Y.; Cao, L. Occluded person re-identification via saliency-guided patch transfer. Proc. AAAI Conf. Artif. Intell. 2024, 38, 5070–5078.
  27. Sun, Y.; Xu, Q.; Li, Y.; Zhang, C.; Li, Y.; Wang, S.; Sun, J. Perceive where to focus: Learning visibility-aware part-level features for partial person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 393–402.
  28. He, L.; Liang, J.; Li, H.; Sun, Z. Deep spatial feature reconstruction for partial person re-identification: Alignment-free approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7073–7082.
  29. Luo, H.; Jiang, W.; Fan, X.; Zhang, C. STNReID: Deep convolutional networks with pairwise spatial transformer networks for partial person re-identification. IEEE Trans. Multimed. 2020, 22, 2905–2913.
  30. He, L.; Wang, Y.; Liu, W.; Liao, X.; Zhao, H.; Sun, Z.; Feng, J. Foreground-aware pyramid reconstruction for alignment-free occluded person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8450–8459.
  31. Zhao, Y.; Zhu, S.; Wang, D.; Liang, Z. Short range correlation transformer for occluded person re-identification. Neural Comput. Appl. 2022, 34, 17633–17645.
  32. Song, C.; Huang, Y.; Ouyang, W.; Wang, L. Mask-guided contrastive attention model for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1179–1188.
  33. Liu, J.; Ni, B.; Yan, Y.; Zhou, P.; Cheng, S.; Hu, J. Pose transferrable person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4099–4108.
  34. Chen, P.; Liu, W.; Dai, P.; Liu, J.; Ye, Q.; Xu, M.; Chen, Q.; Ji, R. Occlude them all: Occlusion-aware attention network for occluded person re-id. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual Conference, 11–17 October 2021; pp. 11833–11842.
  35. Xu, B.; He, L.; Liang, J.; Sun, Z. Learning feature recovery transformer for occluded person re-identification. IEEE Trans. Image Process. 2022, 31, 4651–4662.
Figure 1. Overall pipeline of the proposed framework, including the Pedestrian Background Decoupler, Siamese Residual Network, and Progressive Fine-grained Correlation Learning module. The three modules are connected sequentially for foreground purification, pairwise correlation construction, and final similarity estimation.
Figure 2. Structure of the Pedestrian Background Decoupler. Given an intermediate Transformer feature map X, the module first applies channel attention enhancement and then generates a foreground probability map A through the attention head. The reweighted feature map is subsequently decomposed into foreground and background branches, producing foreground features, background features, and the final foreground-aware representation used for person Re-ID. The decoupler is supervised by the attention separation loss and the local consistency loss to improve foreground localization and suppress background interference.
Figure 3. Effect of the number of image patches on recognition performance on the Occluded-DukeMTMC dataset. Rank-1 accuracy and mAP both improve as local partitioning becomes finer and reach their best values at four patches, after which excessive fragmentation leads to a slight performance drop. The results indicate that the four-patch setting provides an effective trade-off between local detail preservation and global semantic integrity.
Table 1. Performance Analysis on Occluded-DukeMTMC Dataset.
| Method | Rank-1 (%) | mAP (%) |
| --- | --- | --- |
| Part aligned [13] | 28.8 | 20.2 |
| HACNN [14] | 34.4 | 26.0 |
| PCB [3] | 42.6 | 33.7 |
| AdverOcclusion [15] | 44.5 | 32.2 |
| FD-GAN [16] | 40.8 | – |
| PGFA [7] | 51.4 | 37.3 |
| HONet [9] | 55.1 | 43.8 |
| ISP [17] | 62.8 | 52.3 |
| MoS [18] | 61.0 | 49.2 |
| QPM [19] | 64.4 | 49.7 |
| OPR-DAAO [20] | 64.8 | 47.5 |
| PAT [11] | 64.5 | 53.6 |
| DRL-Net [21] | 65.8 | 53.9 |
| FRT [22] | 70.7 | 61.3 |
| TransReID [10] | 64.2 | 55.7 |
| PFD [12] | 67.7 | 60.1 |
| FED [21] | 68.1 | 56.1 |
| Proposed Method | 72.3 ± 0.4 | 65.5 ± 0.3 |
Table 2. Performance Analysis on Occluded-Market1501 and Occluded-REID Datasets.
| Method | Occluded-Market1501 Rank-1 (%) | Occluded-Market1501 mAP (%) | Occluded-REID Rank-1 (%) | Occluded-REID mAP (%) |
| --- | --- | --- | --- | --- |
| PCB [3] | 66.0 | 49.4 | 41.3 | 38.9 |
| BoT [23] | 70.6 | 51.5 | – | – |
| OSNet [24] | 65.5 | 42.8 | – | – |
| TransReID [10] | 78.2 | 64.7 | – | – |
| PGFA [7] | 64.1 | 45.5 | – | – |
| PVPM [8] | 66.8 | 49.4 | 70.4 | 61.2 |
| HOReID [9] | 64.9 | 49.3 | 80.3 | 70.2 |
| PFD [12] | – | – | 79.8 | 81.3 |
| FRT [22] | – | – | 80.4 | 71.0 |
| SPT [25] | 68.6 | 57.4 | 86.8 | 81.3 |
| Proposed Method | 83.7 ± 0.4 | 69.1 ± 0.4 | 87.9 ± 3 | 84.1 ± 0.4 |
Table 3. Performance Analysis on Partial-REID and Partial-iLIDS Datasets.
| Method | Partial-REID Rank-1 (%) | Partial-REID Rank-3 (%) | Partial-iLIDS Rank-1 (%) | Partial-iLIDS Rank-3 (%) |
| --- | --- | --- | --- | --- |
| PCB [3] | 66.3 | – | 46.8 | – |
| VPM [26] | 67.7 | 81.9 | 67.2 | 76.5 |
| DSR [27] | 50.7 | 70.0 | 58.8 | 67.2 |
| STNReID [28] | 66.7 | 80.3 | 54.6 | 76.3 |
| PGFA [7] | 68.0 | 80.0 | 69.1 | 80.9 |
| FPR [29] | 81.0 | – | 68.1 | – |
| HOReID [9] | 85.3 | 91.0 | 72.6 | 86.4 |
| PVPM [8] | 78.3 | 87.7 | – | – |
| PFT [30] | 81.3 | – | 74.8 | 87.3 |
| Proposed Method | 86.6 ± 0.4 | 94.3 ± 0.3 | 85.5 ± 0.4 | 90.7 ± 0.3 |
Table 4. Performance Analysis on Market-1501 and DukeMTMC Datasets.
| Method | Market-1501 Rank-1 (%) | Market-1501 mAP (%) | DukeMTMC Rank-1 (%) | DukeMTMC mAP (%) |
| --- | --- | --- | --- | --- |
| PCB [3] | 92.3 | 77.4 | 81.8 | 66.1 |
| PGFA [7] | 91.2 | 76.8 | 82.6 | 65.5 |
| VPM [26] | 93.0 | 80.8 | 83.6 | 72.6 |
| MGCAN [31] | 83.8 | 74.3 | 46.7 | 46.0 |
| PT [32] | 87.7 | 68.9 | 78.5 | 56.9 |
| HOReID [9] | 94.2 | 84.9 | 86.9 | 75.6 |
| OAMN [33] | 92.3 | 79.8 | 86.3 | 72.6 |
| MGN [34] | 95.7 | 86.9 | 88.7 | 78.4 |
| SPT [25] | 94.5 | 86.2 | 89.4 | 79.1 |
| PAT [11] | 95.4 | 88.0 | 88.8 | 78.2 |
| DRL-Net [21] | 94.7 | 86.9 | 88.1 | 76.6 |
| FED [35] | 95.0 | 86.3 | 89.4 | 78.0 |
| FRT [22] | 95.5 | 88.1 | 90.5 | 81.7 |
| Proposed Method | 95.5 | 89.9 | 91.9 | 84.8 |
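Table 5. Ablation Experiment Results.
| Index | PBD | SRN | CL | $L_{attn}$ | $L_{part}$ | $L_{sim}$ | Rank-1 (%) | mAP (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | | | | | | | 58.2 | 48.3 |
| 2 | | | | | | | 66.3 | 57.1 |
| 3 | | | | | | | 67.1 | 58.6 |
| 4 | | | | | | | 69.6 | 60.3 |
| 5 | | | | | | | 71.3 | 61.0 |
| 6 | | | | | | | 68.6 | 58.3 |
| 7 | | | | | | | 70.7 | 63.1 |
| 8 | | | | | | | 72.3 | 65.5 |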
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, X.; Chen, Y.; Wu, Y.; Liang, Y.; Cao, Y.; Liu, Q.; Yuan, C. Occluded Person Re-Identification Method Based on Pedestrian Background Decoupling Transformer. Mathematics 2026, 14, 1725. https://doi.org/10.3390/math14101725
