Dual-Branch Occlusion-Aware Semantic Part-Features Extraction Network for Occluded Person Re-Identification

Sun, Bo; Zhang, Yulong; Wang, Jianan; Jiang, Chunmao

doi:10.3390/math13152432


by Bo Sun *, Yulong Zhang, Jianan Wang and Chunmao Jiang

School of Computer and Mathematics, Fujian University of Technology, Fuzhou 350118, China

* Author to whom correspondence should be addressed.

Mathematics 2025, 13(15), 2432; https://doi.org/10.3390/math13152432

Submission received: 9 June 2025 / Revised: 18 July 2025 / Accepted: 24 July 2025 / Published: 28 July 2025


Abstract

Occlusion remains a major challenge in person re-identification, as it often leads to incomplete or misleading visual cues. To address this issue, we propose a dual-branch occlusion-aware network (DOAN), which explicitly and implicitly enhances the model’s capability to perceive and handle occlusions. The proposed DOAN framework comprises two synergistic branches. In the first branch, we introduce an Occlusion-Aware Semantic Attention (OASA) module to extract semantic part features, incorporating a parallel channel and spatial attention (PCSA) block to precisely distinguish between pedestrian body regions and occlusion noise. We also generate occlusion-aware parsing labels by combining external human parsing annotations with occluder masks, providing structural supervision to guide the model in focusing on visible regions. In the second branch, we develop an occlusion-aware recovery (OAR) module that reconstructs occluded pedestrians to their original, unoccluded form, enabling the model to recover missing semantic information and enhance occlusion robustness. Extensive experiments on occluded, partial, and holistic benchmark datasets demonstrate that DOAN consistently outperforms existing state-of-the-art methods.

Keywords: occluded person re-identification; part-aware feature learning; occlusion data augmentation; attention mechanism; feature recovery

MSC: 68T07

1. Introduction

Person re-identification (Re-ID) is a fundamental task in computer vision that aims to match individuals across non-overlapping camera views [1]. Although substantial progress has been achieved in holistic person Re-ID, with a range of effective methods proposed [1,2], real-world surveillance scenarios—such as urban streets, transportation hubs, and retail environments—frequently involve partial occlusions caused by other pedestrians or surrounding objects [3]. These occlusions often lead to missing body parts or the inclusion of irrelevant visual information, presenting significant challenges to accurate person identification. Therefore, developing a robust framework for occluded person re-identification holds considerable practical importance.

To address the issue of incomplete body information caused by occlusion, existing methods mainly focus on extracting discriminative features from visible body parts. These methods can be broadly categorized into (a) those that utilize external guidance cues—such as pose estimation or parsing maps—to extract features from visible regions [4,5,6,7,8,9,10,11,12,13,14,15,16] and (b) those that adopt attention mechanisms to dynamically focus on unobstructed body regions [17,18,19,20,21]. Among these, approaches guided by external cues generally achieve better performance under occlusion. Motivated by this observation and the effectiveness of attention-based strategies, we propose a parallel channel and spatial attention (PCSA) block. This module enhances the model’s ability to distinguish between occluders and pedestrian body regions, thereby improving its performance in occluded scenarios.

Despite the promising performance of existing methods, a key limitation remains: a pronounced data imbalance in current occluded person Re-ID datasets, where unoccluded images greatly outnumber occluded ones [22,23]. This imbalance hampers the model’s ability to learn robust representations for occlusion handling, ultimately limiting its recognition accuracy.

To address this challenge, we draw inspiration from recent advances in occlusion-aware data augmentation [22,24,25,26,27,28] and propose two complementary strategies that go beyond conventional methods relying solely on external guidance cues.

First, we employ an instance segmentation model [29] to extract pedestrians and various occluding objects from auxiliary pedestrian datasets. These segmented instances are then composited with the target training set to generate synthetic occluded images, as illustrated in Figure 1. Building upon the framework introduced in [10], which utilizes human parsing information, we further incorporate external parsing labels [30] and added occluders to construct occlusion-aware parsing labels. These labels serve as supervision signals to improve the model’s ability to recognize occluded body parts and reduce the influence of occlusion noise.

Second, we enhance occlusion perception through a reconstruction-based learning strategy. Specifically, we introduce an occlusion-aware recovery (OAR) module that learns to reconstruct the original, unoccluded version of synthetically occluded images. This process implicitly encourages the model to localize occluded regions and infer missing semantic information, akin to human visual inference, thereby further boosting its robustness in the presence of occlusion.

Overall, building upon the framework proposed in [10], we design a dual-branch occlusion-aware module that explicitly and implicitly guides the model to perceive and handle occlusions.

During the initial training phase, occluding objects are manually added to target pedestrian images to simulate real-world occlusion scenarios. The first branch incorporates an Occlusion-Aware Semantic Attention (OASA) module to extract semantic part features. Within this module, we introduce a parallel channel and spatial attention (PCSA) block, which enables the model to more accurately differentiate between pedestrian body regions and occlusion noise, thereby enhancing robustness under occluded conditions. Additionally, occluder masks are extracted and combined with external human parsing labels to generate occlusion-aware parsing labels that serve as supervision signals. This supervision encourages the model to focus on visible regions while explicitly learning the occlusion structure.

The second branch features an occlusion-aware recovery (OAR) module, which learns to reconstruct occluded pedestrians back to their original, unoccluded appearance. This reconstruction process implicitly guides the model to localize occluded areas and recover missing semantic information, analogous to the human visual system’s ability to infer complete structures from partial observations. As a result, the model achieves enhanced occlusion perception and improved generalization in occluded scenarios.

The OAR module is designed to restore occluded semantic features to their unoccluded counterparts, providing complementary guidance for the OASA module to extract more complete and robust part-level representations. Together, the OASA and OAR modules form a tightly integrated dual-branch architecture that unifies attention-based refinement with implicit feature recovery. Unlike previous methods that rely solely on attention mechanisms or explicit human parsing guidance, our approach jointly addresses the localization of visible parts and the restoration of occluded regions, offering a more comprehensive and robust solution to occlusion challenges in person re-identification.

This paper makes the following main contributions:

We propose a dual-branch occlusion-aware network (DOAN) for occluded person re-identification, which explicitly and implicitly enhances the model’s ability to perceive and handle occlusions by integrating structural supervision with semantic feature recovery.
We design an Occlusion-Aware Semantic Attention (OASA) module to extract semantic part features under occlusion, introducing a novel parallel channel and spatial attention (PCSA) block to precisely distinguish pedestrian body regions from occlusion noise.
We present a parsing-guided supervision strategy that combines occluder masks with external human parsing labels to generate occlusion-aware parsing labels, providing explicit supervision for learning occlusion structure.
We develop an occlusion-aware recovery (OAR) module to reconstruct occluded pedestrians, enabling implicit localization of occluded regions and recovery of missing information for occlusion-invariant feature learning.
Extensive experiments on occluded, partial, and holistic person re-identification benchmarks demonstrate the superior performance and robustness of the proposed DOAN framework.

The remainder of this paper is organized as follows: Section 2 reviews the existing person Re-ID approaches. Section 3 details our proposed dual-branch occlusion-aware semantic part-features extraction network (DOAN). Section 4 presents the experimental results and analysis, and Section 5 concludes the paper.

2. Related Work

Person re-identification (Re-ID) is a computer vision task aimed at matching images of the same individual captured by non-overlapping cameras within surveillance networks. With the rapid development of deep learning techniques, substantial advancements have been achieved in holistic person re-identification [31,32,33,34]. Nonetheless, occluded pedestrians are far more common in practical surveillance environments, rendering occlusion handling a critical challenge in Re-ID research. The problem of occluded person re-identification was first introduced by [22], which subsequently spurred a variety of follow-up studies. These approaches can be broadly divided into two categories: part-based methods and occlusion augmentation methods.

2.1. Part-Based Methods

Extracting features from visible body parts is considered effective in occluded scenes. Current approaches generally fall into two categories. (a) Methods that utilize external cues [4,5,6,7,8,9,10,11,12,13,14,15,16], such as pose estimation, semantic parsing, or other external models, to facilitate feature extraction. Ref. [4] tackles occlusion by employing a pose estimation model to create pose landmark attention maps that emphasize non-occluded body parts. The authors of [6] leverage pose estimation for feature localization and then address occlusion through adaptive graph convolutions and cross-graph alignment. To address feature misalignment resulting from occlusions and viewpoint variations, ref. [9] proposes a keypoint-aware semantic alignment module, which leverages pedestrian keypoint information to enhance the alignment of local features across images. Ref. [10] proposes a semantic classifier guided by external semantic cues, tackling occlusion through selective matching of visible local features while explicitly disregarding occluded regions. (b) Methods that apply attention mechanisms to direct the model’s focus toward visible regions [17,18,19,20,21]. Ref. [17] addresses occlusion by leveraging attention learning to capture part prototypes and discover discriminative body parts. Ref. [20] introduces a multi-head self-attention network designed to suppress irrelevant noise and extract essential local features from images. Although existing methods have achieved significant progress, they still face difficulties due to the severe data imbalance in current occluded pedestrian datasets, which limits their capacity to handle diverse occlusion scenarios. To overcome this, our proposed method integrates a dual-branch occlusion-aware network designed to suppress occlusion-induced noise, thereby improving the part-based model’s robustness under diverse occlusion scenarios.

2.2. Occlusion Augmentation Methods

The significant imbalance between occluded and non-occluded samples in pedestrian training datasets impedes the model’s capacity to learn robust feature representations for occluded cases. Numerous studies [22,24,25,26,27,28] have enhanced the model’s capability to handle occlusion challenges by employing occlusion augmentation. Ref. [24] introduces a feature diffusion model with an occlusion augmentation and erasing strategy to enable the model to differentiate various occlusion scenarios and accurately identify target pedestrians. Ref. [25] tackles the occlusion issues by integrating an attention mechanism with occlusion augmentation, enabling precise body part capture across diverse occlusion scenarios. Occlusion augmentation has proven highly effective. Motivated by this, we introduce a two-step occlusion-aware strategy. First, we generate synthetic occluded images using an external instance segmentation model [29] and combine them with human parsing labels [30] to create occlusion-aware supervision signals (Figure 1). This guides the model to focus on visible body parts and ignore occlusion noise. Second, we design an occlusion-aware recovery module that reconstructs occluded images, helping the model infer missing information. Together, these two components significantly enhance the model’s robustness in occlusion scenarios.

3. Methods

The overall framework of our dual-branch occlusion-aware semantic part-features extraction network (DOAN) is illustrated in Figure 2. It comprises an occlusion augmentation strategy and two occlusion-aware modules: the Occlusion-Aware Semantic Attention (OASA) module and Occlusion-Aware Recovery (OAR) module. Section 3.1 outlines the occlusion augmentation strategy’s principles. Section 3.2 details the OASA module and the extraction of semantic body part features. Section 3.3 covers the OAR module. Section 3.4 describes the overall training and inference procedure.

3.1. Occlusion Augmentation Strategy

To enhance the model’s ability to handle occlusion scenarios, we adopt the segmentation method from [29] to extract pedestrians and occluding objects from auxiliary pedestrian datasets, as illustrated in Figure 3. These segmented instances are stored in an occlusion bank. During training, occlusions are simulated by randomly sampling from this bank and overlaying the selected instances at random positions on pedestrian images. For example, occluders extracted from the Market-1501 dataset are used to generate synthetic occluded samples for the Occluded-Duke dataset, and vice versa. These synthetic samples are then employed in the subsequent occlusion-aware modules to improve the model’s robustness against occlusion.
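The paste step of this augmentation can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' implementation: images are small nested lists rather than tensors, each bank entry is a hypothetical `(patch, patch_mask)` pair produced by the segmentation model, and the top-left paste position is drawn uniformly over the image. The function names `paste_occluder` and `random_occlusion` are illustrative only.

```python
import random


def paste_occluder(image, patch, patch_mask, top, left):
    """Overlay `patch` on `image` at (top, left) wherever `patch_mask` is 1.

    image:      H x W grid of pixel values (nested lists)
    patch:      h x w grid of occluder pixel values
    patch_mask: h x w grid of 0/1 values (1 = occluder pixel)
    Returns the occluded image and a full-size H x W occluder mask.
    """
    height, width = len(image), len(image[0])
    occluded = [row[:] for row in image]  # copy so the original stays intact
    full_mask = [[0] * width for _ in range(height)]
    for i, (patch_row, mask_row) in enumerate(zip(patch, patch_mask)):
        for j, (value, inside) in enumerate(zip(patch_row, mask_row)):
            y, x = top + i, left + j
            # Pixels that fall outside the image are clipped.
            if inside and 0 <= y < height and 0 <= x < width:
                occluded[y][x] = value
                full_mask[y][x] = 1
    return occluded, full_mask


def random_occlusion(image, bank, rng):
    """Sample one (patch, mask) pair from the bank and paste it at a random position."""
    patch, patch_mask = rng.choice(bank)
    top = rng.randrange(len(image))      # assumption: uniform top-left position
    left = rng.randrange(len(image[0]))
    return paste_occluder(image, patch, patch_mask, top, left)


# Toy 4 x 4 grayscale image and a 2 x 2 occluder whose top-right pixel is transparent.
image = [[0] * 4 for _ in range(4)]
patch = [[9, 9], [9, 9]]
patch_mask = [[1, 0], [1, 1]]
occluded, occ_mask = paste_occluder(image, patch, patch_mask, top=1, left=2)
```

The full-size mask returned alongside the occluded image is what later allows the parsing labels to be filtered under the pasted occluder.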

3.2. Occlusion-Aware Semantic Attention and Semantic Features Extraction

To enhance the model’s occlusion perception, we propose the Occlusion-Aware Semantic Attention (OASA) module as the first occlusion-aware branch. Initially, we extract multi-scale feature maps $\{F_{i}\}_{i=1}^{4}$, where each $F_{i} \in \mathbb{R}^{\frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}} \times 2^{i-1} C}$, using the Swin Transformer [35]. Here, H and W denote the height and width of the input image, respectively, and C represents the channel dimension of the first stage of the Swin Transformer. Next, we upsample these feature maps to a consistent spatial size and concatenate them along the channel dimension. A $1 \times 1$ convolutional layer is employed as the fusion block to fuse the concatenated features into $F_{n1} \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times 8C}$, as shown in Figure 4.
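To make the tensor shapes concrete, the short sketch below evaluates the stage shapes for a 256 × 128 input. The value C = 128 is an assumption corresponding to the Swin-B configuration; the helper name `swin_stage_shapes` is illustrative.

```python
def swin_stage_shapes(height, width, base_channels, num_stages=4):
    """Return (H_i, W_i, C_i) for stages i = 1..num_stages, following the definition of F_i."""
    shapes = []
    for i in range(1, num_stages + 1):
        stride = 2 ** (i + 1)
        shapes.append((height // stride, width // stride, base_channels * 2 ** (i - 1)))
    return shapes


C = 128  # assumption: first-stage width of Swin-B
shapes = swin_stage_shapes(256, 128, C)

# Upsampling every map to the stage-1 resolution and concatenating adds up the channels:
# C + 2C + 4C + 8C = 15C.
concat_channels = sum(channels for _, _, channels in shapes)

# The 1x1 fusion convolution then maps 15C channels to 8C at H/4 x W/4.
fused_shape = (shapes[0][0], shapes[0][1], 8 * C)
```

Under these assumptions the fusion block reduces a 1920-channel concatenation to a 64 × 32 × 1024 map.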

Motivated by the proven effectiveness of attention mechanisms [36,37], we design a PCSA (parallel channel and spatial attention) block, as illustrated in Figure 5, to enable more precise discrimination between occluding noise and the pedestrian’s body regions. The PCSA block consists of two submodules: parallel channel attention and spatial attention.

The parallel channel attention facilitates interactions across channel and spatial dimensions. The input feature map $F_{n1}$ is processed via two parallel branches. In the first branch, interactions occur between the channel and height dimensions: the feature map is rotated 90° counterclockwise along the height axis using a permutation operation. Adaptive Average Pooling and Adaptive Max Pooling are then applied in parallel. Afterward, the height and width dimensions are permuted back, and the features pass through a $3 \times 3$ convolutional layer, followed by Batch Normalization (BN) and a sigmoid activation to generate attention weights. These weights are applied to the original input feature map via element-wise multiplication to produce an output feature map. The second branch performs analogous processing for the interaction between the channel and width dimensions. The outputs from both branches are subsequently combined via element-wise summation to yield fused feature maps.

Following this, the spatial attention module is applied to highlight crucial spatial regions. The fused feature maps generated by the parallel channel attention module undergo both max pooling and average pooling operations. The resulting pooled features are concatenated and passed through a $3 \times 3$ convolutional layer. Finally, a sigmoid activation generates spatial attention weights, which are multiplied element-wise with the fused feature maps to produce the final output. This refined feature map is then fed into the pixel classifier for body part region classification.
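The spatial attention step can be illustrated with a dependency-free sketch. It assumes that the max and average pooling are taken across the channel axis (as in common spatial-attention designs), that the 3 × 3 convolution uses zero padding and no bias, and that features are nested lists of shape C × H × W; none of these details are specified in the text, and the function name is illustrative.

```python
import math


def spatial_attention(feature, kernel):
    """Reweight a C x H x W feature map with a single-channel spatial attention map.

    feature: list of C channels, each an H x W nested list
    kernel:  2 x 3 x 3 weights of the convolution over the [max, mean] maps
    Returns the reweighted feature map and the H x W attention weights.
    """
    channels, height, width = len(feature), len(feature[0]), len(feature[0][0])
    # Pool across channels: one max map and one average map, each H x W.
    max_map = [[max(feature[c][y][x] for c in range(channels)) for x in range(width)]
               for y in range(height)]
    avg_map = [[sum(feature[c][y][x] for c in range(channels)) / channels
                for x in range(width)] for y in range(height)]
    pooled = [max_map, avg_map]

    def conv_at(y, x):
        # 3x3 convolution with zero padding over the two pooled maps.
        total = 0.0
        for m in range(2):
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < height and 0 <= xx < width:
                        total += kernel[m][dy + 1][dx + 1] * pooled[m][yy][xx]
        return total

    # Sigmoid turns the convolution response into weights in (0, 1).
    weights = [[1.0 / (1.0 + math.exp(-conv_at(y, x))) for x in range(width)]
               for y in range(height)]
    output = [[[feature[c][y][x] * weights[y][x] for x in range(width)]
               for y in range(height)] for c in range(channels)]
    return output, weights


# Toy input with C = 2, H = 1, W = 2 and an all-zero kernel (every weight becomes 0.5).
feature = [[[1.0, 2.0]], [[3.0, 0.0]]]
zero_kernel = [[[0.0] * 3 for _ in range(3)] for _ in range(2)]
output, weights = spatial_attention(feature, zero_kernel)
```

In a trained network the kernel is learned, so locations dominated by occluders can receive weights close to zero while body regions are preserved.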

Subsequently, a pixel classifier, implemented as a second $1 \times 1$ convolution layer parameterized by $P \in \mathbb{R}^{(K+1) \times 8C}$ and followed by a softmax activation, is applied to classify each pixel in $F_{n1}$ into one of K predefined body parts or the background. This classification is supervised by external human parsing labels. The pixel classifier outputs classification scores $M \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times (K+1)}$, where for each spatial location $(h, w)$, the highest probability in M corresponds to the predicted body part class. The classification scores are optimized using a body part attention loss $L_{pa}$ [10], which is a cross-entropy loss with label smoothing [38], defined as

L_{pa} = - \sum_{k=0}^{K} \sum_{h=0}^{\frac{H}{4}-1} \sum_{w=0}^{\frac{W}{4}-1} q_{k} \cdot \log (M_{k}(h, w)),

(1)

where

q_{k} = \begin{cases} 1 - \frac{N-1}{N} \varepsilon, & \text{if } Y(h, w) = k \\ \frac{\varepsilon}{N}, & \text{otherwise} \end{cases}

Here, $Y \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4}}$ denotes the human parsing labels obtained from [30], $N = K + 1$ is the number of classes (the K body parts plus the background), so that the smoothed targets $q_{k}$ sum to one, $\varepsilon$ is the label smoothing regularization coefficient used to prevent the model from becoming over-confident by assigning a small probability mass to non-target classes, and $M_{k}(h, w)$ represents the predicted probability of the k-th part at spatial location $(h, w)$.
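As a worked illustration of Equation (1), the following sketch evaluates the label-smoothed cross-entropy over a small grid of pixels. It assumes that the smoothing denominator N equals the number of classes K + 1, which is the choice that makes the target distribution sum to one; the function names are illustrative.

```python
import math


def smoothed_targets(label, num_classes, eps):
    """Label-smoothed target distribution q for a single pixel."""
    on_value = 1.0 - (num_classes - 1) / num_classes * eps
    off_value = eps / num_classes
    return [on_value if k == label else off_value for k in range(num_classes)]


def part_attention_loss(probs, labels, eps=0.1):
    """Sum of smoothed cross-entropy terms over all pixels, as in Equation (1).

    probs[h][w]  is a softmax vector of length K + 1
    labels[h][w] is an integer class index in 0..K (0 = background)
    """
    loss = 0.0
    for prob_row, label_row in zip(probs, labels):
        for pixel_probs, label in zip(prob_row, label_row):
            targets = smoothed_targets(label, len(pixel_probs), eps)
            loss -= sum(q * math.log(p) for q, p in zip(targets, pixel_probs))
    return loss


# One pixel, K = 2 body parts plus background, uniform prediction.
q = smoothed_targets(label=1, num_classes=3, eps=0.1)
loss = part_attention_loss([[[1 / 3, 1 / 3, 1 / 3]]], [[1]])
```

For a uniform prediction the loss equals log 3 regardless of the label, because the smoothed targets sum to one.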

To further strengthen occlusion awareness, we employ external occlusion masks $M_{ex\_occ}$ (as described in Section 3.1) to filter the original human parsing labels, producing occlusion-aware parsing labels $Y_{occ\_aware}$, as illustrated in Figure 6. This procedure enhances the model’s sensitivity to occluded regions and equips the pixel classifier with occlusion perception capabilities, thereby establishing an occlusion-aware mechanism within the OASA module.
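The filtering step amounts to an element-wise product of the parsing labels with the inverted occluder mask, which is how it appears in the training procedure. A minimal sketch, assuming the mask has already been resized to the label resolution and that class 0 denotes the background:

```python
def occlusion_aware_labels(parsing, occluder_mask):
    """Compute Y * (1 - M_occ) element-wise: pixels under the occluder become background (0)."""
    return [[label * (1 - occluded) for label, occluded in zip(label_row, mask_row)]
            for label_row, mask_row in zip(parsing, occluder_mask)]


# 2 x 2 toy example: body-part labels 1..3, occluder covering the anti-diagonal.
parsing = [[1, 2], [3, 0]]
occluder_mask = [[0, 1], [1, 0]]
aware = occlusion_aware_labels(parsing, occluder_mask)
```

Body-part labels survive only where the occluder mask is zero, so the pixel classifier is trained to treat pasted occluders as background.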

Next, to extract semantic part representations, we leverage the classification scores from the pixel classifier to derive body part masks $\{M_{k}\}_{k=1}^{K}$ (excluding the background class, $k = 0$). These masks are element-wise multiplied with the feature map $F_{n2} \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times 8C}$, produced by another fusion block, to extract semantic part features $\{f_{p_{k}}\}_{k=1}^{K}$, as illustrated in Figure 2. Then, by applying a max operation across the body part masks, we generate a foreground mask that isolates the foreground feature $f_{f}$. Finally, we concatenate all part features to form the concatenated part feature $f_{c}$ and take the unmasked feature map $F_{n2}$ as the global feature $f_{g}$. The foreground feature, concatenated part feature, and global feature, denoted as $\{f_{f}, f_{c}, f_{g}\}$, are average-pooled and used to compute the identity loss as follows:

L_{id} = \sum_{i \in \{f, c, g\}} L_{CE}(f_{i}),

(2)

where $L_{CE}$ denotes the cross-entropy loss with label smoothing and the BNNeck trick [39].
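The masking and pooling steps that produce the part, foreground, concatenated, and global features can be sketched as follows. This is a simplified illustration that assumes plain average pooling over all spatial positions and nested-list features of shape C × H × W; the function names are hypothetical.

```python
def masked_average(feature, mask):
    """Average-pool (mask * feature) over all H x W positions, one value per channel."""
    height, width = len(mask), len(mask[0])
    return [sum(channel[y][x] * mask[y][x] for y in range(height) for x in range(width))
            / (height * width) for channel in feature]


def extract_features(feature, part_masks):
    """Return per-part features, foreground, concatenated, and global feature vectors."""
    height, width = len(part_masks[0]), len(part_masks[0][0])
    part_feats = [masked_average(feature, mask) for mask in part_masks]
    # Foreground mask: pixel-wise maximum over the K body-part masks.
    fg_mask = [[max(mask[y][x] for mask in part_masks) for x in range(width)]
               for y in range(height)]
    ones = [[1.0] * width for _ in range(height)]
    f_f = masked_average(feature, fg_mask)
    f_c = [value for part in part_feats for value in part]  # concatenation of part features
    f_g = masked_average(feature, ones)                      # unmasked global feature
    return part_feats, f_f, f_c, f_g


# Toy input: C = 1, H = 1, W = 2, with K = 2 soft part masks.
feature = [[[2.0, 4.0]]]
part_masks = [[[1.0, 0.0]], [[0.0, 0.5]]]
part_feats, f_f, f_c, f_g = extract_features(feature, part_masks)
```

Concatenating the pooled part vectors gives the same result as pooling a channel-wise concatenation of the masked maps, so the order of the two operations does not matter in this sketch.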

For the part features $\{f_{p_{k}}\}_{k=1}^{K}$, we apply the part average triplet loss [10] defined by

d_{parts}^{ij} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{dist}_{eucl}(f_{p_{k}}^{i}, f_{p_{k}}^{j}),

(3)

L_{tri}^{parts}(f_{p_{1}}, \dots, f_{p_{K}}) = [d_{parts}^{ap} - d_{parts}^{an} + \alpha]_{+},

(4)

where $\mathrm{dist}_{eucl}$ denotes the Euclidean distance. For two samples i and j within the same mini-batch, we average the Euclidean distances between their corresponding part features over the K parts. Then, the part-averaged distances from the anchor to the hardest positive sample, denoted $d_{parts}^{ap}$, and to the hardest negative sample, denoted $d_{parts}^{an}$, are calculated. The loss enforces that $d_{parts}^{ap}$ is at least a margin $\alpha$ smaller than $d_{parts}^{an}$.
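Equations (3) and (4) can be illustrated with a small batch-hard sketch. Averaging the hinge terms over all anchors that have both a positive and a negative in the batch is an assumption made here for completeness; the equations only define the per-anchor term. Function names are illustrative.

```python
import math


def part_distance(parts_i, parts_j):
    """Equation (3): Euclidean distance averaged over the K part features."""
    return sum(math.dist(a, b) for a, b in zip(parts_i, parts_j)) / len(parts_i)


def part_triplet_loss(features, identities, margin=0.3):
    """Batch-hard triplet loss on part-averaged distances (Equation (4)).

    features[n] is a list of K part vectors for sample n; identities[n] is its label.
    """
    losses = []
    for a, (anchor, anchor_id) in enumerate(zip(features, identities)):
        positives = [part_distance(anchor, other)
                     for b, (other, other_id) in enumerate(zip(features, identities))
                     if other_id == anchor_id and b != a]
        negatives = [part_distance(anchor, other)
                     for other, other_id in zip(features, identities)
                     if other_id != anchor_id]
        if positives and negatives:
            # Hardest positive is the farthest; hardest negative is the closest.
            losses.append(max(0.0, max(positives) - min(negatives) + margin))
    return sum(losses) / len(losses) if losses else 0.0


# Three samples with K = 2 two-dimensional part features; samples 0 and 1 share an identity.
s0 = [[0.0, 0.0], [0.0, 0.0]]
s1 = [[3.0, 4.0], [0.0, 0.0]]
s2 = [[0.0, 0.0], [6.0, 8.0]]
loss = part_triplet_loss([s0, s1, s2], [0, 0, 1], margin=3.0)
```

With the margin of 3.0 used in this toy call, only the first anchor violates the margin, so the batch loss is 0.25.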

Additionally, to further suppress occlusion noise, binary visibility scores are assigned to each part feature, where $v_{i} = 1$ indicates visibility and $v_{i} = 0$ denotes occlusion. These scores are applied exclusively during inference to ensure that comparisons are performed only between mutually visible body parts in the query and gallery images. For the foreground, concatenated part, and global features $\{f_{f}, f_{c}, f_{g}\}$, visibility scores are set to one: $v_{f} = v_{c} = v_{g} = 1$. For each part feature $f_{p_{i}}$ with $i \in \{1, \dots, K\}$, the visibility score $v_{i}$ is set to 1 if the corresponding body part mask $M_{i}$ is present; otherwise, it is set to 0, indicating that the part is occluded.

3.3. Occlusion-Aware Recovery

To further enhance the model’s occlusion perception capability, we propose an occlusion-aware recovery (OAR) module as a second occlusion-aware branch, illustrated in Figure 7. This module takes the final-stage feature $F_{4} \in \mathbb{R}^{\frac{H}{32} \times \frac{W}{32} \times 8C}$ extracted by the Swin Transformer as input. A decoder, consisting of a convolutional layer that expands the feature channel dimension followed by a reshape operation, then reconstructs $F_{4}$ into a recovered image $I_{rec} \in \mathbb{R}^{H \times W \times 3}$ matching the original input image size. We employ the Smooth L1 Loss as the reconstruction loss $L_{rec}$ to supervise the recovered image $I_{rec}$ toward the original training image $I_{orig} \in \mathbb{R}^{H \times W \times 3}$. This encourages the model to better detect and interpret occlusions in a manner similar to human visual perception. The reconstruction loss $L_{rec}$ is defined as follows:

L_{pixel}(y_{1}, y_{2}) = \begin{cases} \frac{1}{2} (y_{1} - y_{2})^{2}, & |y_{1} - y_{2}| \leq 1 \\ |y_{1} - y_{2}| - \frac{1}{2}, & \text{otherwise} \end{cases}

(5)

L_{rec} = \frac{\sum_{c=1}^{3} \sum_{h=1}^{H} \sum_{w=1}^{W} L_{pixel}(I_{rec}(c, h, w), I_{orig}(c, h, w))}{3 \times H \times W}

(6)
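Equations (5) and (6) translate directly into code. The sketch below operates on flattened sequences of pixel values for brevity, which is an illustrative simplification; the mean over all values corresponds to the division by 3 × H × W.

```python
def smooth_l1(y1, y2):
    """Equation (5): quadratic for small errors, linear for large ones."""
    diff = abs(y1 - y2)
    return 0.5 * diff * diff if diff <= 1.0 else diff - 0.5


def reconstruction_loss(recovered, original):
    """Equation (6): mean Smooth L1 over all values of the (flattened) images."""
    if len(recovered) != len(original):
        raise ValueError("images must have the same number of values")
    return sum(smooth_l1(r, o) for r, o in zip(recovered, original)) / len(recovered)


# Two values: a small error (quadratic branch) and a large error (linear branch).
loss = reconstruction_loss([0.5, 3.0], [0.0, 0.0])
```

The two branches meet at an absolute error of 1, where both evaluate to 0.5, so the loss is continuous while remaining less sensitive to large pixel errors than a squared loss.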

3.4. Overall Training and Inference Procedure

3.4.1. Training

The detailed training procedure of DOAN is described in Algorithm 1. The overall loss function guiding the network optimization is defined as follows:

L = L_{id} + L_{tri}^{parts}(f_{p_{1}}, \dots, f_{p_{K}}) + \lambda L_{pa} + L_{rec}

(7)

where $L_{id}$ denotes the identity loss, $L_{tri}^{parts}$ represents the part average triplet loss, and $L_{pa}$ corresponds to the body part attention loss. The hyperparameter $\lambda$ balances the contribution of the body part attention loss in the overall objective and is empirically set to 0.4. Finally, $L_{rec}$ is the reconstruction loss of the OAR module.

Algorithm 1: Training Procedure of DOAN

Input: Original image $I_{orig}$, occlusion-augmented image $I_{occ}$, parsing labels $Y$, external occlusion mask $M_{ex\_occ}$, hyperparameters $\lambda$, $\alpha$

1: Feature Extraction:
2: Extract multi-scale features $\{F_{1}, F_{2}, F_{3}, F_{4}\} \leftarrow \mathrm{SwinTransformer}(I_{occ})$
3: OASA Module:
4: $F_{concat} \leftarrow \mathrm{Concat}(\mathrm{Upsample}(\{F_{i}\}))$
5: $F_{n1} \leftarrow \mathrm{Conv}_{1 \times 1}(F_{concat})$
6: $F_{n1} \leftarrow \mathrm{PCSA}(F_{n1})$
7: $M \leftarrow \mathrm{PixelClassifier}(F_{n1})$
8: $Y_{occ\_aware} \leftarrow Y \odot (1 - M_{ex\_occ})$
9: $L_{pa} \leftarrow \mathrm{CrossEntropy}(M, Y_{occ\_aware})$ with label smoothing
10: Obtain part masks $\{M_{k}\}_{k=1}^{K}$ from $M$
11: $F_{n2} \leftarrow \mathrm{Conv}_{1 \times 1}(F_{concat})$
12: $f_{p_{k}} \leftarrow M_{k} \odot F_{n2}, \forall k = 1, \dots, K$
13: $f_{f} \leftarrow \max_{k}(M_{k}) \odot F_{n2}$
14: $f_{c} \leftarrow \mathrm{Concat}(\{f_{p_{k}}\}_{k=1}^{K})$
15: $f_{f} \leftarrow \mathrm{AvgPool}(f_{f})$, $f_{c} \leftarrow \mathrm{AvgPool}(f_{c})$, $f_{g} \leftarrow \mathrm{AvgPool}(F_{n2})$
16: $L_{id} \leftarrow \sum_{i \in \{f, c, g\}} \mathrm{CrossEntropy}(f_{i})$ with label smoothing
17: $L_{tri}^{parts} \leftarrow \mathrm{TripletLoss}(\{f_{p_{k}}\}, \alpha)$
18: OAR Module:
19: $I_{rec} \leftarrow \mathrm{Decoder}(F_{4})$
20: $L_{rec} \leftarrow \mathrm{SmoothL1}(I_{rec}, I_{orig})$
21: Total Loss:
22: $L \leftarrow L_{id} + L_{tri}^{parts} + \lambda \cdot L_{pa} + L_{rec}$
23: Update model parameters via backpropagation

3.4.2. Inference

The detailed inference procedure of DOAN is presented in Algorithm 2. During inference, the occlusion-aware recovery module and the external human parsing model are not utilized. Instead, semantic features are extracted solely using the Occlusion-Aware Semantic Attention module. To compute the distance between a query q and a gallery sample g, we adopt the visibility-based part-to-part matching strategy [10], relying exclusively on the foreground feature and the body part features.

\mathrm{dist}_{total}^{qg} = \frac{v_{f}^{q} \cdot v_{f}^{g} \cdot \mathrm{dist}_{eucl}(f_{f}^{q}, f_{f}^{g}) + \sum_{i=1}^{K} v_{i}^{q} \cdot v_{i}^{g} \cdot \mathrm{dist}_{eucl}(f_{p_{i}}^{q}, f_{p_{i}}^{g})}{v_{f}^{q} \cdot v_{f}^{g} + \sum_{i=1}^{K} v_{i}^{q} \cdot v_{i}^{g}}.

(8)

Visibility scores $v_{i}^{q|g}$ are applied to ensure comparisons are made only between mutually visible body parts.
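The visibility-weighted matching of Equation (8) can be sketched as follows. The data layout is an assumption for illustration: each sample is given as a list of feature vectors whose first entry is the foreground feature and whose remaining entries are the K part features, together with a parallel list of 0/1 visibility scores. Returning infinity when no feature is mutually visible is a guard added here; it cannot trigger when the foreground visibility is fixed to one.

```python
import math


def total_distance(query_feats, query_vis, gallery_feats, gallery_vis):
    """Equation (8): average Euclidean distance over mutually visible features."""
    numerator, denominator = 0.0, 0
    for f_q, v_q, f_g, v_g in zip(query_feats, query_vis, gallery_feats, gallery_vis):
        both_visible = v_q * v_g  # 1 only if the feature is visible in both images
        numerator += both_visible * math.dist(f_q, f_g)
        denominator += both_visible
    return numerator / denominator if denominator else math.inf


# Foreground plus K = 2 parts; the second part is occluded in the query and is ignored.
query_feats = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
query_vis = [1, 1, 0]
gallery_feats = [[3.0, 4.0], [6.0, 8.0], [100.0, 100.0]]
gallery_vis = [1, 1, 1]
distance = total_distance(query_feats, query_vis, gallery_feats, gallery_vis)
```

In this example the large distance on the occluded part does not affect the result, which is the mean of the two remaining distances.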

Algorithm 2: Inference Procedure of DOAN

4. Experimental Results and Analysis

4.1. Datasets and Evaluation Protocol

To validate the effectiveness of our approach, we conduct experiments on six widely used person Re-ID datasets: two occluded, two partial, and two holistic datasets.

Occluded-Duke [4] is a subset of DukeMTMC specifically designed for occluded person re-ID. It comprises 15,618 training images of 702 identities, 2210 occluded query images of 519 identities, and 17,661 gallery images of 1110 identities. This dataset is considered one of the most challenging benchmarks for occluded re-ID tasks.

Occluded-ReID [22] is collected using mobile devices and includes 2000 images of 200 identities. Each identity has five holistic and five occluded images, making it suitable for evaluating performance under occlusion.

Partial-REID [40] contains 600 images of 60 individuals captured in various campus scenes and from different viewpoints. Each identity has five partial images, which serve as queries, and five full-body images, which form the gallery.

Partial-iLIDS [41] consists of 238 images of 119 individuals. The images were captured at an airport by multiple non-overlapping cameras; all query images are occluded, while the gallery images are unoccluded.

Market-1501 [42] is a large-scale benchmark for holistic person Re-ID. It contains 1501 identities captured across six cameras, with 12,936 training images of 751 persons, 3368 query images, and 19,732 gallery images of 750 persons.

DukeMTMC-ReID [43] contains 1404 identities captured by eight different cameras. It provides 16,522 training images of 702 identities, along with 2228 query images of the other 702 identities and 17,661 gallery images.

Evaluation Protocol. We evaluate model performance using two standard metrics: Cumulative Matching Characteristic (CMC) and mean Average Precision (mAP). As Occluded-ReID, Partial-REID, and Partial-iLIDS lack dedicated training sets, we adopt the protocol used in previous studies [6,24,44], training the model on Market-1501 and directly testing on these three datasets.
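For readers less familiar with these metrics, the sketch below computes Rank-k accuracy (points on the CMC curve) and mAP from per-query ranked match lists. It assumes each query has already been reduced to a list of booleans ordered by increasing distance, and it omits dataset-specific filtering such as excluding gallery images taken by the query's own camera. Function names are illustrative.

```python
def rank_k_hit(matches, k):
    """1.0 if a correct gallery match appears among the top-k results, else 0.0."""
    return 1.0 if any(matches[:k]) else 0.0


def average_precision(matches):
    """Mean of the precision values measured at each correct match in the ranking."""
    hits, precision_sum = 0, 0.0
    for rank, is_match in enumerate(matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def evaluate(all_matches, ranks=(1, 3)):
    """Return Rank-k accuracies and mAP averaged over all queries."""
    num_queries = len(all_matches)
    cmc = {k: sum(rank_k_hit(m, k) for m in all_matches) / num_queries for k in ranks}
    mean_ap = sum(average_precision(m) for m in all_matches) / num_queries
    return cmc, mean_ap


# Two toy queries: the first finds its matches at ranks 2 and 3, the second at rank 1.
cmc, mean_ap = evaluate([[False, True, True, False], [True, False]])
```

Rank-1 rewards only the top result, whereas mAP also accounts for how highly the remaining correct matches are ranked.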

4.2. Implementation Details

We adopt the Swin Transformer [35] pretrained on ImageNet [45] as our backbone network, resizing all input images to $256 \times 128$. During training, images are augmented with random horizontal flipping, padding, random cropping, and random erasing. Following prior works [10,33], we set the batch size to 64 with four images per identity. The model is optimized using SGD with a momentum of 0.9 and weight decay of $1 \times 10^{-4}$. The initial learning rate is 0.008, which decays according to a cosine schedule. The label smoothing regularization rate $\varepsilon$ is set to 0.1 and the triplet loss margin $\alpha$ is set to 0.3. Training is conducted for 120 epochs. The number of body parts K is set to 8 for occluded datasets and 5 for holistic datasets. Our DOAN framework is implemented in PyTorch (v. 2.7.1), and all experiments are performed on an NVIDIA A100 GPU.
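The cosine learning-rate decay can be illustrated with the standard cosine annealing formula. The exact variant is not specified above, so this sketch assumes per-epoch updates, no warm-up phase, and a final learning rate of zero; the function name and the `min_lr` parameter are illustrative.

```python
import math


def cosine_lr(epoch, total_epochs=120, base_lr=0.008, min_lr=0.0):
    """Learning rate at a given epoch under cosine annealing from base_lr to min_lr."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rates for epochs 0..120 with the settings reported in this section.
schedule = [cosine_lr(epoch) for epoch in range(121)]
```

Under these assumptions the rate starts at 0.008, reaches half of that value at epoch 60, and decays smoothly to zero at epoch 120.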

4.3. Comparison with State-of-the-Art Methods

We conduct extensive experiments to compare our proposed DOAN method with current state-of-the-art approaches across six diverse datasets, covering occluded, partial, and holistic person re-identification tasks.

Results on Occluded Datasets. Table 1 presents the performance comparison on occluded Re-ID benchmarks. For clarity, existing methods are divided into CNN-based and Transformer-based categories. Our DOAN method achieves leading results on the Occluded-Duke dataset, attaining 76.8% Rank-1 accuracy and 64.6% mAP, outperforming all competitors. Notably, DOAN surpasses the Swin-B [35] baseline by 11.2 percentage points in Rank-1 and 11.8 percentage points in mAP. The effectiveness of DOAN is further confirmed on the Occluded-ReID dataset, where it achieves 84.9% Rank-1 accuracy and 78.8% mAP. We attribute these results to DOAN’s dual-branch occlusion-aware design, which explicitly guides the model to focus on visible pedestrian regions while implicitly encouraging the recovery of occluded parts. This design enables the model to extract discriminative features even under severe occlusion, leading to performance gains over existing CNN- and Transformer-based methods.

Results on Partial Datasets. To further assess the effectiveness of DOAN, we perform comprehensive experiments on two partial person re-identification benchmarks: Partial-REID and Partial-iLIDS. Since these datasets do not provide dedicated training sets, we follow previous studies by training our model on Market-1501 and evaluating it directly on the partial datasets. As reported in Table 2, DOAN achieves state-of-the-art results, attaining 85.7% Rank-1 and 89.0% Rank-3 accuracies on Partial-REID, as well as 78.2% Rank-1 and 91.6% Rank-3 accuracies on Partial-iLIDS. These significant improvements over existing methods validate that our approach, trained on large-scale data, is not only effective in occlusion handling but also robust in addressing the challenges of partial person re-identification.

Results on Holistic Datasets. To further validate the strong generalization ability of our proposed DOAN method, we perform extensive experiments on two widely recognized holistic person re-identification benchmarks: Market-1501 and DukeMTMC-ReID. Although these datasets contain few occluded samples and are not explicitly designed for occlusion handling, DOAN still demonstrates highly competitive performance. Specifically, it achieves 95.7% Rank-1 accuracy and 89.9% mAP on Market-1501, and 91.6% Rank-1 accuracy and 83.0% mAP on DukeMTMC-ReID, as detailed in Table 3. These findings indicate that DOAN not only excels in occluded person re-identification but also provides a robust and versatile framework for general pedestrian retrieval tasks.

4.4. Ablation Study

In this section, we conduct an ablation study to evaluate the contribution of each component in the proposed DOAN framework, specifically the Occlusion-Aware Semantic Attention (OASA) module and the occlusion-aware recovery (OAR) module. All the experiments are performed on the Occluded-Duke dataset, and the results are presented in Table 4.

Effectiveness of the OASA Module: The Occlusion-Aware Semantic Attention (OASA) module is composed of three main components: the pixel classifier (PC), the occlusion-aware human parsing label (OAL), and the parallel channel and spatial attention (PCSA) block. As shown in Table 4, comparing Index-1 and Index-2, the introduction of the pixel classifier leads to an improvement of +7.1% in Rank-1 and +6.3% in mAP over the baseline Swin-Base backbone. This gain stems from the PC’s ability to accurately isolate visible pedestrian body parts for precise matching while effectively suppressing occlusion noise.

Further, from Index-2 to Index-3, utilizing external occlusion information to generate the occlusion-aware human parsing label as supervision enhances performance by an additional +1.6% in Rank-1 and +3.2% in mAP, indicating that the occlusion-aware labels significantly boost the model’s ability to perceive occluded regions.
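The idea behind the occlusion-aware label can be illustrated with a minimal sketch: an occluder patch is pasted onto a training image, and the resulting occlusion mask is used to filter the human parsing map. The function names are hypothetical, and relabeling occluded pixels as background (class 0) is an assumption made for illustration; this section does not specify the exact filtering rule.

```python
import numpy as np

def paste_occluder(image, occluder, occluder_mask, top, left):
    """Paste an occluder patch onto an image.

    Returns the occluded image and a full-size boolean occlusion mask.
    Assumes the patch fits entirely inside the image.
    """
    out = image.copy()
    full_mask = np.zeros(image.shape[:2], dtype=bool)
    h, w = occluder_mask.shape
    region = out[top:top + h, left:left + w]          # view into `out`
    region[occluder_mask] = occluder[occluder_mask]   # overwrite only occluder pixels
    full_mask[top:top + h, left:left + w] = occluder_mask
    return out, full_mask

def occlusion_aware_labels(parsing, occ_mask, background=0):
    """Relabel occluded pixels as background so only visible parts are supervised."""
    return np.where(occ_mask, background, parsing)

# Toy example: a 4 x 4 black image whose pixels all belong to body part 2.
image = np.zeros((4, 4, 3), dtype=np.uint8)
parsing = np.full((4, 4), 2)
occluder = np.full((2, 2, 3), 255, dtype=np.uint8)
occluder_mask = np.array([[True, False], [True, True]])

occluded, occ_mask = paste_occluder(image, occluder, occluder_mask, top=1, left=1)
labels = occlusion_aware_labels(parsing, occ_mask)
```

Because the mask comes for free from the pasting step, the filtered labels need no additional annotation beyond the external parsing map.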

Comparing Index-3 and Index-4, adding the PCSA block contributes a further improvement of +0.5% in Rank-1 and +1.1% in mAP. This is because the PCSA block enables more precise discrimination between occlusion noise and valid pedestrian body regions.

Finally, Index-5 and Index-6 replace the full PCSA block of Index-4 with only the parallel channel attention (PCA) or only the spatial attention (SA), respectively. Both variants fall short of the full block: Rank-1 drops by 0.2% (PCA) and 0.4% (SA), and mAP drops by 0.8% and 1.4%. With SA alone, the mAP (62.0%) even falls below that of Index-3, which uses no attention (62.3%). This indicates that a single type of attention mechanism is insufficient. In contrast, the complete PCSA block, which integrates both PCA and SA, compensates for the limitations of each individual submodule and provides a more comprehensive attention mechanism, leading to improved occlusion handling and more robust feature extraction.
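To make the roles of the two submodules concrete, the following parameter-free sketch mimics the data flow described for the PCSA block: two parallel gates over channel-height and channel-width, followed by a per-pixel spatial gate. It is a simplified stand-in, not the authors' implementation; the actual block presumably uses learned layers, and both the pooling choices and the averaging of the two branches are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def parallel_channel_attention(x):
    """x: (C, H, W). Two parallel gates, one over (C, H) and one over (C, W)."""
    gate_ch = sigmoid(x.mean(axis=2, keepdims=True))   # (C, H, 1): pooled over width
    gate_cw = sigmoid(x.mean(axis=1, keepdims=True))   # (C, 1, W): pooled over height
    return x * 0.5 * (gate_ch + gate_cw)               # average the branches (assumption)

def spatial_attention(x):
    """x: (C, H, W). One gate per pixel from channel-wise mean and max statistics."""
    stats = 0.5 * (x.mean(axis=0) + x.max(axis=0))     # (H, W); stands in for a learned conv
    return x * sigmoid(stats)[None, :, :]

def pcsa(x):
    """Channel gating first, then spatial refinement of the gated features."""
    return spatial_attention(parallel_channel_attention(x))

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 6, 4))   # hypothetical (C, H, W) feature map
refined = pcsa(features)
```

Dropping either function from `pcsa` gives the PCA-only or SA-only variant compared in Index-5 and Index-6.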

Effectiveness of the OAR Module: According to Table 4, comparing Index-4 and Index-8, the integration of the occlusion-aware recovery (OAR) module yields an improvement of +2.0% in Rank-1 and +1.2% in mAP. This enhancement can be attributed to the OAR module’s ability to restore occluded semantic features to their unoccluded counterparts, thereby providing complementary guidance for the OASA module to extract more complete and robust part-level representations.

Moreover, when applied independently to the Swin-Base backbone (Index-1 vs. Index-7), the OAR module still achieves performance gains of +1.1% in Rank-1 and +1.6% in mAP. This demonstrates that OAR enables the model to infer missing information in occluded regions in a manner analogous to the human visual system, thereby enhancing its overall occlusion awareness.

These consistent improvements confirm the effectiveness of the OAR module in enhancing the model’s capability to perceive and recover from occlusions through its dedicated recovery mechanism.
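The gains quoted in this section can be reproduced directly from Table 4. The short script below copies the Rank-1 and mAP values from the table and computes the difference between any two configurations; the helper name `delta` is ours. Note that the differences are in percentage points.

```python
# (Rank-1, mAP) on Occluded-Duke, copied from Table 4 and keyed by Index.
table4 = {
    1: (65.6, 52.8),  # Swin-B baseline
    2: (72.7, 59.1),  # + PC
    3: (74.3, 62.3),  # + PC + OAL
    4: (74.8, 63.4),  # + PC + OAL + PCSA
    5: (74.6, 62.6),  # + PC + OAL + PCA only
    6: (74.4, 62.0),  # + PC + OAL + SA only
    7: (66.7, 54.4),  # Swin-B + OAR
    8: (76.8, 64.6),  # full DOAN
}

def delta(base, new):
    """Change in (Rank-1, mAP), in percentage points, from Index `base` to Index `new`."""
    rank1 = round(table4[new][0] - table4[base][0], 1)
    mean_ap = round(table4[new][1] - table4[base][1], 1)
    return rank1, mean_ap

for base, new in [(1, 2), (2, 3), (3, 4), (4, 5), (4, 6), (4, 8), (1, 7), (1, 8)]:
    print(f"Index-{base} -> Index-{new}: Rank-1 {delta(base, new)[0]:+.1f}, mAP {delta(base, new)[1]:+.1f}")
```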

4.5. Visualization

Figure 8 illustrates the semantic body part attention maps generated by the OASA module in our DOAN model on the Occluded-Duke dataset. It is evident that the OASA module effectively extracts semantic part features that are particularly beneficial for handling occlusion scenarios.

Figure 9 illustrates the outputs of our OAR module. As shown in samples with IDs 0039, 0332, and 0756, the OAR module is able to recover missing information to some extent. However, certain limitations remain. Since the reconstruction is directly performed using the final stage features of the Swin Transformer, the generated images exhibit blocky artifacts and still differ significantly from the original inputs. Future work may focus on improving the reconstruction quality by refining the architecture or incorporating multi-scale features.
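One way to see why decoding from final-stage features alone can look blocky is that this stage is a coarse grid: if each cell is expanded to pixels without finer-scale detail, every cell becomes a flat block. The sketch below illustrates this with nearest-neighbour repetition standing in for the decoder. The total stride of 32 and the 384 × 128 input size are assumptions made for illustration and are not taken from this section.

```python
import numpy as np

def upsample_nearest(coarse, stride):
    """Repeat every cell of a coarse (h, w) grid into a stride x stride block."""
    return np.repeat(np.repeat(coarse, stride, axis=0), stride, axis=1)

stride = 32                                              # assumed downsampling of the final stage
coarse = np.arange(12 * 4, dtype=float).reshape(12, 4)   # hypothetical 12 x 4 feature grid
image = upsample_nearest(coarse, stride)                 # shape (384, 128)

# Every 32 x 32 block is constant, so block boundaries are visible in the output.
first_block = image[:stride, :stride]
```

Features from earlier, higher-resolution stages would supply detail inside each block, which is the motivation for the multi-scale direction mentioned above.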

In addition, we present qualitative results to further illustrate the effectiveness of our method. As shown in Figure 10, we display the top-10 retrieval outcomes for the baseline model, the baseline with the OASA module, and the full DOAN framework. The visual comparisons clearly indicate that DOAN achieves significantly improved recognition performance, particularly under occlusion scenarios.

5. Conclusions

In this paper, we introduced a novel dual-branch occlusion-aware network (DOAN) designed to tackle the challenges of occluded person re-identification. DOAN integrates two core components: the Occlusion-Aware Semantic Attention (OASA) module and the occlusion-aware recovery (OAR) module. The OASA module leverages our specially designed parallel channel and spatial attention (PCSA) block to extract semantic part features, effectively distinguishing pedestrian body regions from occlusion noise. Furthermore, by combining external human parsing annotations with occluder masks, we generate occlusion-aware parsing labels that provide structural supervision, guiding the model to focus on visible body parts. Complementing this, the OAR module reconstructs occluded pedestrian images, enabling the network to implicitly localize occlusions and recover missing information in a manner akin to human visual perception.

Extensive experiments on occluded, partial, and holistic person re-identification benchmarks demonstrate the effectiveness and superiority of our DOAN framework. Despite these promising results, certain limitations remain. The dependency on external parsing annotations may restrict scalability and generalization. Moreover, the recovery module encounters difficulties with severe or rare occlusions, often producing reconstructed images with noticeable blocky artifacts that deviate from the originals.

For future work, we plan to investigate self-supervised parsing approaches to reduce reliance on external annotations, enhance reconstruction quality by refining the architecture and incorporating multi-scale features, and extend DOAN’s applicability to more complex real-world scenarios involving dynamic and multi-person occlusions.

Author Contributions

Conceptualization, B.S.; Methodology, B.S. and Y.Z.; Writing—original draft preparation, Y.Z.; Data curation, Y.Z. and J.W.; Writing—Review and Editing, B.S. and C.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Fujian Provincial Natural Science Foundation of China (2024J01157) and Fujian University of Technology Research Fund Project (GY-Z220212).

Data Availability Statement

The data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Ye, M.; Shen, J.; Lin, G.; Xiang, T.; Shao, L.; Hoi, S.C. Deep learning for person re-identification: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 2872–2893.
2. Wu, D.; Zheng, S.J.; Zhang, X.P.; Yuan, C.A.; Cheng, F.; Zhao, Y.; Lin, Y.J.; Zhao, Z.Q.; Jiang, Y.L.; Huang, D.S. Deep learning-based methods for person re-identification: A comprehensive review. Neurocomputing 2019, 337, 354–371.
3. Peng, Y.; Wu, J.; Xu, B.; Cao, C.; Liu, X.; Sun, Z.; He, Z. Deep learning based occluded person re-identification: A survey. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 20, 1–27.
4. Miao, J.; Wu, Y.; Liu, P.; Ding, Y.; Yang, Y. Pose-guided feature alignment for occluded person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 542–551.
5. Gao, S.; Wang, J.; Lu, H.; Liu, Z. Pose-guided visible part matching for occluded person reid. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11744–11752.
6. Wang, G.; Yang, S.; Liu, H.; Wang, Z.; Yang, Y.; Wang, S.; Yu, G.; Zhou, E.; Sun, J. High-order information matters: Learning relation and topology for occluded person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6449–6458.
7. Yang, J.; Zhang, J.; Yu, F.; Jiang, X.; Zhang, M.; Sun, X.; Chen, Y.C.; Zheng, W.S. Learning to know where to see: A visibility-aware approach for occluded person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 11885–11894.
8. Zheng, K.; Lan, C.; Zeng, W.; Liu, J.; Zhang, Z.; Zha, Z.J. Pose-guided feature learning with knowledge distillation for occluded person re-identification. In Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; pp. 4537–4545.
9. Wang, S.; Huang, B.; Li, H.; Qi, G.; Tao, D.; Yu, Z. Key point-aware occlusion suppression and semantic alignment for occluded person re-identification. Inf. Sci. 2022, 606, 669–687.
10. Somers, V.; De Vleeschouwer, C.; Alahi, A. Body part-based representation learning for occluded person re-identification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 1613–1623.
11. Jia, M.; Sun, Y.; Zhai, Y.; Cheng, X.; Yang, Y.; Li, Y. Semi-attention partition for occluded person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 29 January 2023; Volume 37, pp. 998–1006.
12. Dou, S.; Jiang, X.; Tu, Y.; Gao, J.; Qu, Z.; Zhao, Q.; Zhao, C. DROP: Decouple Re-Identification and Human Parsing with Task-specific Features for Occluded Person Re-identification. arXiv 2024, arXiv:2401.18032.
13. Cui, C.; Huang, S.; Song, W.; Ding, P.; Zhang, M.; Wang, D. ProFD: Prompt-Guided Feature Disentangling for Occluded Person Re-Identification. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 1583–1592.
14. Gao, S.; Yu, C.; Zhang, P.; Lu, H. Part representation learning with teacher-student decoder for occluded person re-identification. In Proceedings of the ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 2660–2664.
15. Somers, V.; Alahi, A.; De Vleeschouwer, C. Keypoint promptable re-identification. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2024; pp. 216–233.
16. Li, Z.; Zhang, H.; Zhu, L.; Sun, J.; Liu, L. MSPL: Multi-granularity Semantic Prototype Learning for occluded person re-identification. Neurocomputing 2025, 634, 129894.
17. Li, Y.; He, J.; Zhang, T.; Liu, X.; Zhang, Y.; Wu, F. Diverse part discovery: Occluded person re-identification with part-aware transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 2898–2907.
18. Jia, M.; Cheng, X.; Lu, S.; Zhang, J. Learning disentangled representation implicitly via transformer for occluded person re-identification. IEEE Trans. Multimed. 2022, 25, 1294–1305.
19. Tan, L.; Dai, P.; Ji, R.; Wu, Y. Dynamic prototype mask for occluded person re-identification. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 531–540.
20. Tan, H.; Liu, X.; Yin, B.; Li, X. MHSA-Net: Multihead self-attention network for occluded person re-identification. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 8210–8224.
21. Cheng, X.; Jia, M.; Wang, Q.; Zhang, J. More is better: Multi-source dynamic parsing attention for occluded person re-identification. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 6840–6849.
22. Zhuo, J.; Chen, Z.; Lai, J.; Wang, G. Occluded person re-identification. In Proceedings of the 2018 IEEE International Conference on Multimedia and Expo (ICME), San Diego, CA, USA, 23–27 July 2018; pp. 1–6.
23. Zhang, L.; Cheng, S.; Wang, L. Label-guided diversified learning model for occluded person re-identification. Expert Syst. Appl. 2025, 272, 126745.
24. Wang, Z.; Zhu, F.; Tang, S.; Zhao, R.; He, L.; Song, J. Feature erasing and diffusion network for occluded person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4754–4763.
25. Chen, P.; Liu, W.; Dai, P.; Liu, J.; Ye, Q.; Xu, M.; Chen, Q.; Ji, R. Occlude them all: Occlusion-aware attention network for occluded person re-id. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 11833–11842.
26. Shu, X.; Li, H.; Wen, W.; Qiao, R.; Li, N.; Ruan, W.; Su, H.; Wang, B.; Chen, S.; Zhou, J. Precise occlusion-aware and feature-level reconstruction for occluded person re-identification. Neurocomputing 2025, 616, 128919.
27. Wang, T.; Liu, M.; Liu, H.; Li, W.; Ban, M.; Guo, T.; Li, Y. Feature completion transformer for occluded person re-identification. IEEE Trans. Multimed. 2024, 26, 8529–8542.
28. Zhao, C.; Qu, Z.; Jiang, X.; Tu, Y.; Bai, X. Content-adaptive auto-occlusion network for occluded person re-identification. IEEE Trans. Image Process. 2023, 32, 4223–4236.
29. Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Zhang, S.; Chen, K. RTMDet: An empirical study of designing real-time object detectors. arXiv 2022, arXiv:2212.07784.
30. Kreiss, S.; Bertoni, L.; Alahi, A. PifPaf: Composite fields for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 11977–11986.
31. Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q.; Wang, S. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 480–496.
32. Wang, G.; Yuan, Y.; Chen, X.; Li, J.; Zhou, X. Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 274–282.
33. He, S.; Luo, H.; Wang, P.; Wang, F.; Li, H.; Jiang, W. TransReID: Transformer-based object re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 15013–15022.
34. Chen, W.; Xu, X.; Jia, J.; Luo, H.; Wang, Y.; Wang, F.; Jin, R.; Sun, X. Beyond appearance: A semantic controllable self-supervised learning framework for human-centric visual tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 15050–15061.
35. Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin Transformer V2: Scaling Up Capacity and Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022.
36. Mao, G.; Liao, G.; Zhu, H.; Sun, B. Multibranch attention mechanism based on channel and spatial attention fusion. Mathematics 2022, 10, 4150.
37. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
38. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
39. Luo, H.; Gu, Y.; Liao, X.; Lai, S.; Jiang, W. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–20 June 2019.
40. Zheng, W.S.; Li, X.; Xiang, T.; Liao, S.; Lai, J.; Gong, S. Partial person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015; pp. 4678–4686.
41. Zheng, W.S.; Gong, S.; Xiang, T. Person re-identification by probabilistic relative distance comparison. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 649–656.
42. Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; Tian, Q. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1116–1124.
43. Zheng, Z.; Zheng, L.; Yang, Y. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3754–3762.
44. Wang, P.; Ding, C.; Shao, Z.; Hong, Z.; Zhang, S.; Tao, D. Quality-aware part models for occluded person re-identification. IEEE Trans. Multimed. 2022, 25, 3154–3165.
45. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
46. Huang, M.; Hou, C.; Yang, Q.; Wang, Z. Reasoning and tuning: Graph attention network for occluded person re-identification. IEEE Trans. Image Process. 2023, 32, 1568–1582.
47. Zheng, H.; Shi, Y.; Ling, H.; Li, Z.; Wang, R.; Li, Z.; Li, P. Cascade transformer reasoning embedded by uncertainty for occluded person re-identification. IEEE Trans. Biom. Behav. Identity Sci. 2024, 6, 219–229.
48. Xu, B.; He, L.; Liang, J.; Sun, Z. Learning feature recovery transformer for occluded person re-identification. IEEE Trans. Image Process. 2022, 31, 4651–4662.
49. Pang, Y.; Zhang, H.; Zhu, L.; Liu, D.; Liu, L. Self-similarity guided probabilistic embedding matching based on transformer for occluded person re-identification. Expert Syst. Appl. 2024, 237, 121504.
50. He, L.; Liang, J.; Li, H.; Sun, Z. Deep spatial feature reconstruction for partial person re-identification: Alignment-free approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7073–7082.
51. Sun, Y.; Xu, Q.; Li, Y.; Zhang, C.; Li, Y.; Wang, S.; Sun, J. Perceive where to focus: Learning visibility-aware part-level features for partial person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 393–402.
52. Luo, H.; Jiang, W.; Fan, X.; Zhang, C. STNReID: Deep convolutional networks with pairwise spatial transformer networks for partial person re-identification. IEEE Trans. Multimed. 2020, 22, 2905–2913.
53. Miao, J.; Wu, Y.; Yang, Y. Identifying visible parts via pose estimation for occluded person re-identification. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 4624–4634.
54. Zhao, Y.; Zhu, S.; Wang, D.; Liang, Z. Short range correlation transformer for occluded person re-identification. Neural Comput. Appl. 2022, 34, 17633–17645.
55. Zhu, K.; Guo, H.; Liu, Z.; Tang, M.; Wang, J. Identity-guided human semantic parsing for person re-identification. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part III 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 346–363.

Figure 1. Examples of synthetically occluded images in the Occluded-Duke dataset. Occlusions are generated by randomly pasting segmented pedestrians and objects—extracted using an external instance segmentation model [29]—onto original images.

Figure 2. Overall framework of our DOAN. The model first randomly selects an occlusion pattern from an external occlusion bank and overlays it onto the input training image. A Swin Transformer is then used to extract multi-scale feature maps. These features are fed into the OASA module, which leverages the proposed PCSA block and the proposed occlusion-aware parsing labels to capture fine-grained semantic part representations. Meanwhile, the OAR module is applied in parallel to enhance the model’s awareness and robustness under occluded conditions.

Figure 3. External occlusion bank constructed from the Market-1501 dataset, which is used to synthesize occluded images for the Occluded-Duke dataset.

Figure 4. Structure of the Occlusion-Aware Semantic Attention (OASA) module. Multi-level features from the Swin Transformer [35] are fused and refined through the proposed PCSA block to better distinguish occlusions from body regions. A pixel classifier then generates semantic part representations, guided by occlusion-aware parsing labels.

Figure 5. Illustration of the PCSA block. It consists of two submodules: parallel channel attention and spatial attention. The former models channel-height and channel-width interactions via dual attention branches to enhance feature representations. The latter refines these features by emphasizing informative spatial regions, thereby suppressing occlusion noise and improving pixel-level body part classification.

Figure 6. Examples of occlusion-aware human parsing labels. Human parsing annotations are first obtained using the method in [30] and then filtered by occlusion masks generated from external sources (see Section 3.1). The resulting labels emphasize visible body parts under occlusion and are used to improve the OASA module’s performance.

Figure 7. Structure of the occlusion-aware recovery (OAR) module. The OAR module receives the final-stage feature from the Swin Transformer and uses a decoder to reconstruct it into a recovered image that matches the original input.

Figure 8. Visualization of semantic body part attention maps generated by the OASA module in our DOAN model on the Occluded-Duke dataset. The first column represents the input query image and the corresponding returned rank list. The green boxes in the first column indicate images of the same identity, while the red boxes represent different identities. “dist” denotes the Euclidean distance between image features. The column at index 0 shows the foreground attention map of the input image, and columns at indices 1 to 8 correspond to the attention maps of the k semantic body parts, where k is set to 8. In the columns from index 0 to 8, green indicates that the corresponding body part is visible, while red indicates it is occluded.

Figure 9. Visualization of the outputs from our OAR module. The first column shows the original images. The second column presents the model inputs after the preprocessing steps, such as resizing and normalization. The third column displays the reconstructed outputs generated by our OAR module.

Figure 10. Comparison of the top-10 retrieval results on the Occluded-Duke dataset among the base model, base + OASA, and our DOAN. The first column represents the input pedestrian image, and the following 10 columns show the returned rank list results. Correct retrievals are highlighted with green boxes, while incorrect retrievals are marked with red boxes.

Table 1. Comparison results on occluded person Re-ID datasets.

Methods	Backbone Type	Occluded-Duke Rank-1 (%)	Occluded-Duke mAP (%)	Occluded-REID Rank-1 (%)	Occluded-REID mAP (%)
PCB [31]	CNN-based	42.6	33.7	41.3	38.9
PGFA [4]	CNN-based	51.4	37.3	-	-
PVPM [5]	CNN-based	47.0	37.7	66.8	59.5
HoReID [6]	CNN-based	55.1	43.8	80.3	70.2
PGFL [8]	CNN-based	63.0	54.1	80.7	70.3
RTGAT [46]	CNN-based	61.0	50.1	71.8	51.0
BPBreID [10]	CNN-based	66.7	54.1	76.9	68.6
CTU [47]	CNN-based	68.2	56.1	83.4	74.5
PAT [17]	Transformer-based	64.5	53.6	81.6	72.1
TransReID [33]	Transformer-based	66.4	59.2	-	-
FRT [48]	Transformer-based	70.7	61.3	80.4	71.0
MSDPA [21]	Transformer-based	70.4	61.7	81.9	77.5
SAP [11]	Transformer-based	70.0	62.2	83.0	76.8
SSPEM [49]	Transformer-based	70.2	62.8	82.8	78.5
Swin-B [35]	Transformer-based	65.6	52.8	82.5	77.9
DOAN (ours)	Transformer-based	76.8	64.6	84.9	78.8

Table 2. Comparison results on partial person Re-ID datasets.

Methods	Partial-REID Rank-1 (%)	Partial-REID Rank-3 (%)	Partial-iLIDS Rank-1 (%)	Partial-iLIDS Rank-3 (%)
PCB [31]	66.3	-	46.8	-
DSR [50]	58.8	67.2	50.7	54.8
VPM [51]	67.7	81.9	65.5	74.8
PGFA [4]	68.8	80.0	69.1	80.9
STNReID [52]	66.7	80.3	54.6	71.3
PVPM [5]	78.3	87.7	-	-
PMF [53]	72.5	83.0	70.6	81.3
MHSA-Net [20]	72.5	83.0	70.6	81.3
PFT [54]	81.3	-	74.8	87.3
QPM [44]	81.7	88.0	77.3	85.7
Swin-B [35]	77.7	81.7	76.5	90.8
DOAN (ours)	85.7	89.0	78.2	91.6

Table 3. Comparison results on holistic person Re-ID datasets.

Methods	Market-1501 Rank-1 (%)	Market-1501 mAP (%)	DukeMTMC-reID Rank-1 (%)	DukeMTMC-reID mAP (%)
PCB [31]	92.3	77.4	81.8	66.1
MGN [32]	95.7	86.9	88.7	78.4
PGFA [4]	91.2	76.8	82.6	65.5
HoReID [6]	94.2	84.9	86.9	75.6
ISP [55]	95.3	88.6	89.6	80.0
PGFL [8]	95.3	87.2	89.6	79.5
OAMN [25]	93.2	79.8	86.3	72.6
TransReID [33]	95.0	88.2	89.6	80.6
FRT [48]	95.5	88.1	90.5	81.7
BPBreID [10]	95.1	87.0	89.6	78.3
CTU [47]	95.7	88.3	89.5	78.3
SSPEM [49]	95.3	89.2	91.0	82.9
Swin-B [35]	94.6	86.2	88.2	74.7
DOAN (ours)	95.7	89.9	91.6	83.0

Table 4. Ablation study of the proposed components on the Occluded-Duke dataset.

Index	Swin-B	OASA: PC	OASA: OAL	OASA: PCSA	OASA: PCA	OASA: SA	OAR	Params	Rank-1 (%)	mAP (%)
1	✓							86.9 M	65.6	52.8
2	✓	✓						86.91 M	72.7	59.1
3	✓	✓	✓					86.91 M	74.3	62.3
4	✓	✓	✓	✓				86.91 M	74.8	63.4
5	✓	✓	✓		✓			86.91 M	74.6	62.6
6	✓	✓	✓			✓		86.91 M	74.4	62.0
7	✓						✓	90 M	66.7	54.4
8	✓	✓	✓	✓			✓	90.01 M	76.8	64.6

OASA denotes the Occlusion-Aware Semantic Attention module, which consists of three components: PC (pixel classifier), OAL (occlusion-aware human parsing label), and PCSA (parallel channel and spatial attention). PCA and SA refer to using only the parallel channel attention or only the spatial attention, respectively. OAR denotes the occlusion-aware recovery module. “Params” indicates the total parameter count of each configuration.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Sun, B.; Zhang, Y.; Wang, J.; Jiang, C. Dual-Branch Occlusion-Aware Semantic Part-Features Extraction Network for Occluded Person Re-Identification. Mathematics 2025, 13, 2432. https://doi.org/10.3390/math13152432


Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
