Article

CNN Attention Enhanced ViT Network for Occluded Person Re-Identification

Shanghai Marine Intelligent Information and Navigation Remote Sensing Engineering Technology Research Center, Shanghai Ocean University, Shanghai 201306, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(6), 3707; https://doi.org/10.3390/app13063707
Submission received: 5 February 2023 / Revised: 13 March 2023 / Accepted: 13 March 2023 / Published: 14 March 2023
(This article belongs to the Special Issue New Trends in Image Processing III)

Abstract
Person re-identification (ReID) is often affected by occlusion, which causes most of the features extracted by ReID models to contain a large amount of identity-independent noise. Recently, the use of the Vision Transformer (ViT) has enabled significant progress in various visual artificial intelligence tasks. However, ViT suffers from insufficient local information extraction capability, which should concern researchers in the field of occluded ReID. This paper studies how attention mechanisms can be exploited to enhance ViT for ReID tasks. Specifically, an Attention Enhanced ViT Network (AET-Net) is proposed for occluded ReID. We use ViT as the backbone network to extract image features, although occlusion and outlier problems still exist in ReID. We then integrate a spatial attention mechanism into the ViT architecture, which strengthens the attention of the ViT patch embedding vectors to important regions. In addition, we design a MultiFeature Training Module that optimizes the network by constructing multiple classification features and computing a multi-feature loss to enhance model performance. Finally, the effectiveness and superiority of the proposed method are demonstrated by extensive experiments on both occluded and non-occluded datasets.

1. Introduction

With the rise of digital cities, intelligent security projects cover numerous fields, and intelligent monitoring systems play a major role in maintaining the convenience, security, and stability of society. Person re-identification (ReID) aims to correlate identities captured by different image sensor devices (such as CCD or CMOS cameras) using computer vision techniques. However, ReID faces many uncontrollable factors in complex environments, such as low image resolution, light variations, pose changes, camera viewpoint changes, occlusion, and inaccurate detection, which keep ReID difficult to solve [1,2,3]. These challenges are shown in Figure 1. Among them, local occlusion often occurs in real life and significantly influences the accuracy of identity recognition models [4,5,6]. Therefore, constructing a ReID model that can handle the occlusion problem is crucial [7,8,9].
Based on a detailed observation of occlusion situations, occlusions can be classified into two types. The first type is external occlusion, caused by objects that do not belong to the person (e.g., a car or a railing); these can be easily separated. The second type is additional occlusion, caused by objects that belong to the person (e.g., a backpack or umbrella) and change with the person's posture and viewing angle. As a result, additional occlusions are difficult to separate from the person.
Current deep learning research on ReID is mainly divided into two types, supervised learning and unsupervised learning [1], but unsupervised learning is not as effective as supervised learning in terms of recognition accuracy due to the limitations of model training [10,11]. On the other hand, because ReID datasets are scarce and hard to label, some researchers have studied data sampling, weakly supervised learning, and unsupervised learning in depth [12,13,14]. In recent years, with the help of convolutional neural networks, supervised learning models have achieved breakthroughs on the occluded ReID problem [2,3,4,5,6,7,8,9]. Some researchers extracted local regions of humans by combining target segmentation [15,16,17] and semantic analysis [2,7,18,19,20,21] models to accomplish identity matching, but the introduction of additional models makes occluded ReID much more costly. Other researchers match detailed information by region segmentation, but such approaches do not work well under occlusion [15,18,22,23]: because occlusions occur randomly, these methods often fail to match during identification. Furthermore, many works have introduced attention mechanisms into ReID models to filter the most significant parts from a huge amount of information, aiming to focus the model's attention on identity information and ignore secondary information [5,15,16,24,25,26].
It has been shown that CNNs use convolution to limit feature extraction to small regions, and that their down-sampling modules tend to lose the local details of humans [24]. In recent years, the Transformer, popular in Natural Language Processing (NLP) [27,28], was introduced to the ReID task and achieved performance comparable to CNN-based methods with the help of its self-attention mechanism [5,24,29]. However, ViT suffers from insufficient local information extraction capability [30,31], which should concern researchers in the field of occluded ReID.
We aim to improve ViT's ability to focus on important areas by incorporating CNN-style spatial attention. This method uses convolution to extract a local importance distribution, similar to how the CNN attention mechanism enhances attention to important information. In addition, CNN channel attention can extract the channel distribution of the current feature vector and, through backpropagation, enhance the model's attention to important channel features. Based on the above, this work proposes an Attention Enhanced ViT Network (AET-Net) to address the challenges of occluded ReID; the general procedure is shown in Figure 2. The specific methodology is presented in Section 3, following the related work in Section 2, and the experimental settings and results are reported in Section 4. The paper's contributions are summarized as follows:
  • We propose a ViT-based ReID framework, named AET-Net, to address the occlusion problem.
  • The Spatial Attention Enhancement Module (SAEM) is designed to strengthen the ViT patch embedding structure and enhance its attention to important local features of the image.
  • The MultiFeature Training Module (MFTM) exploits multiple losses to optimize the model and keeps the model from over-biasing toward the attention feature.
  • AET-Net achieves superior performance on the occluded dataset and competitive performance on the non-occluded datasets.

2. Related Work

2.1. Occluded Person ReID Based on Deep Learning

Occluded ReID aims to find the unoccluded regions in an image for identity matching, and the incomplete information makes this task especially challenging. However, most ReID models rely on complete human information being present in the image and are therefore highly sensitive to occlusion [16,17,20,23,32,33]. We divide occluded ReID models into two types: partial models and attention models. The pipelines of these two types of models are shown in Figure 3.

2.1.1. Partial Models

Robust local features have been demonstrated to help occluded ReID [2,16,18,20,22]. Previous methods mainly construct local regions to extract local features via human segmentation, human semantic analysis, or horizontal stripe segmentation. Song et al. [16] designed MGCAM, which uses a binary segmentation mask to segment human regions. Occlusion-oriented models such as PDC [19], HOReID [2], PGFA [7], and Spindle Net [18] construct local regions by generating human key points through semantic analysis models. Sun et al. [22] proposed PCB, which divides human images into multiple horizontal local regions from top to bottom. In addition, aligning the human body has also been shown to improve model accuracy [6,7,9,34,35,36]. However, the accuracy of these strategies is affected by the division method and the pre-trained models.

2.1.2. Attention Models

Attention mechanisms, which attempt to focus a model's attention on locally important information, have been widely studied in deep learning; they enhance recognition features and suppress irrelevant features [37]. A large number of studies have introduced them into ReID tasks and achieved excellent performance. Li et al. [38] proposed a lightweight attention network, HA-CNN, which jointly learns hard regional attention and soft pixel attention in human images to obtain invariant feature representations. Cai et al. [15] proposed MMGA, which extracts discriminable features from human images by learning global and local attention jointly. Zhang et al. [25] proposed a powerful Relation-aware Global Attention (RGA) module that strengthens the discriminative ability of features by extracting feature-to-feature relationships. Influenced by the Transformer in NLP, He et al. [24] proposed TransReID, the first Transformer-based framework for ReID, which achieved high performance on several public datasets, including occluded ones. Jia et al. [5] proposed DRL-Net, which also uses the Transformer architecture and achieves advanced performance on occluded ReID datasets.

2.2. Transformer in Computer Vision

The performance of the Transformer in NLP has brought a revolution to computer vision [39,40,41,42,43,44,45]. Dosovitskiy et al. [39] proposed a pure Transformer model, the Vision Transformer (ViT), which solves the input problem of image classification and can extract hidden relationships between sequence vectors through self-attention, achieving advanced performance on several image classification benchmarks. However, during patch embedding, ViT extracts local features with a single convolution, which is not particularly robust [30,31]. To solve this problem, researchers have attempted to integrate the advantages of the Transformer architecture with CNN structures and proposed various methods to enhance the local feature extraction capability of ViT. It has been shown that the Transformer can strengthen CNN convolution (VT [40], BoT-Net [41]). In addition, DeiT [43], ConViT [44], and ResT [45] demonstrate that Transformer architectures can be enhanced by CNN structures to achieve more advanced performance.
Different from existing ReID models, our proposed AET-Net uses the Transformer for ReID feature learning and constructs CNN attention modules applicable to ViT, improving the robustness and classification ability of ViT global features through spatial enhancement and multi-loss optimization.

3. Method

In this section, the proposed AET-Net framework is introduced, including the ViT [39] architecture used as the backbone network and the SAEM, which helps ViT concentrate on regions more relevant to the ReID task. In addition, the MFTM is devised, utilizing channel attention to construct multiple losses and thereby further optimize the network; MFTM also prevents the model from being overly biased toward attentional features. The two modules are trained jointly in an end-to-end manner. The overall architecture of AET-Net is shown in Figure 4.

3.1. Feature Extraction Backbone

This section introduces the ViT-based backbone network of AET-Net, the architecture on which reliable global human features are extracted. The use of ViT is motivated by the strong global correlation modeling of the Transformer, its success in NLP, and its applicability to computer vision. Based on these ideas and inspired by the work of He et al. [24], we use ViT as the backbone network of AET-Net, which consists of four main components.
Patch Embedding: A human image is given as a tensor $x \in \mathbb{R}^{H \times W \times C}$, where H, W, and C denote the height, width, and number of image channels, respectively. To solve the input problem of the Transformer in computer vision, ViT proposes patch embedding, which uses a fixed-size $K \times K$ convolution kernel to partition the image into $N = HW/K^2$ non-overlapping image blocks and embeds them as a sequence of feature vectors $f = \{f_1, f_2, \ldots, f_N\}$ fed into the Transformer Encoder. The process of patch embedding is shown in the green area of Figure 4.
Class Token and Positional Embedding: ViT prepends an additional learnable embedding, the class token $f_{cls}$, to the feature sequence; it serves as the classification representation and is used to summarize the state of the Transformer Encoder output. The class token is shown as [cls] in the blue part of Figure 4. Furthermore, to retain image location information, a learnable position embedding $v_{pos}$, resized by bilinear 2D interpolation, is applied in ViT. The positional embedding is shown in the lower right of the blue area in Figure 4. The standard input vectors of the Transformer Encoder are obtained by positional embedding as:
$$ f_{in} = \mathrm{cat}(f_{cls}; f) + v_{pos}, \qquad (1) $$
where $f_{in} \in \mathbb{R}^{(N+1) \times D}$ denotes the standard input vectors of the Transformer Encoder, D denotes the convolutional kernel dimension of patch embedding, and $\mathrm{cat}(\cdot)$ denotes the concatenation operation.
Transformer Encoder: A Transformer encoder is composed of L Transformer layers connected in series; each Transformer layer comprises a multi-head self-attention (MSA) layer and an MLP layer.
Supervision Learning: Similar to the general ReID framework, Cross-Entropy loss and Triplet loss are used by AET-Net to guide the learning of the entire network [5,10,22,35,36]. The difference is that AET-Net constructs attentional features from channel attention, with losses corresponding to each feature; the details of how the different losses are calculated are given in Section 3.4.
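To make the backbone input construction concrete, the following is a minimal PyTorch sketch of patch embedding, class token, and positional embedding as in Equation (1). It assumes the 16 × 16 patch size and embedding dimension D = 768 from Section 4.2 and a 256 × 128 input resolution; the class name and layout are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PatchEmbedInput(nn.Module):
    """Builds the Transformer Encoder input f_in = cat(f_cls; f) + v_pos (Equation (1))."""
    def __init__(self, img_h=256, img_w=128, patch=16, in_ch=3, dim=768):
        super().__init__()
        self.num_patches = (img_h // patch) * (img_w // patch)       # N = HW / K^2
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)  # K x K convolution
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))        # learnable [cls] token
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))  # v_pos

    def forward(self, x):                        # x: (B, C, H, W)
        f = self.proj(x)                         # (B, D, H/K, W/K)
        f = f.flatten(2).transpose(1, 2)         # (B, N, D) patch embedding sequence
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, f], dim=1) + self.pos_embed   # (B, N+1, D)

# Example: a batch of 4 person images resized to 256 x 128
x = torch.randn(4, 3, 256, 128)
print(PatchEmbedInput()(x).shape)                # torch.Size([4, 129, 768])
```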

3.2. Spatial Attention Enhancement Module

Although ViT can obtain robust global features, its approach of obtaining sequence vectors through patch embedding pays insufficient attention to local regions, which remains a challenge for occluded ReID. SAEM is inspired by the CNN attention mechanism and aims to enhance ViT's attention to important local information by strengthening the regional information of the image feature vectors after patch embedding [46]. Specifically, we generate a 2D spatial attention map by exploiting the spatial relationships between the sequence vectors of the patch embedding; this 2D spatial attention map emphasizes "where" in the given image is important. The detailed process of SAEM is shown in Figure 5a.
To calculate the spatial attention map, we first reshape the sequence vector map $f \in \mathbb{R}^{B \times N \times D}$ produced by patch embedding along the N dimension into $f \in \mathbb{R}^{B \times \frac{H}{K} \times \frac{W}{K} \times D}$, where B is the number of images in one batch, K is the convolution size of patch embedding, and $N = HW/K^2$ is the number of sequence vectors; $f$ thus summarizes the information of N regions of size $K \times K$. Second, average pooling and maximum pooling are performed separately along the channel dimension, and the two results are concatenated to produce a feature descriptor. Pooling along the channel dimension has been shown to be effective in highlighting local information [22]. A convolution is applied to the feature descriptor to generate a spatial feature corresponding to the channel dimension of $f$. Finally, the spatial feature is passed through a sigmoid activation function, which maps it into a probability distribution and yields the spatial attention map. Applying the spatial attention map to the sequence vectors can be simply defined as:
$$ f_{sa} = \mathrm{sigmoid}(SA(f)) \times f, \qquad (2) $$
where $f_{sa}$ denotes the spatial attention-enhanced feature, $\mathrm{sigmoid}(\cdot)$ denotes the probability mapping function, and $SA(\cdot)$ denotes the generation process of the spatial attention feature, which can be defined as:
$$ SA(x) = \mathrm{Conv}_{7 \times 7}(\mathrm{cat}(\mathrm{AvgPool}(x); \mathrm{MaxPool}(x))), \qquad (3) $$
where $\mathrm{AvgPool}(\cdot)$ and $\mathrm{MaxPool}(\cdot)$ denote average pooling and maximum pooling, respectively, $\mathrm{cat}(\cdot)$ denotes the concatenation operation, and $\mathrm{Conv}_{7 \times 7}$ denotes a $7 \times 7$ convolution.
The feature vector $f_{sa}$ generated by SAEM is combined with the original feature vector $f$ in a weighted sum controlled by the hyperparameter $\lambda$; this process can be defined as:
$$ \hat{f}_{sa} = \lambda \times f_{sa} + (1 - \lambda) \times f, \qquad (4) $$
where $\hat{f}_{sa}$ denotes the weighted vector. It is then reshaped to $B \times N \times D$ and concatenated with the classification representation $f_{cls}$; the concatenation result is added to the position embedding and fed into the Transformer Encoder. The weighting hyperparameter $\lambda$ is analyzed in the ablation study. The overall process of SAEM is shown in the blue part of Figure 4.
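The following is a minimal PyTorch sketch of SAEM under the above description (Equations (2) through (4)); the tensor layout is adapted to PyTorch's channels-first convention, the default λ = 0.9 is taken from the ablation study, and the class name and convolution padding are illustrative choices rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SAEM(nn.Module):
    """Sketch of the Spatial Attention Enhancement Module (Equations (2)-(4))."""
    def __init__(self, lam=0.9, kernel=7):
        super().__init__()
        self.lam = lam                                    # weighting hyperparameter lambda
        self.conv = nn.Conv2d(2, 1, kernel, padding=kernel // 2, bias=False)  # 7x7 convolution

    def forward(self, f, hk, wk):                         # f: (B, N, D) with N = hk * wk
        B, N, D = f.shape
        fm = f.transpose(1, 2).reshape(B, D, hk, wk)      # reshape to an (H/K) x (W/K) map
        avg = fm.mean(dim=1, keepdim=True)                # average pooling along channels
        mx, _ = fm.max(dim=1, keepdim=True)               # max pooling along channels
        sa = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # spatial attention map
        f_sa = sa * fm                                    # Equation (2)
        out = self.lam * f_sa + (1.0 - self.lam) * fm     # Equation (4): weighted sum
        return out.reshape(B, D, N).transpose(1, 2)       # back to (B, N, D)

# Example with the settings above: a 16 x 8 patch grid and D = 768
f = torch.randn(4, 128, 768)
print(SAEM()(f, 16, 8).shape)                             # torch.Size([4, 128, 768])
```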

3.3. MultiFeature Training Module

ViT has gained popularity for its global feature extraction, but its fixed training mode, which uses a single global feature for classification, is restrictive compared with the flexible and versatile training of CNNs. It has been shown that multiple deep learning features can correspond to the same label, and combining multiple features during training can help improve effectiveness [1,22,47].
We construct a MultiFeature Training Module (MFTM) for ViT to improve the effectiveness of AET-Net, as shown in the upper half of the blue, orange, and gray areas in Figure 4. The original ViT directly uses the classification representation $g_{cls} \in \mathbb{R}^{1 \times C}$ output by the Transformer Encoder as the global feature, but using it alone for classification inevitably loses some information from the remaining N encoding vectors. In detail, MFTM considers two characteristics of the global feature. First, its spatial dimension is 1. At the same time, it is strongly representative because it integrates information from all channels of the entire image. However, under occlusion, a large amount of occlusion and complex background is blended into this feature, which is unfavorable for identity recognition. In other words, although the global feature of the Transformer Encoder is highly representative, too much noise is mixed into its channel dimension. In contrast to spatial attention, channel attention emphasizes "what" is important in a given image. The proposed MFTM conducts training by constructing an attention global feature jointly with the original ViT global feature; the construction of the attention global feature is shown in Figure 5b. It consists of two stages: computation of the channel attention map and generation of the attention global feature.
For the calculation of the channel attention map, we first perform spatial compression on the reshaped vector f after patch embedding; two feature descriptors are generated by aggregating spatial information using mean and maximum pooling. Next, two feature descriptors are fed into a shared multilayer perceptron to generate two channel features, and then the two channel features are combined into one whole channel feature. Finally, the synthesized channel features are fed into a sigmoid function which yields a channel attention map. The calculation process of a channel attention map is represented as follows:
$$ w_{ca} = \mathrm{sigmoid}(CA(f)), \qquad (5) $$
where $w_{ca}$ denotes the generated channel attention map, $f$ denotes the reshaped vector after patch embedding, and $CA(\cdot)$ denotes the extraction process of channel attention. The detailed process of $CA(\cdot)$ can be expressed as follows:
$$ CA(x) = \mathrm{MLP}(\mathrm{AvgPool}(x)) + \mathrm{MLP}(\mathrm{MaxPool}(x)), \qquad (6) $$
where $\mathrm{MLP}(\cdot)$ denotes a multilayer perceptron shared by the two feature descriptors. The global feature output by the Transformer Encoder is then weighted by the channel attention map to generate the attention global feature. This process can be simply defined as:
$$ g_{attn} = w_{ca} \times g, \qquad (7) $$
where $g_{attn}$ denotes the attention global feature and $g$ denotes the global feature output by the Transformer Encoder. Cross-Entropy loss and Triplet loss are then calculated for both $g_{attn}$ and $g$. Some studies have shown that attention mechanisms may cause the model to ignore important secondary information [25,28,37]. To keep the model from being overly biased toward the attention global feature, we introduce a weight hyperparameter $\delta$ to suppress the attention global feature and its losses. Specifically, during training, $\delta$ balances the losses generated by the attention global feature $g_{attn}$ and the global feature $g$; during testing, $\delta$ balances $g_{attn}$ and $g$ themselves. This can be defined as follows:
$$ \hat{z} = \delta \times z_{attn} + (1 - \delta) \times z, \qquad (8) $$
where $\hat{z}$ denotes the balanced term, $\delta$ denotes the weight hyperparameter, which is investigated extensively in our ablation studies, $z_{attn}$ denotes the attention term, which is generally the attention global feature or its losses, and $z$ denotes the non-attention term, which is generally the global feature or its losses.
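A minimal PyTorch sketch of the channel-attention part of MFTM (Equations (5) through (8)) follows; the hidden-layer reduction ratio of the shared MLP and the class name are assumptions not specified in the paper, and δ = 0.6 is taken from the ablation study.

```python
import torch
import torch.nn as nn

class MFTM(nn.Module):
    """Sketch of the channel attention in the MultiFeature Training Module (Equations (5)-(8))."""
    def __init__(self, dim=768, reduction=16, delta=0.6):
        super().__init__()
        self.delta = delta                                # weight hyperparameter delta
        self.mlp = nn.Sequential(                         # MLP shared by both descriptors
            nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim))

    def forward(self, f, g):                  # f: (B, N, D) patch tokens, g: (B, D) global feature
        avg = f.mean(dim=1)                   # spatial average pooling -> (B, D)
        mx, _ = f.max(dim=1)                  # spatial max pooling     -> (B, D)
        w_ca = torch.sigmoid(self.mlp(avg) + self.mlp(mx))     # Equations (5)-(6)
        g_attn = w_ca * g                     # Equation (7): attention global feature
        g_bal = self.delta * g_attn + (1.0 - self.delta) * g   # Equation (8), applied at test time
        return g_attn, g_bal

f, g = torch.randn(4, 128, 768), torch.randn(4, 768)
g_attn, g_bal = MFTM()(f, g)
print(g_attn.shape, g_bal.shape)              # torch.Size([4, 768]) torch.Size([4, 768])
```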

3.4. Loss Functions

As in the regular ReID task, Cross-Entropy loss and Triplet loss are constructed to optimize the network. The difference is that we construct both kinds of losses for the attention global feature and for the global feature (as shown in the gray part of Figure 4).

3.4.1. SoftMax Cross-Entropy Loss

For the global feature $g$ and the attention global feature $g_{attn}$, we calculate the Cross-Entropy losses $L_{ID}$ and $L_{ID\_attn}$, respectively. The Cross-Entropy loss is defined as:
$$ L_{ID} = -\sum_{i} y(i) \log p(i), \qquad (9) $$
where $y(i)$ denotes the real label of the i-th image and $p(i)$ denotes the identity prediction probability distribution of the i-th image generated by the two global features through their respective fully connected layers. The weights of $L_{ID}$ and $L_{ID\_attn}$ are controlled by the weight hyperparameter $\delta$, as shown in (8).
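As a small illustration, the sketch below computes the two identity losses with PyTorch's softmax cross-entropy, which is equivalent to Equation (9) with one-hot targets; the separate linear classifiers and the 751-class output size (the Market-1501 training identities) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative classifiers for the two global features (751 = Market-1501 training identities)
fc, fc_attn = nn.Linear(768, 751), nn.Linear(768, 751)

def id_losses(g, g_attn, labels):
    """Equation (9) for both features; F.cross_entropy applies softmax CE with one-hot y(i)."""
    l_id = F.cross_entropy(fc(g), labels)                 # L_ID from the global feature g
    l_id_attn = F.cross_entropy(fc_attn(g_attn), labels)  # L_ID_attn from g_attn
    return l_id, l_id_attn

g, g_attn = torch.randn(8, 768), torch.randn(8, 768)
labels = torch.randint(0, 751, (8,))
print([round(l.item(), 3) for l in id_losses(g, g_attn, labels)])
```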

3.4.2. Triplet Loss

The Triplet loss is constructed so that the model performs better on hard samples; following the general supervised learning procedure, we build the Triplet losses $L_{tri}$ and $L_{tri\_attn}$ from the global feature and the attention global feature, respectively. Triplet loss generally reduces the intra-class distance and enlarges the inter-class distance. More precisely, the Triplet loss with a soft margin is given by:
$$ L_{tri} = \ln\left[ 1 + \exp\left( \| f_A - f_P \|_2^2 - \| f_A - f_N \|_2^2 \right) \right], \qquad (10) $$
where $L_{tri}$ denotes $L_{tri}$ or $L_{tri\_attn}$, and $f_A$, $f_P$, and $f_N$ denote the anchor, positive, and negative sample features, respectively. These feature vectors are derived from the global feature $g$ and the attention global feature $g_{attn}$.
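A minimal sketch of the soft-margin Triplet loss of Equation (10) is shown below; batch-hard mining of positive and negative samples is omitted, and the helper name is illustrative.

```python
import torch
import torch.nn.functional as F

def soft_margin_triplet(f_a, f_p, f_n):
    """Equation (10): ln(1 + exp(||f_A - f_P||_2^2 - ||f_A - f_N||_2^2))."""
    d_ap = (f_a - f_p).pow(2).sum(dim=1)          # squared L2 distance anchor-positive
    d_an = (f_a - f_n).pow(2).sum(dim=1)          # squared L2 distance anchor-negative
    return F.softplus(d_ap - d_an).mean()         # softplus(x) = ln(1 + exp(x))

# Example on random global features; in AET-Net the same loss is applied to g and g_attn
f_a, f_p, f_n = torch.randn(8, 768), torch.randn(8, 768), torch.randn(8, 768)
print(soft_margin_triplet(f_a, f_p, f_n).item())
```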

3.4.3. Total Loss

For recognizing identities, important secondary regions are sometimes necessary, and introducing attention may make the model overly dependent on the single most important local region. Therefore, we introduce the weight hyperparameter $\varphi$ to improve the constraints on the attention global feature. The total loss of AET-Net is defined as:
$$ L_{total} = \varphi \times L_{ID\_total} + (1 - \varphi) \times L_{tri\_total}, \qquad (11) $$
where $\varphi$ is the weighting parameter used to balance the two types of losses and is set to 0.5 in the experiments; $L_{ID\_total}$ and $L_{tri\_total}$ denote the balanced total Cross-Entropy loss and total Triplet loss, respectively. AET-Net is trained end-to-end by minimizing $L_{total}$.
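The sketch below assembles the total loss of Equation (11), with the attention-feature losses balanced by δ as described in Section 3.3; exactly how $L_{ID\_total}$ and $L_{tri\_total}$ are composed from the per-feature losses is inferred from Equation (8) and is therefore an assumption.

```python
import torch

def total_loss(l_id, l_id_attn, l_tri, l_tri_attn, phi=0.5, delta=0.6):
    """Equation (11); the per-feature losses are balanced by delta following Equation (8)."""
    l_id_total = delta * l_id_attn + (1.0 - delta) * l_id
    l_tri_total = delta * l_tri_attn + (1.0 - delta) * l_tri
    return phi * l_id_total + (1.0 - phi) * l_tri_total

# Example with dummy scalar losses
print(total_loss(torch.tensor(6.6), torch.tensor(6.7),
                 torch.tensor(0.9), torch.tensor(1.0)))
```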

4. Results and Discussion

4.1. Datasets and Evaluation Protocols

4.1.1. Datasets

Our experiments were performed on the three public ReID datasets Market-1501 [48], DukeMTMC-ReID [49], and Occluded-DukeMTMC [7]; details of these datasets are given in Table 1.
The Market-1501 dataset consists of 32,668 images from 6 cameras of 1501 people, with 12,936 images from 751 people serving as the training set and 19,732 images from another 750 people serving as the test set.
The DukeMTMC-ReID dataset contains 36,411 images of 1812 humans captured by eight cameras. A total of 16,522 images from 702 humans were randomly selected as the training set, while another 19,889 images from 702 humans were used as the test set.
The Occluded-Duke dataset, a subset of DukeMTMC-ReID, was constructed to study occluded ReID. In contrast to other datasets, Occluded-Duke is selected from DukeMTMC so that 9%/100%/10% of the train/query/gallery images, respectively, are occluded.

4.1.2. Evaluation Protocols

The model’s performance is assessed using standard evaluation metrics, such as mean average precision (mAP) [48] and cumulative matching characteristics (CMC) [50] at Rank-1, which are commonly used in general ReID methods.
The performance of the ReID model can be comprehensively evaluated by the mean average precision (mAP), which measures how well the model places the correct predicted results of a query image at the top of the retrieved list. To calculate mAP, the average precision (AP) of the model for each identity prediction result must first be computed. The AP calculation process can be described as follows:
$$ AP_i = \sum_{i=1}^{N} p_i I_i, \qquad (12) $$
where $p_i$ denotes the precision at the i-th retrieved result, $I_i$ indicates whether the i-th retrieved image contains the target identity, and N represents the total number of retrieved results.
The mAP is the mean of the AP scores over all identities, and this process can be mathematically represented as:
$$ mAP = \frac{\sum_{i=1}^{N} AP_i}{N}, \qquad (13) $$
where $AP_i$ denotes the average precision of the i-th identity and N the total number of identities.
In general, the AP metric evaluates the performance of a learned model on a single category, while the mAP metric evaluates the model’s overall performance across all identity targets.
Rank-n is a frequently used evaluation metric in ReID, derived from the cumulative matching characteristic (CMC) curve. The value of n specifies a particular setting: the metric distances between query-set features and gallery-set features are ranked, and the metric measures the rate at which a correct result for the query appears before the n-th position, i.e., the identity of the query is found among the most similar gallery images. In ReID tasks, n is usually set to 1, 5, 10, or 20; in this paper, n is set to 1.
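For illustration, the sketch below shows one common way of computing per-query AP, mAP, and Rank-1 from ranked gallery matches, as is typically done in ReID toolkits; it is a simplified example under stated assumptions, not the authors' evaluation code, and it ignores the camera-filtering conventions of the benchmarks.

```python
import numpy as np

def ap_per_query(relevance):
    """Average precision for one query from a ranked 0/1 relevance list over the gallery."""
    relevance = np.asarray(relevance, dtype=np.float32)
    if relevance.sum() == 0:
        return 0.0
    hits = np.cumsum(relevance)
    precision_at_k = hits / (np.arange(len(relevance)) + 1)   # precision at each rank
    return float((precision_at_k * relevance).sum() / relevance.sum())

def map_and_rank1(all_relevance):
    """mAP and Rank-1 over a list of per-query relevance lists."""
    aps = [ap_per_query(r) for r in all_relevance]
    rank1 = float(np.mean([r[0] for r in all_relevance]))     # is the top match correct?
    return float(np.mean(aps)), rank1

# Two toy queries: gallery results ranked by similarity, 1 = same identity, 0 = different
print(map_and_rank1([[1, 0, 1, 0], [0, 1, 0, 0]]))            # approximately (0.667, 0.5)
```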

4.2. Implementation Details

Data Augmentation: During the training and inference phases, all images are resized to a fixed size. In addition, training images are augmented with random horizontal flipping, random cropping, and random erasing, each with a probability of 0.5.
Backbone: AET-Net is implemented in the PyTorch 1.8.0 deep learning framework; the patch size is set to $16 \times 16$, and the convolutional kernel dimension D of patch embedding is set to 768. The initial ViT weights are first pre-trained on ImageNet-21K and then fine-tuned on ImageNet-1K. Note that label smoothing is not used.
Training: The experiments were conducted on an Nvidia GeForce 3090 GPU with a batch size of 64 and 4 instances per identity; the total training time was 69 min 58 s. For comparison with the ViT-based baseline model TransReID, training was carried out for 120 epochs using SGD as the optimizer, with an initial learning rate of 0.008 and a cosine learning rate schedule for decay.
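A minimal sketch of this training configuration with PyTorch and torchvision is given below; the 256 × 128 resize, the pad-then-crop augmentation, the normalization statistics, and the SGD momentum and weight decay are common ReID defaults assumed here, not values stated in the paper, and the per-transform probability of 0.5 is only approximated.

```python
import torch
from torchvision import transforms

# Data augmentation roughly matching the description above (assumed 256 x 128 input size)
train_transform = transforms.Compose([
    transforms.Resize((256, 128)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.Pad(10),
    transforms.RandomCrop((256, 128)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.5),
])

# SGD optimizer with the stated initial learning rate and a cosine schedule over 120 epochs
model = torch.nn.Linear(768, 751)                  # placeholder for the full AET-Net model
optimizer = torch.optim.SGD(model.parameters(), lr=0.008, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=120)

for epoch in range(120):
    # ... one training epoch over batches of 64 images (4 instances per identity) ...
    scheduler.step()
```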

4.3. Comparison with State-of-the-Art Methods

To showcase the effectiveness of our supervised ReID model, AET-Net, we compare it with current dominant models, divided into partial models and attention models, on three public datasets. Although several state-of-the-art models utilize either the attention mechanism or the Transformer architecture, few integrate both in a ReID model, and such models remain sensitive to the occlusion problem.

4.3.1. Results for the Non-Occluded Dataset

In Table 2, we show the results of AET-Net and other models on the two non-occluded datasets, Market-1501 and DukeMTMC. AET-Net achieved 87.5% mAP and 94.8% Rank-1 accuracy on Market-1501, surpassing the mAP of the state-of-the-art partial models in the table and outperforming most models in Rank-1. On DukeMTMC, AET-Net also performed well, reporting 80.1% mAP and 89.5% Rank-1 accuracy, exceeding all state-of-the-art models in the table, with +0.8%/+3.5% mAP and +0.7%/+1.4% Rank-1 improvements over TransReID and DRL-Net, respectively. This indicates the effectiveness of AET-Net on the ReID task; the modest improvement is due to the attention mechanism that enhances ViT's ability to extract local details.

4.3.2. Results for the Occluded Dataset

The results of AET-Net on the Occluded-Duke dataset are presented in Table 3, along with those of other classic models for comparison. Occluded-Duke images contain more background clutter, occlusions, and other confusing factors, which lower the accuracy of all models. Nevertheless, the proposed AET-Net reported 54.5% mAP and 64.5% Rank-1 accuracy, outperforming the best partial model, HOReID, by 10.7% mAP and 9.4% Rank-1 and exceeding the best attention model, PAT, by 0.9% mAP. In particular, AET-Net shows clear improvements over the Transformer-based TransReID, DRL-Net, and PAT. These outcomes suggest that the proposed AET-Net improves the recognition performance and classification ability of the model.

4.4. Ablation Study and Visualization

We performed a comprehensive ablation study on Market-1501 and Occluded-Duke to verify the efficacy of each component in AET-Net and to design the optimal architecture. To ensure a fair comparison, none of the models used the rearrange operation. The results of the ablation study are presented in Table 4.

4.4.1. Transformer Architecture

Table 4 presents a comparison of the results between the ViT-based ReID baseline model and the CNN-based model on the Market-1501 and Occluded-Duke. The results show that the ViT-based model achieves comparable performance to the CNN-based model on the non-occluded datasets. However, on the occluded dataset, the ViT-based model far outperforms the CNN-based model. From these results, it can be concluded that using the Transformer architecture to extract potential relationships among the local region of human targets would help to address the occlusion challenge in the ReID task.

4.4.2. Spatial Attention Enhancement Module

We conducted several ablation studies on the SAEM of AET-Net to verify its efficacy. To explore the effect of the weighting hyperparameter $\lambda$ on SAEM, we gradually increased $\lambda$ on the Occluded-Duke dataset to strengthen the effect of the spatial attention feature; the results are shown in Figure 6a. SAEM performs best on Occluded-Duke when $\lambda$ is 0.9. The SAEM ablation results on Market-1501 and Occluded-Duke are listed in Table 5 (index-1 to index-3); they show that SAEM provides a slight performance improvement on both the occluded and non-occluded datasets. The results indicate that the classification ability of the global features extracted by the model is improved under the effect of spatial attention.

4.4.3. Multi-Feature Training Module

In the same way as for SAEM, we conducted an ablation study on the MFTM of AET-Net to verify its validity. The analysis of the balance parameter $\delta$ on Occluded-Duke is shown in Figure 6b; MFTM performs best on Occluded-Duke when $\delta$ is 0.6. Table 5 shows the MFTM ablation results (index-4 and index-5), demonstrating that MFTM yields a larger performance improvement than SAEM.

4.4.4. Integration of the Two Modules

To verify that the two attention modules together can better improve the performance of the ViT architecture, we conducted an experiment uniting the two modules; the results are presented in Table 5 (index-6 and index-7). The performance under the combined action of the two attention modules exceeds that of either individual module, and the improvement is larger on the occluded dataset than on the non-occluded dataset.

4.4.5. Inferential Costs

We measured the inference time, computation (FLOPs), and number of parameters of AET-Net in the following experiments. To reduce the inference-time error resulting from differences in the memory I/O of the device, the input for all model tests was set to $64 \times 256 \times 128$. The results of this analysis are shown in Table 6. To make the comparison more explicit, the inference cost of ResNet50 is expressed as 1x.
The results show that adding SAEM slightly accelerates the inference of the model; although the effect is not significant, it suggests a feasible way of accelerating Transformer-based models. The attention models have a slightly higher cost in FLOPs and parameters than the baseline model, owing to the additional operations. Compared with SAEM, MFTM requires more complex computations and network nodes for its refinement operations in the channel dimension and is therefore more costly. Finally, the attention enhancement modules are lightweight compared with the Jigsaw Patch Module in TransReID.
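As an illustration of how such costs can be measured, the sketch below counts learnable parameters and times the forward pass of a model for a 64-image batch in PyTorch; the batch shape adds the implicit 3-channel dimension, the warm-up and run counts are arbitrary choices, and FLOPs would typically be obtained with an external profiler rather than the code shown here.

```python
import time
import torch
from torchvision.models import resnet50

def count_parameters(model):
    """Total number of learnable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

@torch.no_grad()
def measure_inference_time(model, input_shape=(64, 3, 256, 128), runs=50, device="cuda"):
    """Average forward-pass time for a 64 x 3 x 256 x 128 batch, with warm-up and GPU sync."""
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    for _ in range(10):                          # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model(x)
    torch.cuda.synchronize()
    return (time.time() - start) / runs

# Example with a torchvision ResNet50 as the 1x reference model
ref = resnet50()
print(count_parameters(ref))
# print(measure_inference_time(ref))             # requires a CUDA device
```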

4.4.6. Visualization

We conducted feature map visualization to further demonstrate the recognition behavior of AET-Net. As shown in Figure 7, images of four people from the Occluded-Duke dataset, covering no occlusion, external occlusion, additional occlusion, and multiple occlusions, illustrate the attention of the model in different scenes.
The visualization results show that the CNN-based model (ResNet50) distributes its perceptual fields mostly over small regions because of convolution. In contrast, the self-attention mechanism in ViT captures inter-region dependencies, enabling the model to focus on multiple regions of the image. However, ViT is limited by patch embedding, and much of its attention is scattered over occluded or background regions, which is inadequate for a ReID task. The added SAEM changes the patch embedding state of ViT and gives the model more attention in identity-discriminative regions. In addition, MFTM fuses different features, so the model's attention is distributed over a larger number of discriminable regions of the image. Moreover, the joint application of SAEM and MFTM integrates the features of the two modules, so that the model's attention is heavily concentrated on the human regions. The results show that AET-Net indirectly improves the extraction of local features by enhancing the model's attention to identity-discriminative regions.

5. Conclusions

In this work, we propose a CNN attention-enhanced ViT network for occluded person re-identification. First, AET-Net uses the Vision Transformer as the backbone network for feature extraction, which can extract the implied relationships between regions; this improves mAP by 9.3% and Rank-1 accuracy by 5.4% on the Occluded-Duke dataset compared to the CNN-based model HOReID. Second, we designed the spatial attention enhancement module to make the ViT model focus more on locally important regions. Additionally, we designed the multi-feature training module, which guides channel attention by computing multiple losses and reduces the model's over-bias toward attention features. We conducted extensive studies on three benchmark datasets, and the proposed model outperforms the baseline model TransReID by 0.7% mAP on the non-occluded dataset Market-1501. The experimental results show that AET-Net performs well on occluded ReID datasets, exceeding the baseline model by 1.4% mAP and 4% Rank-1. This work provides a new idea for Transformer-based ReID tasks.

Author Contributions

J.W. and P.L. conceived and designed the framework of the study. R.Z. (Rongfeng Zhao) completed the data collection and processing. R.Z. (Ruyan Zhou) and Y.H. completed the data analysis. J.W. and P.L. completed the algorithm design and model construction and were the lead authors of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

The work was funded by the National Natural Science Foundation of China, grant/award number: 61806123 and the National Key R&D Program of China, grant/award number: 2019YFD0900805.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data in this paper are public datasets that can be downloaded by all researchers through search engines.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ye, M.; Shen, J.; Lin, G. Deep learning for person re-identification: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 2872–2893.
  2. Wang, G.A.; Yang, S.; Liu, H. High-order information matters: Learning relation and topology for occluded person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6449–6458.
  3. Wang, P.; Ding, C.; Shao, Z.; Hong, Z.; Zhang, S.; Tao, D. Quality-aware part models for occluded person re-identification. IEEE Trans. Multimed. 2022.
  4. Huang, H.; Li, D.; Zhang, Z. Adversarial occluded samples for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5098–5107.
  5. Jia, M.; Cheng, X.; Lu, S. Learning disentangled representation implicitly via transformer for occluded person re-identification. IEEE Trans. Multimed. 2022.
  6. Jia, M.; Cheng, X.; Zhai, Y. Matching on sets: Conquer occluded person re-identification without alignment. Proc. AAAI Conf. Artif. Intell. 2021, 35, 1673–1681.
  7. Miao, J.; Wu, Y.; Liu, P. Pose-guided feature alignment for occluded person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 542–551.
  8. Li, Y.; He, J.; Zhang, T. Diverse part discovery: Occluded person re-identification with part-aware transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2898–2907.
  9. He, L.; Wang, Y.; Liu, W. Foreground-aware pyramid reconstruction for alignment-free occluded person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8450–8459.
  10. Wang, H.; Lu, J.; Pang, F. Bi-directional Style Adaptation Network for Person Re-Identification. IEEE Sens. J. 2022, 22, 12339–12347.
  11. Hu, Z.; Hou, W.; Liu, X. Deep Batch Active Learning and Knowledge Distillation for Person Re-Identification. IEEE Sens. J. 2022, 22, 14347–14355.
  12. Wu, Y.; Lin, Y.; Dong, X. Exploit the unknown gradually: One-shot video-based person re-identification by stepwise learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5177–5186.
  13. Wu, Y.; Lin, Y.; Dong, X. Progressive learning for person re-identification with one example. IEEE Trans. Image Process. 2019, 28, 2872–2881.
  14. Lin, Y.; Wu, Y.; Yan, C. Unsupervised person re-identification via cross-camera similarity exploration. IEEE Trans. Image Process. 2020, 29, 5481–5490.
  15. Cai, H.; Wang, Z.; Cheng, J. Multi-scale body-part mask guided attention for person re-identification. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–17 June 2019; pp. 1555–1564.
  16. Song, C.; Huang, Y.; Ouyang, W. Mask-guided contrastive attention model for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1179–1188.
  17. Liu, M.; Yan, X.; Wang, C. Segmentation mask-guided person image generation. Appl. Intell. 2020, 51, 1161–1176.
  18. Zhao, H.; Tian, M.; Sun, S. Spindle net: Person re-identification with human body region guided feature decomposition and fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1077–1085.
  19. Su, C.; Li, J.; Zhang, S. Pose-driven deep convolutional model for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3960–3969.
  20. Kalayeh, M.M.; Basaran, E.; Gökmen, M. Human semantic parsing for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1062–1071.
  21. Zhu, K.; Guo, H.; Liu, Z. Identity-guided human semantic parsing for person re-identification. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 346–363.
  22. Sun, Y.; Zheng, L.; Yang, Y. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 480–496.
  23. Wang, G.; Yuan, Y.; Chen, X. Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 274–282.
  24. He, S.; Luo, H.; Wang, P. Transreid: Transformer-based object re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 15013–15022.
  25. Zhang, Z.; Lan, C.; Zeng, W. Relation-aware global attention for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3186–3195.
  26. Chen, B.; Deng, W.; Hu, J. Mixed high-order attention network for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 371–381.
  27. Vaswani, A.; Shazeer, N.; Parmar, N. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30.
  28. Galassi, A.; Lippi, M.; Torroni, P. Attention in natural language processing. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 4291–4308.
  29. Zhao, J.; Wang, H.; Zhou, Y. Spatial-Channel Enhanced Transformer for Visible-Infrared Person Re-Identification. IEEE Trans. Multimed. 2022.
  30. Khan, S.; Naseer, M.; Hayat, M. Transformers in vision: A survey. ACM Comput. Surv. 2022, 54, 1–41.
  31. Liu, Y.; Zhang, Y.; Wang, Y. A survey of visual transformers. arXiv 2021, arXiv:2111.06091.
  32. Zhang, S.; Zhang, Q.; Yang, Y. Person re-identification in aerial imagery. IEEE Trans. Multimed. 2020, 23, 281–291.
  33. Yang, W.; Huang, H.; Zhang, Z. Towards rich feature discovery with class activation maps augmentation for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1389–1398.
  34. Zhao, L.; Li, X.; Zhuang, Y. Deeply-learned part-aligned representations for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3219–3228.
  35. He, L.; Liang, J.; Li, H. Deep spatial feature reconstruction for partial person re-identification: Alignment-free approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7073–7082.
  36. Zhang, X.; Luo, H.; Fan, X. Alignedreid: Surpassing human-level performance in person re-identification. arXiv 2017, arXiv:1711.08184.
  37. Guo, M.H.; Xu, T.X.; Liu, J.J. Attention mechanisms in computer vision: A survey. Comput. Vis. Media 2022, 8, 331–368.
  38. Li, W.; Zhu, X.; Gong, S. Harmonious attention network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2285–2294.
  39. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
  40. Wu, B.; Xu, C.; Dai, X. Visual transformers: Token-based image representation and processing for computer vision. arXiv 2020, arXiv:2006.03677.
  41. Srinivas, A.; Lin, T.-Y.; Parmar, N. Bottleneck transformers for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16519–16529.
  42. Liu, Z.; Lin, Y.; Cao, Y. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022.
  43. Touvron, H.; Cord, M.; Douze, M. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 10347–10357.
  44. D’Ascoli, S.; Touvron, H.; Leavitt, M.L. Convit: Improving vision transformers with soft convolutional inductive biases. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 2286–2296.
  45. Zhang, Q.; Yang, Y.B. ResT: An efficient transformer for visual recognition. Adv. Neural Inf. Process. Syst. 2021, 34, 15475–15485.
  46. Woo, S.; Park, J.; Lee, J.Y. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
  47. Xu, Y.; Jiang, Z.; Men, A. Multi-view feature fusion for person re-identification. Knowl. Based Syst. 2021, 229, 107344.
  48. Zheng, L.; Shen, L.; Tian, L. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1116–1124.
  49. Zheng, Z.; Zheng, L.; Yang, Y. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3754–3762.
  50. Wang, X.; Doretto, G.; Sebastian, T. Shape and appearance context modeling. In Proceedings of the 2007 IEEE 11th International Conference on Computer Vision, Rio De Janeiro, Brazil, 14–21 October 2007; pp. 1–8.
  51. Miao, J.; Wu, Y.; Yang, Y. Identifying visible parts via pose estimation for occluded person re-identification. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 4624–4634.
  52. Luo, H.; Gu, Y.; Liao, X. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019.
Figure 1. Challenges in ReID tasks. (a) low resolution; (b) light variations; (c) change of posture; (d) change of camera view; (e) occlusion; (f) inaccurate detection.
Figure 2. The process diagram of the attention mechanism to enhance Transformer architecture.
Figure 3. The process of the two types of ReID models. (a) partial models’ process; (b) attention models’ process.
Figure 4. Framework of proposed AET-Net. SAEM changes the state of the spatial distribution of embedding vectors by extracting spatial attention maps (blue). MFTM optimizes the network by constructing multiple features and calculating multiple losses (yellow and gray).
Figure 5. The extraction framework diagram of spatial attention map and channel attention map. (a) The extraction framework of spatial attention map. (b) The extraction framework of channel attention map.
Figure 6. Ablation study results of the weight hyperparameters $\lambda$ and $\delta$. (a) Ablation study of $\lambda$ on Occluded-Duke. (b) Ablation study of $\delta$ on Occluded-Duke.
Figure 7. Visualization of models’ attention heat maps. (a) Input image, (b) CNN-based model, (c) ViT-based model, (d) SAEM, (e) MFTM, (f) SAEM + MFTM. The red regions represent a high level of attention by the model, and the blue regions indicate a low level of attention.
Table 1. Dataset information. ID denotes the total number of identities in the dataset; Image denotes the total number of images in the dataset.
Datasets | ID/Image | Train (ID/Image) | Query | Gallery
Market-1501 | 1501/32,668 | 751/12,936 | 750/3368 | 750/19,732
DukeMTMC-ReID | 1404/36,441 | 702/16,522 | 702/2228 | 1110/17,661
Occluded-Duke | 1404/35,489 | 702/15,618 | 519/2210 | 1110/17,661
Table 2. The performance comparison between our proposed model and state-of-the-art models on DukeMTMC and Market-1501.
Types | Models | DukeMTMC mAP/% | DukeMTMC Rank-1/% | Market-1501 mAP/% | Market-1501 Rank-1/%
Partial Models | Spindle [18] | - | - | - | 76.9
 | Part Aligned [34] | - | - | 63.4 | 81.0
 | PDC [19] | - | - | 63.4 | 84.1
 | DSR [35] | - | - | 64.2 | 83.6
 | MGCAM [16] | - | - | 74.3 | 83.8
 | Aligned Reid [36] | - | - | 79.3 | 91.8
 | PGFA [7] | 65.5 | 82.6 | 76.8 | 91.2
 | PGFA-PE [51] | 72.6 | 86.2 | 81.3 | 92.7
 | PCB [22] | 66.1 | 81.8 | 77.4 | 92.3
 | PCB+RPP [22] | 69.2 | 83.3 | 81.6 | 93.8
 | HOReID [2] | 75.6 | 86.9 | 84.9 | 94.2
 | FPR [9] | 78.4 | 88.6 | 86.6 | 95.4
 | MGN [23] | 78.4 | 88.7 | 86.9 | 95.7
Attention Models | HACNN [38] | 63.8 | 80.5 | 82.8 | 93.8
 | MHN-6 [26] | 77.2 | 89.1 | 85.0 | 95.1
 | CAM [33] | 72.9 | 85.8 | 84.5 | 94.7
 | PAT [8] | 78.2 | 88.8 | 88.0 | 95.4
 | DRL-Net [5] | 76.6 | 88.1 | 86.9 | 94.7
 | TransReID [24] | 79.3 | 88.8 | 86.8 | 94.7
Ours | AET-Net | 80.1 | 89.5 | 87.5 | 94.8
Table 3. The performance comparison between our proposed model and state-of-the-art models on the Occluded-Duke.
Types | Models | Occluded-Duke mAP/% | Occluded-Duke Rank-1/%
Partial Models | Part Aligned [34] | 20.2 | 28.8
 | DSR [35] | 30.4 | 40.8
 | Ad-Occluded [4] | 32.2 | 44.5
 | PGFA [7] | 37.3 | 51.4
 | PCB [22] | 42.6 | 33.7
 | HOReID [2] | 43.8 | 55.1
 | PGFA-PE [51] | 43.5 | 56.6
Attention Models | HACNN [38] | 26.0 | 34.4
 | DRL-Net [5] | 50.8 | 65.0
 | ISP [21] | 52.3 | 62.8
 | PAT [8] | 53.6 | 64.5
 | TransReID [24] | 53.1 | 60.5
Ours | AET-Net | 54.5 | 64.5
Table 4. Comparison results of Transformer architecture and CNN-based architecture.
Type | Models | Market-1501 mAP/% | Market-1501 Rank-1/% | Occluded-Duke mAP/% | Occluded-Duke Rank-1/%
CNN-Based | HOReID [2] | 84.9 | 94.2 | 43.8 | 55.1
 | HACNN [38] | 82.8 | 93.8 | 26.0 | 34.4
Transformer | ViT-based (baseline) | 86.8 | 94.7 | 53.1 | 60.5
Table 5. Results of the ablation study of AET-Net on the Market-1501 and Occluded-Duke datasets. λ|δ denotes the use of the attention-weight hyperparameter.
Index | SAEM | MFTM | λ|δ | Market-1501 mAP/% | Market-1501 Rank-1/% | Occluded-Duke mAP/% | Occluded-Duke Rank-1/%
1 |  |  |  | 86.8 | 94.7 | 53.1 | 60.5
2 | ✓ |  |  | 86.4 | 94.5 | 51.5 | 58.1
3 | ✓ |  | ✓ | 87.2 | 94.7 | 53.9 | 61.6
4 |  | ✓ |  | 87.0 | 94.1 | 52.3 | 59.9
5 |  | ✓ | ✓ | 87.5 | 94.8 | 54.4 | 62.4
6 | ✓ | ✓ |  | 87.4 | 94.7 | 53.5 | 61.2
7 | ✓ | ✓ | ✓ | 87.5 | 94.8 | 54.5 | 64.5
Table 6. Comparison of inference costs of different models. Baseline + S + M denotes the baseline with both modules used together.
Models | Inference Time | FLOPs | Parameters
ResNet50 [52] | 1x | 1x | 1x
Baseline | 0.4338x | 2.7105x | 3.6411x
TransReID + JPM | 0.4595x | 2.9401x | 3.9427x
Baseline + SAEM | 0.4212x | 2.7105x | 3.6411x
Baseline + MFTM | 0.3955x | 2.7106x | 3.6442x
Baseline + S + M | 0.4159x | 2.7106x | 3.6442x