Article

Transformer-Based Attention Network for Vehicle Re-Identification

1 Fujian Key Laboratory of Pattern Recognition and Image Understanding, Xiamen 361024, China
2 School of Computer and Information Engineering, Xiamen University of Technology, Xiamen 361024, China
* Author to whom correspondence should be addressed.
Electronics 2022, 11(7), 1016; https://doi.org/10.3390/electronics11071016
Submission received: 18 February 2022 / Revised: 11 March 2022 / Accepted: 12 March 2022 / Published: 24 March 2022

Abstract

Vehicle re-identification (ReID) focuses on searching for images of the same vehicle across different cameras and can be considered as the most fine-grained ID-level classification task. It is fundamentally challenging due to the significant differences in appearance presented by a vehicle with the same ID (especially from different viewpoints) coupled with the subtle differences between vehicles with different IDs. Spatial attention mechanisms that have been proven to be effective in computer vision tasks also play an important role in vehicle ReID. However, they often require expensive key-point labels or suffer from noisy attention masks when trained without key-point labels. In this work, we propose a transformer-based attention network (TAN) for learning spatial attention information and hence for facilitating the learning of discriminative features for vehicle ReID. Specifically, in contrast to previous studies that adopted a transformer network, we designed the attention network as an independent branch that can be flexibly utilized in various tasks. Moreover, we combined the TAN with two other branches: one to extract global features that define the image-level structures, and the other to extract the auxiliary side-attribute features that are invariant to viewpoint, such as color, car type, etc. To validate the proposed approach, experiments were conducted on two vehicle datasets (the VeRi-776 and VehicleID datasets) and a person dataset (Market-1501). The experimental results demonstrated that the proposed TAN is effective in improving the performance of both the vehicle and person ReID tasks, and the proposed method achieves state-of-the-art (SOTA) performance.

1. Introduction

Vehicle re-identification (ReID) aims to match vehicle images in a camera network. Recently, this task has drawn increasing attention due to its wide applications in many fields such as urban surveillance and traffic flow analysis, etc. While deep convolutional neural networks (CNNs) have shown remarkable performance in vehicle ReID in recent years [1,2], various challenges still remain and need further investigation. Among these challenges, the most significant is that a vehicle captured from different viewpoints usually has dramatically different visual appearances. As shown in Figure 1, vehicle images obtained from different perspectives have obviously different appearances. On the other hand, vehicles with different IDs are likely to have very similar appearances in many scenarios, as shown in Figure 2. Hence, vehicle ReID is a challenging classification task with high intra-class variation and high inter-class similarity. Therefore, learning discriminative features that can distinguish different IDs but are invariant to viewpoints will be of vital importance for vehicle ReID.
Many discriminative feature learning methods have been proposed for vehicle ReID, which can be roughly categorized into two types: part-based [3,4,5,6,7,8,9] and attention-based [10,11,12] methods.
The idea of part-based methods lies in the fact that the discriminative features that can effectively distinguish different IDs exist only in certain part regions. The key step for part-based methods is to detect the discriminative regions and extract these local features, explicitly or implicitly [3,4,5,6]. However, part-level annotations are expensive. Weakly supervised methods can be considered as an alternative way to overcome the lack of part-level annotations with only slight performance loss [7,8]. It should be noted that, in vehicle ReID, it is difficult to perform part-level annotation due to the high intra-class variation. Consequently, part-based methods that rely too heavily on regional features for fine-grained discrimination may fail when distinguishing vehicles with the same ID seen from different viewpoints. Therefore, many existing methods adopt the attention mechanism for discriminative feature extraction, detecting key points to obtain spatial information that guides discriminative feature learning.
The main idea of attention-based methods is to use key points that indicate the vehicle orientation as supervised information for learning discriminative features [13], and such methods have achieved high performance. The key points have a significant effect on the performance of attention-based methods [10,11]: ideally, the most informative key points are detected and utilized to guide discriminative feature learning, but this is very difficult to achieve by manual annotation or automated detection, and uninformative key points may lead to poor and unstable attention.
In this paper, we propose a transformer-based attention network (TAN) to facilitate the learning of discriminative features for vehicle ReID without key-point annotation. Transformers were first proposed in the natural language processing (NLP) field, where they significantly improved performance on 11 NLP tasks [14] owing to their ability to model very long-range dependencies and to enable better parallelism in sequence learning problems. The transformer has since been successfully customized for many computer vision tasks [15,16,17,18], showing promising performance and high potential in computer vision.
Inspired by these studies, in this paper we propose a TAN for vehicle ReID for learning the spatial attention features that have high discriminative power. In contrast to previous studies that adopted the transformer network as a substructure, we designed the attention network as an independent branch that can be flexibly utilized in various tasks. Moreover, we combined the attention features captured by TAN and the global features that define the image-level structures, as well as the auxiliary side-attribute features that are invariant to viewpoint, such as color, car type, etc. To better combine the attention features, the global features, and the side features, we combined the cross-entropy (CE) loss and the triplet loss for training. The integration of these features is capable of enhancing the robustness of the model.
We conducted experiments on both vehicle ReID and person ReID tasks, using the VeRi-776 and VehicleID datasets (for vehicle ReID), and the Market-1501 dataset (for person ReID). Experimental results showed that the proposed TAN could improve the performance significantly on both these tasks. The proposed method, combining three kinds of features, achieved state-of-the-art (SOTA) performance, demonstrating the effectiveness of the combination strategy.
In summary, the main contributions of this work are:
* A multi-branch deep learning framework is proposed for vehicle ReID, which contains three main modules. The global module is used to extract the global features for overall discrimination, the side module processes specific side information such as color and type information, and the attention module processes spatial attention features to enhance the discriminative ability of the model.
* The transformer is introduced as the structure of our attention branch. As far as we know, this is the first study to exploit the transformer network for the ReID problem.
* We evaluate our method on three large-scale ReID benchmark datasets and obtain state-of-the-art performance without using any viewpoint annotations.
The rest of this paper is organized as follows. Section 2 introduces related work with respect to vehicle re-identification. Section 3 describes the details of our proposed approach. Section 4 presents the experiments, and Section 5 addresses some qualitative analysis. Section 6 offers the concluding remarks.

2. Related Work

2.1. Object ReID

The ReID task is to search for one specific ID across different camera views. Vehicle ReID [7,11,19] and person ReID [20,21,22,23,24] are the two most popular tasks and have attracted increasing attention in recent years. Most ReID methods attempt to solve the problem in one of two ways: invariant feature extraction or distance metric learning. Invariant feature extraction methods [20,21,22,23,24,25] attempt to learn a more discriminative feature representation. Distance metric learning methods [19,26] compare two images by calculating the similarity of their feature vectors.

2.2. Vehicle ReID

In vehicle ReID, many studies have attempted to enhance performance through viewpoint-aware metric learning (VANet) [19], attribute information networks [27,28], generative adversarial networks (GANs) [8], graph networks (GNs) [29,30], horizontal and vertical segmentation networks [6,7], semantic parsing (SP) [13], and vehicle part detection (VPD) [3,4]. VANet [19] uses two branches to extract features from different viewpoints and proposes a viewpoint-aware metric learning method for re-identification. Regarding attribute-based information networks, the authors in [27,28] enhance model performance by capturing different types of local information through attributes. The GAN-based vehicle ReID method [8] generates cross-view images to enhance model robustness. The authors in [29,30] build extractors of spatial information to overcome viewpoint differences. In [6,7], horizontal and vertical segmentation networks are combined to enhance performance. PRND [3] and PGAN [4] exploit vehicle details and local regions, respectively, by detecting predefined regions (e.g., back mirrors, lights, wheels) and describing them with deep features. SAVER [8] uses a GAN to erase the vehicle details from the input image and then integrates this composite image with the input image, creating a version with visually enhanced details for ReID. Some studies attempt to deal with dramatic changes in viewpoint [13,29]. Meng et al. [13] use semantic parsing to obtain the different viewpoints of each vehicle and utilize a graph network to align the spatial relationships between them. These studies have contributed significantly to the development of vehicle re-identification, but they ignore the fact that attention mechanisms can enhance the discriminative performance of the model.
Other studies exploit spatial attention information [10,11,12] or use attention to enhance features [4,5,8,9]. Most previous attention-based studies either use segmented or detected orientation information to guide the generation of spatial attention [10,11], or use attention to enhance the generalization ability and hence the discriminative ability of the model [4,5,8]. In general, when attention is used as a guiding mechanism, the model usually relies on manually processed data labels and is not stable enough in detecting the manually labeled local areas or key points. When attention is used only as an enhancement plug-in module, it may lose some of the spatial characteristics and does not fully combine the discriminative features with the attention features. In this paper, we propose a transformer-based attention network (TAN) for extracting attention features and combine them with the other features to improve the robustness of the model.

2.3. Transformer-Based Attention

In recent years, the transformer has gained increasing attention in both natural language processing and computer vision. The vision transformer (ViT) [15] was the earliest work to apply the transformer structure to the visual classification task. Beyond basic image classification, transformers have also been used to solve various other computer vision problems, including object detection [16], semantic segmentation [17], image processing [18], and video understanding. ViT shows satisfactory performance on large-scale and more complex datasets. By cutting the image into patches, the transformer effectively captures global information through the sequence of patches, and the patches of different layers are connected to build global spatial features [31]. However, in ReID, spatial information is critical for feature learning [22] and has not been fully exploited. This inspired us to introduce the transformer for capturing global spatial information for ReID.
In this paper, we attempt to adopt transformer-based attention to better capture the long-distance dependent features and derive global information, as complementary information to that learned by a convolutional neural network (CNN), which focuses only on local characteristics. Moreover, we design the TAN as a single branch that can be combined with other branches, improving the flexibility of the model.

3. Approach

Our work adopted the multi-branch architecture, which is a popular design strategy for integrating different branches [5,7,20,24,27,28,32]. These approaches enable the network to attend to different features in individual branches, e.g., distinct spatial parts or channels. Whether used for pedestrian or vehicle re-identification tasks, multi-branch designs usually improve model performance and enhance model robustness.

3.1. Network Architecture

Similarly to most recent efforts, we used an end-to-end neural network architecture for image feature extraction, pre-trained on ImageNet [33]. In this subsection, we provide a detailed description of the structure and training of the proposed network. As illustrated in Figure 3, our network structure has three major branches: the global branch, the attention branch, and the side branch.
Let $X \in \mathbb{R}^{w \times h \times c}$ be an input image, where $w$, $h$, and $c$ are the width, the height, and the number of channels of the image, respectively. The image $X$ is first fed into the shared backbone network $F(\cdot)$, which consists of the layers of ResNet50 [34] up to the first unit of the third residual block (i.e., up to conv4_0). After passing through the shared layers $F(\cdot)$, the resulting tensor $\hat{X} \in \mathbb{R}^{\hat{w} \times \hat{h} \times \hat{c}}$ is passed into three distinct branches, where the global and side branches use the remaining layers of ResNet50 until the end, while the attention branch uses the modified transformer structure. Using the same shared output $\hat{X}$ as the input of every branch effectively ensures the consistency of the branch dimensions.
$$\hat{X} = F(X)$$
The global branch. In this branch, we obtain the following global representation. First, we obtain the 2048-dimensional features through the last two layers of ResNet50, $F_g(\hat{X})$, and then obtain the final feature vector $g$ through 2D global average pooling (GAP).
The side branch. This branch captures the auxiliary side-attribute features (such as color, car type, etc.) that are invariant to viewpoint and help to improve the ReID performance. The initial tensor $\hat{X} \in \mathbb{R}^{\hat{w} \times \hat{h} \times \hat{c}}$ is reduced to a 2048-dimensional vector that is mainly used to learn auxiliary attributes such as the model or color of the vehicle. Similarly to the global branch, the 2048-dimensional feature vector $g_{side}$ is obtained through the last two layers of ResNet50, $F_{side}(\hat{X})$, followed by 2D GAP. It is worth noting that, in training, the ID used by the cross-entropy loss of the side branch is the vehicle model or color ID.
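To make the branch layout concrete, the following is a minimal PyTorch sketch of the three-branch structure described above, under stated assumptions: the exact split point of the shared stem is approximated (here conv1 through layer2 are shared), and the class name, attention-branch placeholder, and torchvision weight flag are illustrative rather than taken from the authors' implementation.
```python
import copy

import torch.nn as nn
import torchvision


class TANetSketch(nn.Module):
    """Shared ResNet-50 stem followed by global, side, and attention branches."""

    def __init__(self, att_branch: nn.Module):
        super().__init__()
        # ImageNet-pre-trained backbone [33]; the weights flag assumes torchvision >= 0.13.
        resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        # Shared layers F(.) producing X^ for all three branches.
        self.shared = nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2,
        )
        # Global and side branches reuse the remaining ResNet-50 stages (2048-d output).
        self.global_branch = nn.Sequential(copy.deepcopy(resnet.layer3), copy.deepcopy(resnet.layer4))
        self.side_branch = nn.Sequential(copy.deepcopy(resnet.layer3), copy.deepcopy(resnet.layer4))
        # Attention branch: the modified transformer stack of Section 3.2.
        self.att_branch = att_branch
        self.gap = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        shared = self.shared(x)                                  # X^ = F(X)
        g = self.gap(self.global_branch(shared)).flatten(1)      # global feature g
        g_side = self.gap(self.side_branch(shared)).flatten(1)   # side-attribute feature g_side
        g_att = self.gap(self.att_branch(shared)).flatten(1)     # attention feature g_att
        return g, g_side, g_att


# Example instantiation with a placeholder attention branch:
# model = TANetSketch(att_branch=nn.Identity())
```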

3.2. Transformer Based Attention

In this paper, we introduce transformer-based attention into the ReID problem, as the attention branch shown in Figure 3. Inspired by the work in [35], which combined the transformer with a CNN for image classification and achieved promising results, we modified the multi-head self-attention (MHSA) structure for the ReID task. In Figure 4, the left subfigure is the design structure in BoTNet [35], and the right subfigure is our proposed structure, in which multi-layer perceptron (MLP) layers and normalization layers are added to the original BoTNet structure. The normalization is added to make the values passed between layers more stable, and the MLP is added before the final residual connection to increase the capacity for modeling the spatial sequence and improve the robustness of the model.
With this design, the initial tensor $\hat{X} \in \mathbb{R}^{\hat{w} \times \hat{h} \times \hat{c}}$ is fed into the transformer network: it passes through a stack of six transformer layers, $F_{att}(\hat{X})$, and the final feature vector $g_{att}$ is obtained through 2D global average pooling (GAP).
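As an illustration, a minimal PyTorch sketch of one such attention layer is given below; it follows the spirit of Figure 4 (MHSA over the flattened spatial feature map, with the added normalization and MLP sub-layers), while the head count, MLP width, feature dimension, and learned position embedding are assumptions rather than the exact configuration of the paper.
```python
import torch
import torch.nn as nn


class MHSALayer(nn.Module):
    """BoTNet-style multi-head self-attention block extended with norm and MLP sub-layers."""

    def __init__(self, dim=512, heads=4, mlp_ratio=2, height=32, width=32):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, height * width, dim))  # learned position embedding
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x):                                  # x: B x C x H x W feature map
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2) + self.pos      # B x (H*W) x C token sequence
        q = self.norm1(seq)
        attn_out, _ = self.attn(q, q, q)
        seq = seq + attn_out                               # residual around self-attention
        seq = seq + self.mlp(self.norm2(seq))              # residual around the added MLP
        return seq.transpose(1, 2).reshape(b, c, h, w)


# The attention branch F_att stacks LayerNum such layers (LayerNum = 6 in this paper).
attention_branch = nn.Sequential(*[MHSALayer() for _ in range(6)])
```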

3.3. Feature Embedding

After obtaining the feature representation vectors $g$, $g_{side}$, and $g_{att}$, we connect the different vectors linearly in different metric spaces. The embedding feature vector obtained before the batch normalization layer is used to optimize the triplet loss [36], while the classification vectors $\hat{g}$, $\hat{g}_{side}$, and $\hat{g}_{att}$ acquired through the fully connected layer are used to optimize the classification loss (i.e., the cross-entropy (CE) loss). The embedding feature vector obtained after batch normalization but before the fully connected layer strikes a balance between the representations of the two different metric spaces and is therefore used for inference. The vectors $g$ and $g_{side}$ calculate the cross-entropy loss with different targets: the target label for $g$ is the ID label of the vehicle, while the target for $g_{side}$ is the attribute or type label of the vehicle. However, $g$ and $g_{att}$ share the same target ID when calculating the CE loss. From the resulting embedding vectors we form two sets, given by
$$L := \{\hat{g}, \hat{g}_{side}, \hat{g}_{att}\}$$
$$R := \{g, g_{side}, g_{att}\}$$
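A minimal sketch of this embedding head, as we read it from the description above, is shown below; the feature dimension and class count are placeholders, and the use of a BatchNorm1d + Linear pair per branch is an assumption about the implementation.
```python
import torch.nn as nn


class EmbeddingHead(nn.Module):
    """Per-branch embedding head: raw feature for the triplet loss, BN feature for
    inference, and FC logits for the cross-entropy loss."""

    def __init__(self, dim=2048, num_classes=776):
        super().__init__()
        self.bn = nn.BatchNorm1d(dim)
        self.fc = nn.Linear(dim, num_classes, bias=False)

    def forward(self, g):
        g_bn = self.bn(g)        # inference feature (balances the two metric spaces)
        logits = self.fc(g_bn)   # classification vector g^ (element of L) for the CE loss
        return g, g_bn, logits   # raw g (element of R) is used by the triplet loss
```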

3.4. Training and Loss Functions

For training, we used a combination of the CE loss and the triplet loss [36]. The latter is designed to pull the anchor close to the positive sample and push it away from the negative sample within each (anchor, positive, negative) triplet. The triplet loss $L_{Tri}$ is computed on the embeddings in $R$ obtained before batch normalization, and can be written as:
$$L_{Tri} = \max(\hat{D}_{ap} - \hat{D}_{an} + \gamma, 0)$$
where $\hat{D}_{ap}$ and $\hat{D}_{an}$ denote the distances from the anchor to the positive sample and to the negative sample in the triplet, respectively, and $\gamma$ is the minimum margin required between the two distances. The CE loss $L_{CE}$, computed on all elements of $L$ after applying softmax activation to the output of the fully connected layer, can be written as:
$$L(x, y) = -\left[ y \cdot \log(x) + (1 - y) \cdot \log(1 - x) \right]$$
$$L_{CE}(f(X), y) = \frac{1}{n} \sum_{i=1}^{n} L(f(X)_i, y_i)$$
where $n$ is the size of the training batch and $f(X)$ is the output of our network for the input $X$. For the CE loss $L_{CE}$, the target IDs all come from the standard datasets, which are publicly labeled. Thus, the overall objective loss function is
$$L = \alpha L_{CE} + \beta L_{Tri}$$
where $\alpha$ and $\beta$ are suitable weights. Additionally, we used random erasing augmentation (REA) [37], which randomly substitutes a rectangle of the image with the image's mean value, to improve model generalization and to produce higher variance in the training data. Cosine annealing strategies are common in PReID and VReID networks [21,38]. To further boost performance, we used warm-up cosine annealing [39] as our learning rate strategy: the learning rate first grows linearly from $1 \times 10^{-4}$ to $1 \times 10^{-3}$ over the first 10 epochs and then follows a cosine decay toward $5 \times 10^{-4}$ in the remaining epochs. The learning rate $lr(t)$ at epoch $t$, with $T$ total epochs, is given by
$$lr(t) = \begin{cases} 1 \times 10^{-3} \times \dfrac{t}{10}, & t \le 10 \\ 1 \times 10^{-3} \times \dfrac{1}{2}\left(1 + \cos\left(\pi \dfrac{t - 10}{T - 10}\right)\right), & 10 < t \le T \end{cases}$$
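A hedged sketch of the combined objective and of this learning-rate schedule is given below; PyTorch's TripletMarginLoss stands in for the triplet loss, and the margin value and the way triplets are mined are assumptions not specified by the equations above.
```python
import math

import torch.nn as nn

ce_loss = nn.CrossEntropyLoss()
triplet_loss = nn.TripletMarginLoss(margin=0.3)   # gamma: assumed margin value


def total_loss(logits_list, labels_list, triplets, alpha=0.5, beta=0.5):
    """logits_list/labels_list: classification vectors in L with their target IDs;
    triplets: (anchor, positive, negative) embeddings drawn from R."""
    l_ce = sum(ce_loss(z, y) for z, y in zip(logits_list, labels_list))
    l_tri = sum(triplet_loss(a, p, n) for a, p, n in triplets)
    return alpha * l_ce + beta * l_tri


def warmup_cosine_lr(t, T, base_lr=1e-3):
    """Learning rate at epoch t (1-indexed) out of T epochs: linear warm-up from
    1e-4 to 1e-3 over the first 10 epochs, then cosine decay."""
    if t <= 10:
        return base_lr * t / 10
    return base_lr * 0.5 * (1 + math.cos(math.pi * (t - 10) / (T - 10)))
```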

3.5. Feature Fusion for Inference

In order to effectively integrate the global features and the side features, we calculated the average of the $L_2$-normalized features as the final discriminative feature for inference:
$$X = \frac{1}{2}\left(\frac{g}{\|g\|_2} + \frac{g_{side}}{\|g_{side}\|_2}\right)$$
Such a feature fusion involves a normalization operation and an averaging operation, and the global feature $g$ and the side feature $g_{side}$ are mutually complementary from the perspective of feature fusion. It should be noted that the attention feature is not a specific vehicle feature or side-information feature, but a feature that acts on the global spatial information; it is therefore stacked with the discriminative features to expand the dimension of the discriminative representation. This fused feature is the one used in the testing phase, as shown in Figure 5.
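The following is a minimal sketch of this inference step under our reading of the text: the global and side features are L2-normalized and averaged, the attention feature is concatenated as the "stacking" operation (an assumption, including normalizing it first), and the gallery is ranked by Euclidean distance to the query, as in Figure 5.
```python
import torch
import torch.nn.functional as F


def fuse_features(g, g_side, g_att):
    # Average of L2-normalized global and side features, then stack the attention feature.
    fused = 0.5 * (F.normalize(g, dim=1) + F.normalize(g_side, dim=1))
    return torch.cat([fused, F.normalize(g_att, dim=1)], dim=1)


def rank_gallery(query_feats, gallery_feats):
    # Q x G Euclidean distance matrix, then ranked gallery indices per query (cf. Figure 5).
    dist = torch.cdist(query_feats, gallery_feats)
    return dist.argsort(dim=1)
```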

4. Experiments

4.1. Datasets

We conducted extensive experiments on two public large-scale benchmarks for vehicle ReID (i.e., VeRi-776 [1] and VehicleID [40]) and three datasets for person ReID (i.e., Market-1501 [41], DukeMTMC [42], and CUHK03 [43]). All datasets except VehicleID provide camera ID for each image, while only the VeRi-776 dataset provides viewpoint labels for each image. The statistics of the five datasets are summarized in Table 1, and the vehicle attributes for two datasets are summarized in Table 2.
VeRi-776 [1] is a public vehicle dataset which consists of 49,357 images of 776 distinct vehicles that were captured with 20 non-overlapping cameras in a variety of orientations and lighting conditions. We followed the original protocol to retrieve queries in an image-to-track fashion, where queries and the correct gallery samples must be captured from different cameras.
VehicleID [40] is a widely-used vehicle ReID dataset which contains vehicle images captured in the daytime by multiple cameras. There are a total of 221,763 images with 26,267 identities, where each vehicle has either a front or rear view. The training set contains 13,134 identities, while the test set has 13,133 identities. The test set is further divided into three subsets with different sizes: a small subset (800 vehicles and 7332 images) denoted VehicleID 800, a medium subset (1600 vehicles and 12,995 images) denoted VehicleID 1600, and a large subset (2400 vehicles and 20,038 images) denoted VehicleID 2400.
Market-1501 [41] is a commonly used person ReID dataset that contains 1501 identities observed under 6 camera viewpoints, 19,732 gallery images and 12,936 training images.
DukeMTMC [42] contains 1404 pedestrian identities observed under 8 camera viewpoints, 17,661 gallery images and 16,522 training images.
CUHK03 [43] contains 13,164 images of 1467 identities from 2 cameras.

4.2. Evaluation Metrics

The mean average precision (mAP) and the cumulative match curve (CMC) at rank-1 and rank-5 were employed to evaluate the performance of our proposed method (denoted TANet in this paper). Each query image in the test set was matched against the remaining test images. The average precision for each query $q_i$ was calculated as:
$$AP(q_i) = \frac{\sum_{k=1}^{n} P(k) \times rel(k)}{N_{c_i}}$$
where $P(k)$ denotes the precision at the $k$-th position of the results, $rel(k)$ is an indicator function equal to 1 if the $k$-th result is correctly matched and 0 otherwise, $n$ is the number of returned results, $i$ indexes the query image, and $N_{c_i}$ is the number of images in the gallery that belong to the class $C$ of the $i$-th query. After evaluating every query image, the mAP was calculated as follows:
$$mAP = \frac{\sum_{q=1}^{Q} AP(q)}{Q}$$
where Q is the number of all queries.
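A small sketch of this computation for one ranked result list is given below; it is a generic AP/mAP implementation consistent with the definitions above, not the evaluation code used by the authors.
```python
import numpy as np


def average_precision(matches):
    """matches: binary array over the ranked results, 1 where the ID matches the query."""
    matches = np.asarray(matches, dtype=float)
    if matches.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(matches) / (np.arange(len(matches)) + 1)  # P(k)
    return float((precision_at_k * matches).sum() / matches.sum())       # sum where rel(k)=1, divided by N_c_i


def mean_average_precision(all_matches):
    """Mean of AP over all Q queries."""
    return float(np.mean([average_precision(m) for m in all_matches]))
```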

4.3. Implementation Details

The proposed method was implemented and trained in PyTorch [44]. For training, the input images were normalized to channel-wise zero mean and unit standard deviation, with a spatial resolution of $256 \times 256$. Data augmentation was performed by resizing images to 105% of their width and height followed by random cropping, as well as random horizontal flipping with a probability of 0.5. Models were trained for 150 epochs on VeRi-776 and 260 epochs on VehicleID, with a batch size of 64; each batch consists of eight identities with eight samples per identity. The parameters were optimized using the Adam optimizer [45], with $\epsilon = 1 \times 10^{-8}$, $\lambda_1 = 0.9$, and $\lambda_2 = 0.999$. The backbones were pre-trained on ImageNet [33]. As shown in Table 3, the best results were obtained when the loss factors were set to $\alpha = \beta = 0.5$.
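As a simplified illustration of this batch construction (8 identities × 8 samples) and the reported Adam settings, a sketch of an identity-balanced sampler is shown below; it is not the authors' data pipeline, and the with-replacement sampling for identities with few images is an assumption.
```python
import random
from collections import defaultdict


def pk_batches(labels, p=8, k=8):
    """Yield index batches containing p identities with k samples each
    (sampling with replacement when an identity has fewer than k images)."""
    by_id = defaultdict(list)
    for idx, pid in enumerate(labels):
        by_id[pid].append(idx)
    ids = list(by_id)
    while True:
        batch = []
        for pid in random.sample(ids, p):
            batch.extend(random.choices(by_id[pid], k=k))
        yield batch


# Optimizer settings reported above (the model is assumed to be defined elsewhere):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)
```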
The Euclidean distance was utilized to compute the CMC. Query and gallery images were resized to $256 \times 256$ pixels and normalized. For a fair comparison with other existing methods, the CMC rank-1 accuracy (r1) and the mean average precision (mAP) are reported as the evaluation metrics.

4.4. Ablation Study

In this section, we evaluate the effects of some key parameters (such as the number of layers, different branches, etc.) on the performance of the model.
The effect of the number of transformer layers. The number of transformer layers (denoted MHSA layerNum in this paper) may affect the performance of the feature extraction and hence the performance of the whole system. We evaluated the performance of the proposed method with different LayerNum values, i.e., 4, 6, and 8. The results are shown in Table 4. As seen in the results, the performance does not differ much with different LayerNum values. Since mAP was highest when LayerNum was 6, we set the LayerNum as 6 in the following experiments.
The effect of each branch. When the neural network architecture uses a multi-branch structure, the interactions between individual branches increase substantially, so any additional branch must be well justified. The experimental results are listed in Table 5, from which we can see that the attention module improves the performance significantly (from 74.7% to 79.5% mAP on VeRi-776 and from 76.4% to 87.0% on VehicleID 800) and the side module improves it further (from 79.5% to 80.5% on VeRi-776 and from 87.0% to 88.2% on VehicleID 800), demonstrating the effectiveness of the proposed transformer-based attention module and the side module.
The effect of the image size. We also evaluated the effect of the image size, as shown in Table 6. Since vehicles are rigid (steel) structures, the aspect ratio of the input images is fixed at 1:1. It can be seen from the results that the best performance is achieved when the image size is set to $256 \times 256$; hence, in the following experiments, we set the default image size to $256 \times 256$.

4.5. Cross-Domain Dataset Testing

To validate the robustness of the proposed TANet method, we also conducted experiments on the person ReID task using Market-1501, although the characteristic attributes of PReID and VReID are different. The results are shown in Table 7. Compared with PCB [22], our proposed method is 2% better in mAP but 1.3% lower in CMC1, because our model tends to focus on the overall information and cannot accurately extract subtle local differences. OSNet [21] is a light-weight multi-scale network for PReID; compared with OSNet, our method is 2.6% better in mAP but 1.1% lower in CMC1. ABDNet [24] adopts a two-branch structure and introduces a variety of attention mechanisms to focus on more details and effectively integrate them, so its performance is excellent. CNet [25] has better spatial learning methods for PReID, and its mAP is similar to that of our method, which verifies the robustness of the proposed method.
In order to better prove that our attention branch is universal, we also integrated the proposed transformer-based attention branch into other popular PReID models such as MGN [20], using three PReID datasets. The results are shown in Table 8. We can see that both mAP and CMC1 increase by 1–2 percent in the three datasets, and the increase is more obvious for the CUHK03 [43] dataset.
Moreover, the performance of MGN combined with the attention module of our model is close to the latest SOTA performance.
In addition, we conducted classification tests on the CIFAR-10 and CIFAR-100 datasets, fusing our attention module with ResNet50 as the backbone. The results are listed in Table 9. From the results, it can be observed that the classification performance is significantly improved after fusing our attention module, without using the pre-training model. This demonstrates that our framework has good expansibility and can be easily integrated with different frameworks for different tasks.

4.6. Comparison with the State of the Art

In this section, we compare the proposed TANet with state-of-the-art approaches for the three benchmarks.
For the VeRi-776 dataset, the results are shown in Table 10. Our method achieved 80.5% and 95.4% in mAP and CMC1, respectively. The mAP is 0.64 percentage points lower than that of MCRL + SL [26] but 0.32 and 1 percentage points higher than those of HPGN [30] and PVEN [13], respectively. Compared with SAVER [8], which learns instance-specific discriminative features but ignores extreme viewpoint changes, our proposed method achieved similar performance on VeRi-776. Our method outperforms VANet [19] by 14.2% in mAP and 5.6% in CMC1. This may stem from the fact that VANet [19] uses limited viewpoints on VeRi-776, and two instances with similar viewpoints near the viewpoint boundary may be wrongly divided into different viewpoints. Compared with HPGN [30], which adopts a graph network structure for ReID, our method performs better on VeRi-776. The proposed TANet also outperforms some typical methods that use extra data annotation, such as PRND [3], PVEN [13], and HPGN [30]. Compared with these methods, our approach shows excellent performance without any information other than the identity information, which verifies the effectiveness of fusing distinguishing features and attention features in the proposed method.
For the VehicleID dataset, the results are shown in Table 11. When the test set is small, the performance of our model is not outstanding, but it becomes relatively stronger as the set size increases. CMC1 accuracies of 82.9%, 81.5%, and 79.6% are achieved for the small, medium, and large test sets, respectively, and the mAP reaches 88.2%, 87.0%, and 85.9%. Because our method uses the transformer in a separate attention branch, it handles changes in set size well: when the set size is large, the transformer attention branch can capture more detailed features, effectively enhancing the performance of the model and achieving state-of-the-art performance. Compared with MCRL + SL [26] and VANet [19], although our method did not perform as well on the small and medium sets, its performance on the large set was comparable, indicating that the proposed TANet copes better with more and larger data. Based on the performance on the above two datasets, the overall performance of our method is excellent.

5. Qualitative Analysis

In this section, we offer an insight into how the three modules improve ReID performance and visually compare the learning content of some branches, showing the impact of each module in the learning process.

5.1. Retrieval with Different Queries

In this section, we visualize the experimental results on VeRi-776, as shown in Figure 6. The left column shows query images, while the images on the right-hand side are the top five results obtained by the proposed method. The global feature is good at retrieving images from the same viewpoint, while the proposed attention module enables the model to capture spatial attention features and hence retrieve the same object from different perspectives (e.g., the second row in Figure 6). In summary, our approach retrieves correct results across different perspectives.

5.2. Visualization of the Rank List

To show the discriminative ability of the model, we selected the top five retrieved images, visualized in descending order of similarity, which gives a more comprehensive picture of the retrieval results.
Baseline and TANet retrieval results were compared on VeRi-776. It can be seen from Figure 7 that the images with a red border are errors, while the other images are correct results. Our proposed TANet can find the same vehicle from different angles/perspectives (e.g., the first and the second examples). Our method is not always effective when dealing with extreme situations, as shown in Figure 8, where the positive sample and negative sample in the gallery have the same brand, the same model, the same color, and the same viewpoint. In general situations, the overall performance of our method is reliable, and compared with the results of the baseline query, the query accuracy of our method is significantly improved.

5.3. Activation Map Visualization

In order to verify the effectiveness of the respective branches of the proposed method, we visualized the three branches separately. Examples are shown in Figure 9. The global branch focuses on the more obvious features such as the rear of the car, the front face, the lights, etc., and the attention branch enhances the areas that the global branch pays attention to. The heat-value distribution of the side branch indicates the orientation of the car body in space. These visualizations show that our proposed method effectively combines the three branches to fuse distinguishing features and attention features for discrimination.

6. Conclusions

In this paper, we proposed a transformer-based attention method that uses attention to learn spatial features and enhance the discriminative ability of the model. To capture attention over the spatial structure, a transformer was used as our attention branch, so our method combines the transformer with a CNN. The method is composed of three branches: the global branch, the attention branch, and the side branch, and the experiments showed that each branch contributes differently to the results. The key component of TANet is the attention branch, which we also extracted separately and ported into different combined networks to demonstrate its efficiency and generality. The TANet method does not use additional data annotations, and to prove the versatility of the method we conducted experiments on VReID and PReID datasets, achieving excellent results.
In the future, there is hope that re-identification techniques will evolve from 2D image searches into matching 2D targets from real-time surveillance videos or 3D models. However, much research is needed to realize this vision, especially to address the issues of retrieval speed and extreme spatial perspectives.

Author Contributions

Methodology, writing—original draft, writing—review and editing, J.L.; Supervision, writing—review and editing, D.W.; Supervision, S.Z. and Y.W.; Investigation, validation, C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Acknowledgments

This work was supported by the Natural Science Foundation of China (No. 61773325 and 61806173), the Industry–University Cooperation Project of Fujian Science and Technology Department (No. 2021H6035), the Natural Science Foundation of Fujian Province (No. 2021J011191), the Joint Funds of the 5th Round of Health and Education Research Program of Fujian Province (No. 2019-WJ-41), the Science and Technology Planning Project of Fujian Province (No. 2020H0023), and the Young Teacher Education Research Project of Fujian Province (No. JT180435).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liu, X.; Liu, W.; Mei, T.; Ma, H. A Deep Learning-Based Approach to Progressive Vehicle Re-identification for Urban Surveillance. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016. [Google Scholar]
  2. Ma, X.; Boukerche, A. An Efficient Real-Time Vehicle Re-Identification Scheme Using Urban Surveillance Videos. In Proceedings of the ICC 2021-IEEE International Conference on Communications, Montreal, QC, Canada, 14–23 June 2021; pp. 1–6. [Google Scholar]
  3. He, B.; Li, J.; Zhao, Y.; Tian, Y. Part-Regularized Near-Duplicate Vehicle Re-Identification. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  4. Zhang, X.; Zhang, R.; Cao, J.; Gong, D.; You, M.; Shen, C. Part-Guided Attention Learning for Vehicle Re-Identification. arXiv 2019, arXiv:1909.06023. [Google Scholar]
  5. Tumrani, S.; Deng, Z.; Lin, H.; Shao, J. Partial attention and multi-attribute learning for vehicle re-identification. Pattern Recognit. Lett. 2020, 138, 290–297. [Google Scholar]
  6. Wang, H.; Peng, J.; Jiang, G.; Xu, F.; Fu, X. Discriminative Feature and Dictionary Learning with Part-aware Model for Vehicle Re-identification. Neurocomputing 2021, 438, 55–62. [Google Scholar]
  7. Liu, X.; Zhang, S.; Huang, Q.; Gao, W. RAM: A Region-Aware Deep Model for Vehicle Re-Identification. In Proceedings of the 2018 IEEE International Conference on Multimedia and Expo (ICME), San Diego, CA, USA, 23–27 July 2018; pp. 1–6. [Google Scholar] [CrossRef] [Green Version]
  8. Khorramshahi, P.; Peri, N.; Chen, J.C.; Chellappa, R. The Devil is in the Details: Self-Supervised Attention for Vehicle Re-Identification. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  9. Peng, J.; Wang, H.; Zhao, T.; Fu, X. Learning multi-region features for vehicle re-identification with context-based ranking method. Neurocomputing 2019, 359, 427–437. [Google Scholar]
  10. Khorramshahi, P.; Kumar, A.; Peri, N.; Rambhatla, S.S.; Chen, J.C.; Chellappa, R. A Dual-Path Model with Adaptive Attention for Vehicle Re-Identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27–28 October 2019. [Google Scholar]
  11. Wang, Z.; Tang, L.; Liu, X.; Yao, Z.; Yi, S.; Shao, J.; Yan, J.; Wang, S.; Li, H.; Wang, X. Orientation Invariant Feature Embedding and Spatial Temporal Regularization for Vehicle Re-Identification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  12. Ma, X.; Zhu, K.; Guo, H.; Wang, J.; Huang, M.; Miao, Q. Vehicle re-identification with refined part model. In Proceedings of the 2019 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Shanghai, China, 8–12 July 2019; pp. 603–606. [Google Scholar]
  13. Meng, D.; Li, L.; Liu, X.; Li, Y.; Yang, S.; Zha, Z.J.; Gao, X.; Wang, S.; Huang, Q. Parsing-based View-aware Embedding Network for Vehicle Re-Identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  14. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  15. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  16. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. arXiv 2020, arXiv:2005.12872. [Google Scholar]
  17. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Zhang, L. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. arXiv 2020, arXiv:2012.15840. [Google Scholar]
  18. Chen, H.; Wang, Y.; Guo, T.; Xu, C.; Gao, W. Pre-Trained Image Processing Transformer. arXiv 2020, arXiv:2012.00364. [Google Scholar]
  19. Chu, R.; Sun, Y.; Li, Y.; Liu, Z.; Wei, Y. Vehicle Re-Identification with Viewpoint-Aware Metric Learning. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27–28 October 2019. [Google Scholar]
  20. Wang, G.; Yuan, Y.; Chen, X.; Li, J.; Zhou, X. Learning Discriminative Features with Multiple Granularities for Person Re-Identification. arXiv 2018, arXiv:1804.01438. [Google Scholar]
  21. Zhou, K.; Yang, Y.; Cavallaro, A.; Xiang, T. Omni-Scale Feature Learning for Person Re-Identification. arXiv 2019, arXiv:1905.00953. [Google Scholar]
  22. Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q.; Wang, S. Beyond Part Models: Person Retrieval with Refined Part Pooling (and A Strong Convolutional Baseline); Springer: Cham, Switzerland, 2017. [Google Scholar]
  23. Hao, L. Bags of Tricks and A Strong Baseline for Deep Person Re-identification. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–17 June 2019. [Google Scholar]
  24. Chen, T.; Ding, S.; Xie, J.; Yuan, Y.; Chen, W.; Yang, Y.; Ren, Z.; Wang, Z. ABD-Net: Attentive but Diverse Person Re-Identification. arXiv 2019, arXiv:1908.01114. [Google Scholar]
  25. Li, H.; Wu, G.; Zheng, W.S. Combined Depth Space based Architecture Search For Person Re-identification. arXiv 2021, arXiv:2104.04163. [Google Scholar]
  26. Jin, Y.; Li, C.; Li, Y.; Peng, P.; Giannopoulos, G.A. Model Latent Views With Multi-Center Metric Learning for Vehicle Re-Identification. IEEE Trans. Intell. Transp. Syst. 2021, 22, 1919–1931. [Google Scholar]
  27. Qian, J.; Jiang, W.; Luo, H.; Yu, H. Stripe-based and attribute-aware network: A two-branch deep model for vehicle re-identification. Meas. Sci. Technol. 2020, 31, 095401. [Google Scholar]
  28. Quispe, R.; Lan, C.; Zeng, W.; Pedrini, H. AttributeNet: Attribute Enhanced Vehicle Re-Identification. Neurocomputing 2021, 465, 84–92. [Google Scholar]
  29. Liu, X.; Liu, W.; Zheng, J.; Yan, C.; Mei, T. Beyond the Parts: Learning Multi-view Cross-part Correlation for Vehicle Re-identification. In Proceedings of the MM ’20: The 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020. [Google Scholar]
  30. Shen, F.; Zhu, J.; Zhu, X.; Xie, Y.; Huang, J. Exploring Spatial Significance via Hybrid Pyramidal Graph Network for Vehicle Re-identification. arXiv 2020, arXiv:2005.14684. [Google Scholar]
  31. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; pp. 5998–6008. [Google Scholar]
  32. Chen, H.; Lagadec, B.; Bremond, F. Learning Discriminative and Generalizable Representations by Spatial-Channel Partition for Person Re-Identification. In Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass, CO, USA, 1–5 March 2020. [Google Scholar]
  33. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar]
  34. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  35. Srinivas, A.; Lin, T.Y.; Parmar, N.; Shlens, J.; Vaswani, A. Bottleneck Transformers for Visual Recognition. arXiv 2021, arXiv:2101.11605. [Google Scholar]
  36. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  37. Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; Yang, Y. Random Erasing Data Augmentation. arXiv 2017, arXiv:1708.04896. [Google Scholar]
  38. Zhu, X.; Luo, Z.; Fu, P.; Ji, X. VOC-ReID: Vehicle Re-identification based on Vehicle-Orientation-Camera. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  39. He, T.; Zhang, Z.; Zhang, H.; Zhang, Z.; Li, M. Bag of Tricks for Image Classification with Convolutional Neural Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  40. Liu, H.; Tian, Y.; Wang, Y.; Pang, L.; Huang, T. Deep Relative Distance Learning: Tell the Difference between Similar Vehicles. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  41. Zheng, L.; Shen, L.; Lu, T.; Wang, S.; Qi, T. Scalable Person Re-identification: A Benchmark. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
  42. Ristani, E.; Solera, F.; Zou, R.S.; Cucchiara, R.; Tomasi, C. Performance Measures and a Data Set for Multi-Target, Multi-Camera Tracking. In European Conference on Computer Vision; Springer Science+Business Media: Berlin/Heidelberg, Germany, 2016. [Google Scholar]
  43. Li, W.; Zhao, R.; Xiao, T.; Wang, X. DeepReID: Deep Filter Pairing Neural Network for Person Re-identification. In Proceedings of the Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  44. Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic Differentiation in Pytorch. 2017. Available online: https://openreview.net/pdf?id=BJJsrmfCZ (accessed on 6 February 2022).
  45. Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
Figure 1. Illustration of vehicles from VeRi-776. The appearances of the same vehicle are significantly different from different viewpoints. Moreover, the color of the vehicle may shift with different lighting.
Figure 2. Illustration of vehicles from VeRi-776. Two different vehicles with the same car model and color may look very similar from the same viewpoint.
Figure 3. The overall structure of our proposed model. The first three layers of ResNet50 are shared among the branches, with shared parameters. The model then consists of three modules. The global feature module is mainly used to extract global features using layers 3 and 4 of ResNet50. The attention module is the improved BoTNet transformer structure used to obtain the attention features. The side module is used to learn auxiliary features such as cameras, models, color attributes, local features, etc.; this module is flexible. GAP in the figure means global average pooling. In the training phase, the fully connected layer embeds spatial features for the cross-entropy loss, and the feature map before the fully connected layer uses the triplet loss for metric learning.
Figure 4. The left subfigure is the original transformer structure in BoTNet, and the right subfigure is the proposed transformer structure in this paper, produced by adding the MLP structure to BoTNet.
Figure 5. In the testing phase, we input the query image and gallery image to obtain their feature matrices separately. The distance matrix is generated using the feature matrix calculation, and the index of the retrieved images is returned according to the final score sorting.
Figure 6. From the top 5 in the visualized rank list, five different objects were retrieved and different results were obtained. The first column is the input query object, and the next 5 columns are the top 5 results. The visual results are all positive samples.
Figure 7. Visualization of ranking list on vehicle ReID task. The images in the first column are the query images. The rest of the images are the retrieved 5 top-ranking results.
Figure 8. In the visualized rank list, results with a poor classification effect in this category are affected by light and viewing angle.
Figure 9. Visualization for the activation heat maps of each branch on the VeRi-776 dataset. The first column shows the query images, the second and third columns are the global branches, the fourth and fifth columns are the attention branches, and the last two columns are the side branches.
Table 1. Statistics of datasets used in the paper.
Dataset | Object | ID | Image | Cam | View
Market-1501 | Person | 1501 | 32,668 | 6 | -
DukeMTMC | Person | 1404 | 34,183 | 8 | -
CUHK03 | Person | 1467 | 13,164 | 2 | -
VeRi-776 | Vehicle | 776 | 49,357 | 20 | 8
VehicleID | Vehicle | 26,328 | 221,567 | - | 2
Table 2. Statistics of vehicle attributes used in the paper.
Dataset | Object | Color | Type | Model
VeRi-776 | Vehicle | 10 | 9 | -
VehicleID | Vehicle | 6 | - | 250
Table 3. Loss factors α and β, verified on the VeRi-776 dataset.
Method | α | β | mAP | Rank-1
TANet | 0.3 | 0.7 | 78.6 | 95.1
TANet | 0.4 | 0.6 | 79.3 | 95.1
TANet | 0.5 | 0.5 | 80.5 | 95.4
TANet | 0.6 | 0.4 | 78.2 | 95.4
TANet | 0.7 | 0.3 | 78.7 | 95.4
Table 4. Verification of the ablation experiment of Num layers on the VeRi-776 dataset.
Method | Size | Num Layer | mAP | CMC1
TANet | 256 × 256 | 4 | 80.1 | 95.4
TANet | 256 × 256 | 6 | 80.5 | 95.4
TANet | 256 × 256 | 8 | 80.2 | 95.5
Table 5. Verification of each module on the VeRi-776 and VehicleID 800 datasets.
Size | Main | Attention | Side | VeRi-776 mAP | VeRi-776 CMC1 | VehicleID 800 mAP | VehicleID 800 CMC1
256 × 256 | ✓ | - | - | 74.7 | 94.8 | 76.4 | 69.1
256 × 256 | - | ✓ | - | 73.6 | 94.1 | 75.5 | 68.6
256 × 256 | - | - | ✓ | 74.3 | 94.1 | 78.0 | 71.4
256 × 256 | ✓ | ✓ | - | 79.5 | 95.2 | 87.0 | 81.5
256 × 256 | ✓ | - | ✓ | 78.0 | 94.9 | 85.9 | 79.6
256 × 256 | - | ✓ | ✓ | 79.2 | 95.1 | 86.3 | 81.7
256 × 256 | ✓ | ✓ | ✓ | 80.5 | 95.4 | 88.2 | 82.9
Table 6. Verification of the effect of image size on the VeRi-776 dataset.
Method | Size | Num Layer | mAP | CMC1
TANet | 224 × 224 | 6 | 79.9 | 95.5
TANet | 256 × 256 | 6 | 80.5 | 95.4
TANet | 384 × 384 | 6 | 79.4 | 95.2
Table 7. The mAP and CMC1 on Market1501.
Method | mAP | CMC1
OSNET [21] | 81.0 | 93.6
MGN [20] | 85.65 | 94.48
ABDNet [24] | 88.3 | 95.6
CBN [17] | 42.9 | 72.8
CNet [25] | 83.5 | 93.6
PCB+RPP [22] | 81.6 | 93.8
TANet | 83.6 | 92.5
Table 8. The mAP and CMC1 on PReID datasets with the MGN method.
Method | mAP | CMC1 | CMC5 | Dataset
MGN [20] | 85.65 | 94.48 | 98.28 | Market1501 [41]
MGN [20] + Att module | 87.36 | 95.04 | 98.28 | Market1501 [41]
MGN [20] | 77.70 | 88.06 | 95.20 | DukeMTMC [42]
MGN [20] + Att module | 78.94 | 89.27 | 94.70 | DukeMTMC [42]
MGN [20] | 66.71 | 69.71 | 85.14 | CUHK03 [43]
MGN [20] + Att module | 72.26 | 75.07 | 89.64 | CUHK03 [43]
Table 9. The CMC1 on CIFAR-10 and CIFAR-100.
Method | CMC1 | Dataset
ResNet50 [20] | 90.35 | CIFAR-10
ResNet50 [20] + Att module | 91.86 | CIFAR-10
ResNet50 [20] | 69.58 | CIFAR-100
ResNet50 [20] + Att module | 71.26 | CIFAR-100
Table 10. The mAP, CMC1, and CMC5 on VeRi-776. (* Method using extra annotation).
Method | mAP | CMC1 | CMC5
* AAVER [10] | 61.2 | 89.0 | 94.7
* PRND [3] | 74.3 | 94.3 | 98.7
* PVEN [13] | 79.5 | 95.6 | 98.4
RAM [7] | 61.5 | 88.6 | 94.0
VANet [19] | 66.3 | 89.8 | 96.0
MRM [9] | 68.5 | 91.77 | 95.82
PCRNet [29] | 78.6 | 95.4 | 98.6
SAVER [8] | 79.6 | 96.4 | 98.6
TCPM [6] | 74.9 | 93.9 | 97.1
PART [5] | 45.0 | 72.0 | 88.8
HPGN [30] | 80.1 | 96.7 | -
MCRL + SL [26] | 81.1 | 96.1 | 99.4
Baseline | 74.7 | 94.8 | 98.1
TANet | 80.5 | 95.4 | 98.4
Table 11. The mAP, CMC1, and CMC5 on VehicleID. (* Methods using extra annotation).
Method | Small 800 (mAP / CMC1 / CMC5) | Medium 1600 (mAP / CMC1 / CMC5) | Large 2400 (mAP / CMC1 / CMC5)
* AAVER [10] | - / 74.7 / 93.8 | - / 68.6 / 90.0 | - / 63.5 / 85.6
* PRND [3] | - / 78.4 / 92.3 | - / 75.0 / 88.3 | - / 74.2 / 86.4
* PVEN [13] | - / 84.7 / 97.0 | - / 80.6 / 94.5 | - / 77.8 / 92.0
VANet [19] | - / 88.1 / 97.3 | - / 83.2 / 95.1 | - / 80.4 / 93.0
GSTN [12] | 66.2 / 65.0 / - | 63.3 / 62.5 / - | 61.2 / 60.2 / -
PART [5] | - / 67.7 / 87.9 | - / 61.5 / 82.7 | - / 54.5 / 77.2
SAVER [8] | - / 79.9 / 95.2 | - / 77.6 / 91.1 | - / 75.3 / 88.3
MRM [9] | 80.0 / 76.6 / 92.3 | 77.3 / 74.2 / 88.5 | 74.0 / 70.8 / 84.8
TCPM [6] | 85.1 / 81.9 / 96.2 | 82.1 / 79.0 / 94.8 | 77.9 / 73.8 / 90.8
HPGN [30] | - / 83.9 / - | - / 79.9 / - | - / 77.3 / -
MCRL + SL [26] | 92.0 / 88.1 / 97.4 | 88.4 / 83.2 / 95.7 | 86.0 / 80.6 / 93.2
Baseline | 76.4 / 69.1 / 85.8 | 74.1 / 67.4 / 80.5 | 71.4 / 65.2 / 78.3
TANet | 88.2 / 82.9 / 95.7 | 87.0 / 81.5 / 94.1 | 85.9 / 79.6 / 94.3
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
