Transformer-Based Attention Network for Vehicle Re-Identification

Abstract: Vehicle re-identification (ReID) focuses on searching for images of the same vehicle across different cameras and can be considered the most fine-grained ID-level classification task. It is fundamentally challenging due to the significant differences in appearance presented by a vehicle with the same ID (especially from different viewpoints), coupled with the subtle differences between vehicles with different IDs. Spatial attention mechanisms, which have proven effective in computer vision tasks, also play an important role in vehicle ReID. However, they often require expensive key-point labels or suffer from noisy attention masks when trained without key-point labels. In this work, we propose a transformer-based attention network (TAN) for learning spatial attention information and hence for facilitating the learning of discriminative features for vehicle ReID. Specifically, in contrast to previous studies that adopted a transformer network, we designed the attention network as an independent branch that can be flexibly utilized in various tasks. Moreover, we combined the TAN with two other branches: one to extract global features that define the image-level structures, and the other to extract auxiliary side-attribute features that are invariant to viewpoint, such as color and car type. To validate the proposed approach, experiments were conducted on two vehicle datasets (the VeRi-776 and VehicleID datasets) and a person dataset (Market-1501). The experimental results demonstrated that the proposed TAN is effective in improving the performance of both the vehicle and person ReID tasks, and the proposed method achieves state-of-the-art (SOTA) performance.


Introduction
Vehicle re-identification (ReID) aims to match vehicle images in a camera network. Recently, this task has drawn increasing attention due to its wide applications in fields such as urban surveillance and traffic flow analysis. While deep convolutional neural networks (CNNs) have shown remarkable performance in vehicle ReID in recent years [1,2], various challenges remain and need further investigation. Among these challenges, the most significant is that a vehicle captured from different viewpoints usually has dramatically different visual appearances, as shown in Figure 1. On the other hand, vehicles with different IDs are likely to have very similar appearances in many scenarios, as shown in Figure 2. Hence, vehicle ReID is a challenging classification task with high intra-class variation and high inter-class similarity. Therefore, learning discriminative features that can distinguish different IDs but are invariant to viewpoints is of vital importance for vehicle ReID.
Part-based methods build on the observation that the features that most effectively distinguish different IDs are often confined to specific part regions. The key step in these methods is to detect the discriminative regions and extract the corresponding local features, explicitly or implicitly [3][4][5][6]. However, part-level annotations are expensive. Weakly supervised methods can be considered as an alternative that overcomes the lack of part-level annotations with only a slight performance loss [7,8]. It should be noted that, in vehicle ReID, it is difficult to perform part-level annotation due to the high intra-class variation. Consequently, part-based methods that rely too heavily on regional features for fine-grained discrimination may fail to match vehicles with the same ID seen from different viewpoints. Therefore, many existing methods adopt the attention mechanism for discriminative feature extraction, detecting key points to obtain spatial information that guides discriminative feature learning. The main idea of attention-based methods is to use key points that indicate the vehicle orientation as supervision for learning discriminative features [13]. Attention-based methods have achieved high performance. The key points have a significant effect on the performance of attention-based methods [10,11], and it is desirable that the most informative key points be detected and utilized to guide discriminative feature learning, although this is very difficult to achieve by manual annotation or automated detection. Uninformative key points may lead to poor and unstable attention.
In this paper, we propose a transformer-based attention network (TAN) to facilitate the learning of discriminative features for vehicle ReID without key-point annotation. Transformers were first proposed in the natural language processing (NLP) field and significantly improved performance on 11 NLP tasks [14], owing to their ability to represent long-range dependencies and to enable better parallel learning in sequence learning problems. The transformer has been successfully customized for many computer vision tasks [15][16][17][18], showing promising performance and high potential in computer vision.
Inspired by these studies, in this paper we propose a TAN for vehicle ReID for learning the spatial attention features that have high discriminative power. In contrast to previous studies that adopted the transformer network as a substructure, we designed the attention network as an independent branch that can be flexibly utilized in various tasks. Moreover, we combined the attention features captured by TAN and the global features that define the image-level structures, as well as the auxiliary side-attribute features that are invariant to viewpoint, such as color, car type, etc. To better combine the attention features, the global features, and the side features, we combined the cross-entropy (CE) loss and the triplet loss for training. The integration of these features is capable of enhancing the robustness of the model.
We conducted experiments on both vehicle ReID and person ReID tasks, using the VeRi-776 and VehicleID datasets (for vehicle ReID), and the Market-1501 dataset (for person ReID). Experimental results showed that the proposed TAN could improve the performance significantly on both these tasks. The proposed method, combining three kinds of features, achieved state-of-the-art (SOTA) performance, demonstrating the effectiveness of the combination strategy.
In summary, the main contributions of this work are:
* A multi-branch deep learning framework is proposed for vehicle ReID, which contains three main modules. The global module is used to extract the global features for overall discrimination, the side module processes specific side information such as color and type information, and the attention module processes spatial attention features to enhance the discriminative ability of the model.
* The transformer is introduced as the structure of our attention branch. As far as we know, this is the first study to exploit the transformer network for the ReID problem.
* We evaluate our method on three large-scale ReID benchmark datasets and obtain state-of-the-art performance without using any viewpoint annotations.
The rest of this paper is organized as follows. Section 2 introduces related work with respect to vehicle re-identification. Section 3 describes the details of our proposed approach. Section 4 presents the experiments, and Section 5 addresses some qualitative analysis. Section 6 offers the concluding remarks.

Object ReID
The ReID task is to search for one specific ID across different camera views. Vehicle ReID [7,11,19] and person ReID [20][21][22][23][24] are the two most popular tasks and have attracted increasing attention in recent years. Most ReID methods attempt to solve the problem using one of two approaches: invariant feature extraction or distance metric learning. Invariant feature extraction methods [20][21][22][23][24][25] attempt to learn a more discriminative feature classifier. Distance metric learning methods [19,26] compare two images by calculating the similarity of their feature vectors.

Vehicle ReID
In vehicle ReID, many studies have attempted to adopt metric learning (VANet [19]), attribute information networks [27,28], generative adversarial networks (GANs) [8], graph networks (GNs) [29,30], horizontal and vertical segmentation networks [6,7], semantic parsing (SP) [13], and vehicle part detection (VPD) [3,4] to enhance the performance. VANet [19] uses two branches to extract features from different perspectives and proposes a viewpoint-aware metric learning method for re-identification. Regarding attribute-based information networks, the authors in [27,28] enhance the model performance by using attributes to capture different types of local information. The GAN-based vehicle ReID method [8] attempts to generate cross-view images to enhance model robustness. The authors in [29,30] build an extractor of spatial information to overcome perspective differences. In [6,7], horizontal and vertical segmentation networks are combined to enhance the performance. PRND [3] and PGAN [4] exploit vehicle details and local regions, respectively: they detect predefined regions (e.g., back mirrors, lights, and wheels) and describe them with deep features. SAVER [8] modifies the input image and uses a GAN to erase the vehicle details. This composite image is then integrated with the input image, creating a new version with visually enhanced details for ReID. Some studies attempt to deal with dramatic changes in viewpoint [13,29]. Meng et al. [13] use semantic parsing to obtain different viewpoints for each vehicle and utilize a GN to align the spatial relationships between them. These studies have contributed significantly to the development of vehicle re-identification, but they ignore the fact that attention mechanisms can enhance the discriminative performance of the model.
Other studies exploit spatial attention information [10][11][12] or enhance attention features [4,5,8,9]. Most of the previous studies on attention use segmented or detected orientation information as guidance to generate spatial attention [10,11], or they use attention information to enhance the generalization performance of the model, thereby improving its discriminative ability [4,5,8]. In general, when attention is used as a guiding method, the model usually relies on manually processed data labels, and it is not stable enough at detecting the manually labeled local areas or key points. When attention is used only as a plug-in enhancement module, it may lose some of the spatial characteristics, and it does not fully combine the discriminative features with the attention features. In this paper, we propose a transformer-based attention network (TAN) for extracting the attention features and combine the attention features with the other features to improve the robustness of the model.

Transformer-Based Attention
In recent years, the transformer has gained increasing attention in both natural language processing and computer vision. The vision transformer (ViT) [15] was the earliest work to apply the transformer structure to the visual classification task. In addition to basic image classification, transformers are also used to solve various other computer vision problems, including object detection [16], semantic segmentation [17], image processing [18], and video understanding. ViT shows satisfactory performance on large-scale and more complex datasets. By cutting the image into patches, the transformer effectively captures global information through the sequence of patches, and the patches of different layers are connected to build global spatial features [31]. However, in ReID, spatial information is critical for feature learning [22] and has not been fully exploited. This inspired us to introduce the transformer for capturing global spatial information for ReID.
In this paper, we attempt to adopt transformer-based attention to better capture the long-distance dependent features and derive global information, as complementary information to that learned by a convolutional neural network (CNN), which focuses only on local characteristics. Moreover, we design the TAN as a single branch that can be combined with other branches, improving the flexibility of the model.
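The patch-sequence view used by ViT-style models, mentioned above, can be illustrated with a minimal sketch. It only shows how an image grid becomes a sequence of flattened patches; it is not the TAN itself, and the function name `image_to_patches` and the toy 4 × 4 input are hypothetical.

```python
def image_to_patches(image, patch_size):
    """Split an H x W grid (list of rows) into a row-major sequence of flattened patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size window into a single token.
            patch = [image[top + dy][left + dx]
                     for dy in range(patch_size)
                     for dx in range(patch_size)]
            patches.append(patch)
    return patches


# Hypothetical 4 x 4 single-channel "image" with pixel values 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = image_to_patches(image, patch_size=2)
print(len(tokens), tokens[0])
```

Each token can then attend to every other token, which is what lets a transformer relate distant image regions in a single layer.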

Approach
Our work adopts the multi-branch architecture, which is a popular design strategy for integrating different branches [5,7,20,24,27,28,32]. These approaches enable the network to attend to different features in individual branches, e.g., distinct spatial parts or channels. Whether they are used in pedestrian or vehicle re-identification tasks, multi-branch designs usually improve the model performance and enhance the model robustness.

Network Architecture
Similarly to most recent efforts, we used an end-to-end neural network architecture for image feature extraction, pre-trained on ImageNet [33]. In this subsection, we provide a detailed description of the structure and training of the proposed network. As illustrated in Figure 3, our network structure has three major branches: the global branch, the attention branch, and the side branch.
Let $X \in \mathbb{R}^{w \times h \times c}$ be an input image, where $w$, $h$, and $c$ are the width, the height, and the number of channels of the image, respectively. Initially, the image $X$ is input into the shared backbone network $F(X)$, which consists of the ResNet50 [34] layers from the first block up to the first layer of the third block (conv4.0). After passing through the shared layers $F(X)$, the features pass into three distinct branches: the global and side branches use the remaining layers of ResNet50 until the end, while the attention branch uses the modified transformer structure. Before entering each individual branch, we obtain a tensor $\hat{X} \in \mathbb{R}^{\hat{w} \times \hat{h} \times \hat{c}}$ from the shared layers; each branch then takes $\hat{X}$ as input, which ensures that the branch inputs have consistent dimensions.
The global branch. In this branch, we obtain the global representation. First, we obtain 2048-dimensional features through the last two layers of ResNet50, $F_g(\hat{X})$, and then obtain the final feature vector $g$ through 2D global average pooling (GAP).
The side branch. This branch captures the auxiliary side-attribute features (such as color and car type) that are invariant to viewpoint and help to improve the ReID performance. The initial tensor $\hat{X} \in \mathbb{R}^{\hat{w} \times \hat{h} \times \hat{c}}$ is reduced to a 2048-dimensional vector that is mainly used to learn auxiliary attributes such as the vehicle model or color. As in the global branch, 2048-dimensional features are obtained through the last two layers of ResNet50, $F_{side}(\hat{X})$, and the final vector $g_{side}$ is then obtained through 2D GAP. It is worth noting that, in training, the label used by the cross-entropy loss of the side branch is the vehicle model or color ID.

Transformer Based Attention
In this paper, we introduce transformer-based attention into the ReID problem, as the attention branch shown in Figure 3. Inspired by the work in [35], which combined the transformer with a CNN for image classification and achieved promising results, we modified the multi-head self-attention (MHSA) structure for the ReID task. In Figure 4, the left subfigure is the design structure in BoTNet [35], and the right subfigure is our proposed structure, where multi-layer perceptron (MLP) layers and norm layers are added to the original structure of BoTNet. The normalization is added to make the values passed by each layer more stable, and the MLP is added before the final residual connection to preserve the complexity of the spatial sequence and improve the robustness of the model.
With this design, we feed the initial tensor $\hat{X} \in \mathbb{R}^{\hat{w} \times \hat{h} \times \hat{c}}$ into the transformer-based network. The tensor passes through a stack of six transformer layers, $F_{att}(\hat{X})$, and the final feature vector $g_{att}$ is obtained through 2D global average pooling (GAP).
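To make the data flow of the attention branch concrete, the following is a heavily simplified sketch: a feature map is flattened into one token per spatial position, passed through a single self-attention layer with a residual connection, and pooled with GAP into one vector. It is not the authors' implementation. It uses a single head with identity query/key/value projections and omits the norm and MLP layers, and the toy 2 × 2 × 3 feature map is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with a residual connection.

    Simplification (assumption): queries, keys and values are the tokens
    themselves, so there are no learned projection weights here.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        attended = [sum(w * value[d] for w, value in zip(weights, tokens))
                    for d in range(dim)]
        outputs.append([x + a for x, a in zip(query, attended)])  # residual
    return outputs


def global_average_pool(tokens):
    """Average all spatial positions into one feature vector."""
    count = len(tokens)
    return [sum(token[d] for token in tokens) / count
            for d in range(len(tokens[0]))]


# Hypothetical 2 x 2 feature map with 3 channels, flattened to 4 tokens.
feature_map = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]],
]
tokens = [pixel for row in feature_map for pixel in row]
g_att = global_average_pool(self_attention(tokens))
print(len(tokens), len(g_att))
```

In the network described above, six such layers (each with multiple heads, normalization, and an MLP) would be stacked before pooling.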

Feature Embedding
After obtaining the feature representation vectors $g$, $g_{side}$, and $g_{att}$, we must connect the different vectors linearly in different metric spaces. The embedding feature vector obtained before the batch normalization layer is used for optimizing the triplet loss [36], and the classification vectors $\hat{g}$, $\hat{g}_{side}$, and $\hat{g}_{att}$ produced by the fully connected layer are used to optimize the classification loss (e.g., the cross-entropy (CE) loss). The embedding feature vector obtained after the batch normalization but before the fully connected layer balances the representations of the two different metric spaces and is therefore used for inference. The CE loss is computed with different targets for $g$ and $g_{side}$: the target label for $g$ is the ID label of the vehicle, whereas the target label for $g_{side}$ is the attribute or type label of the vehicle. In contrast, $g$ and $g_{att}$ use the same target ID when calculating the CE loss. From the resulting embedding vectors we form two sets, given by
$$L := \{\hat{g}, \hat{g}_{side}, \hat{g}_{att}\}, \qquad R := \{g, g_{side}, g_{att}\}.$$
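The routing of embeddings described above (pre-normalization features to the triplet loss, post-normalization features to inference, and fully connected outputs to the CE loss) can be sketched for a single branch as follows. This is a minimal illustration under stated assumptions: the batch normalization has no learnable scale or shift, the linear layer has no bias, and the toy features and weights are hypothetical.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalize each feature dimension to zero mean and unit variance over the batch."""
    size, dim = len(batch), len(batch[0])
    means = [sum(vec[d] for vec in batch) / size for d in range(dim)]
    variances = [sum((vec[d] - means[d]) ** 2 for vec in batch) / size
                 for d in range(dim)]
    return [[(vec[d] - means[d]) / math.sqrt(variances[d] + eps)
             for d in range(dim)] for vec in batch]


def linear(batch, weights):
    """Bias-free fully connected layer; weights has one row per class."""
    return [[sum(w * x for w, x in zip(row, vec)) for row in weights]
            for vec in batch]


def embedding_head(pooled, weights):
    """Return the three tensors used at different stages for one branch."""
    triplet_feat = pooled                     # before BN -> triplet loss
    inference_feat = batch_norm(pooled)       # after BN  -> retrieval at test time
    logits = linear(inference_feat, weights)  # after FC  -> cross-entropy loss
    return triplet_feat, inference_feat, logits


# Hypothetical pooled features for a batch of 4 images (dimension 2) and 3 classes.
pooled = [[1.0, 2.0], [3.0, 0.0], [5.0, 2.0], [7.0, 4.0]]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
triplet_feat, inference_feat, logits = embedding_head(pooled, weights)
print(len(logits), len(logits[0]))
```

The same head would be applied to each of the three branches, with the side branch using attribute labels as CE targets.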

Training and Loss Functions
For training, we used a combination of the CE loss and the triplet loss [36]. The latter is designed to pull the anchor close to the positive sample and push it away from the negative sample in each (anchor, positive, negative) triplet. The triplet loss $L_{tri}$ is computed on the embeddings $R$ obtained before batch normalization, and can be written as:
$$L_{tri} = \max\left(D_{ap} - D_{an} + \gamma,\ 0\right),$$
where $D_{ap}$ and $D_{an}$ denote the distances from the anchor to the positive sample and to the negative sample in the triplet, and $\gamma$ is the minimum margin between $D_{ap}$ and $D_{an}$. The CE loss $L_{ce}$, computed on all vectors in $L$ after applying softmax activation to the output of the fully connected layer, can be written as:
$$L_{ce} = -\frac{1}{n} \sum_{i=1}^{n} \log \mathrm{softmax}\big(f(X_i)\big)_{y_i},$$
where $n$ is the size of the training batch, $f(X_i)$ is the output of our network when forwarding the $i$th image $X_i$, and $y_i$ is its label. For the CE loss $L_{ce}$, all the labels that we used came from the standard, publicly annotated datasets. Thus, the overall objective loss function is
$$L_{total} = \alpha L_{tri} + \beta L_{ce},$$
where $\alpha$ and $\beta$ are suitable weights.
Additionally, we used random erasing augmentation (REA) [37], which randomly substitutes a rectangle with the image's mean value, to improve model generalization and to produce higher variance in the training data. Cosine annealing strategies are common in PReID or VReID networks [21,38]. To further boost performance, we used warm-up cosine annealing [39] as our learning rate strategy. The learning rate first grows linearly from $1 \times 10^{-4}$ to $1 \times 10^{-3}$ in 10 epochs, and then follows a cosine decay over the remaining epochs until $5 \times 10^{-4}$. The learning rate $lr(t)$ at epoch $t$ with $T$ total epochs is given by
$$lr(t) = \begin{cases} 1 \times 10^{-4} + \left(1 \times 10^{-3} - 1 \times 10^{-4}\right)\dfrac{t}{10}, & t \le 10, \\[2mm] 5 \times 10^{-4} + \dfrac{1}{2}\left(1 \times 10^{-3} - 5 \times 10^{-4}\right)\left(1 + \cos\left(\pi\,\dfrac{t - 10}{T - 10}\right)\right), & 10 < t \le T. \end{cases}$$
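The training objective and learning rate schedule described in this subsection can be sketched in plain Python. This is an illustrative sketch rather than the authors' code: the pairing of $\alpha$ with the triplet term and $\beta$ with the CE term, the margin value 0.3, and all function names are assumptions, while the learning rate endpoints and the 10-epoch warm-up are taken from the text.

```python
import math


def triplet_loss(dist_ap, dist_an, margin):
    """Hinge-style triplet loss: max(D_ap - D_an + margin, 0)."""
    return max(dist_ap - dist_an + margin, 0.0)


def cross_entropy(logits_batch, labels):
    """Mean softmax cross-entropy over a batch of logit vectors."""
    total = 0.0
    for logits, label in zip(logits_batch, labels):
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_sum_exp - logits[label]
    return total / len(labels)


def warmup_cosine_lr(epoch, total_epochs, warmup_epochs=10,
                     start_lr=1e-4, peak_lr=1e-3, final_lr=5e-4):
    """Linear warm-up to peak_lr, then cosine decay down to final_lr."""
    if epoch < warmup_epochs:
        return start_lr + (peak_lr - start_lr) * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


# Assumed weighting: alpha on the triplet term, beta on the CE term.
alpha, beta = 0.5, 0.5
loss = (alpha * triplet_loss(0.9, 0.6, margin=0.3)  # margin 0.3 is hypothetical
        + beta * cross_entropy([[2.0, 0.5, 0.1]], [0]))
print(round(loss, 4), warmup_cosine_lr(5, total_epochs=150))
```

In a full training loop the two loss terms would be summed over the three branches, with the side branch using its attribute labels for the CE term.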

Feature Fusion for Inference
In order to effectively integrate the global features and the side features, we calculated the average of the L2-normalized features as the final discriminative feature for inference:
$$g_{fuse} = \frac{1}{2}\left(\frac{g}{\lVert g \rVert_2} + \frac{g_{side}}{\lVert g_{side} \rVert_2}\right).$$
Such a feature fusion involves a normalization operation and an averaging operation. The global feature $g$ and the side feature $g_{side}$ are complementary to each other from the perspective of feature fusion. It should be noted that the attention feature is not a specific vehicle feature or side-information feature, but a feature that acts on the global spatial information. In this paper, the attention feature is stacked with the discriminative features to expand the dimension of the discriminative representation. The resulting feature is the one we used in the testing phase, as shown in Figure 5. In the testing phase, we input the query image and the gallery images to obtain their feature matrices separately. The distance matrix is computed from the feature matrices, and the indices of the retrieved images are returned according to the final score ranking.
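The inference procedure above (normalize, average, stack the attention feature, then rank the gallery by distance) can be sketched as follows. This is a minimal illustration, not the authors' implementation: normalizing the attention feature before stacking is an assumption, and the function names and toy vectors are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_features(g, g_side, g_att):
    """Average the L2-normalized global and side features, then append the attention feature."""
    g_n, side_n = l2_normalize(g), l2_normalize(g_side)
    averaged = [(a + b) / 2.0 for a, b in zip(g_n, side_n)]
    return averaged + l2_normalize(g_att)  # normalizing g_att is an assumption


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_gallery(query_feat, gallery_feats):
    """Return gallery indices sorted by increasing distance to the query."""
    distances = [euclidean(query_feat, feat) for feat in gallery_feats]
    return sorted(range(len(gallery_feats)), key=lambda i: distances[i])


# Hypothetical 2-dimensional branch outputs for one query and two gallery images.
query = fuse_features([3.0, 4.0], [0.0, 2.0], [1.0, 0.0])
gallery = [
    fuse_features([0.0, 1.0], [1.0, 0.0], [0.0, 1.0]),
    fuse_features([6.0, 8.0], [0.0, 5.0], [2.0, 0.0]),
]
ranking = rank_gallery(query, gallery)
print(ranking)
```

Because the features are L2-normalized, a gallery image whose branch outputs differ from the query only in scale receives the same fused feature and ranks first.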

Datasets
We conducted extensive experiments on two public large-scale benchmarks for vehicle ReID (i.e., VeRi-776 [1] and VehicleID [40]) and three datasets for person ReID (i.e., Market-1501 [41], DukeMTMC [42], and CUHK03 [43]). All datasets except VehicleID provide camera ID for each image, while only the VeRi-776 dataset provides viewpoint labels for each image. The statistics of the five datasets are summarized in Table 1, and the vehicle attributes for two datasets are summarized in Table 2.
VeRi-776 [1] is a public vehicle dataset which consists of 49,357 images of 776 distinct vehicles that were captured with 20 non-overlapping cameras in a variety of orientations and lighting conditions. We followed the original protocol to retrieve queries in an image-to-track fashion, where queries and the correct gallery samples must be captured from different cameras.
VehicleID [40] is a widely-used vehicle ReID dataset which contains vehicle images captured in the daytime by multiple cameras. There are a total of 221,763 images with 26,267 identities, where each vehicle has either a front or rear view. The training set contains 13,134 identities, while the test set has 13,133 identities. The test set is further divided into three subsets with different sizes: a small subset (800 vehicles and 7332 images) denoted VehicleID 800, a medium subset (1600 vehicles and 12,995 images) denoted VehicleID 1600, and a large subset (2400 vehicles and 20,038 images) denoted VehicleID 2400.
Market-1501 [41] is a commonly used person ReID dataset that contains 1501 identities observed under 6 camera viewpoints, 19,732 gallery images and 12,936 training images.

Evaluation Metrics
The mean average precision (mAP) and the cumulative match curve (CMC) at rank-1 and rank-5 were employed to evaluate the performance of our proposed method (denoted TANet in this paper). Each query image in a subset of test images was tested against the other test images. The average precision for each query $q$ was calculated as:
$$AP(q) = \frac{\sum_{k=1}^{n} P(k) \times rel(k)}{N_c^i},$$
where $P(k)$ denotes the precision at the $k$th position of the results, $rel(k)$ is an indicator function equal to 1 if the $k$th result is correctly matched and 0 otherwise, $n$ is the number of test images, $i$ indexes the query image, and $N_c^i$ is the number of test images that share the class label $C$ of the $i$th query image. After computing the AP for each query image, the mAP was calculated as follows:
$$mAP = \frac{1}{Q} \sum_{q=1}^{Q} AP(q),$$
where $Q$ is the number of all queries.
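The metrics above can be computed from ranked match lists, as in the following sketch. It assumes that each ranked list covers the whole gallery, so that the number of correct matches in the list equals the number of gallery images with the query's label; the function names and toy rankings are hypothetical.

```python
def average_precision(ranked_matches):
    """AP for one query; ranked_matches[k] is True if the (k+1)-th result is a correct match."""
    num_relevant = sum(ranked_matches)
    if num_relevant == 0:
        return 0.0
    hits, precision_sum = 0, 0.0
    for position, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / position  # P(k) * rel(k)
    return precision_sum / num_relevant


def mean_average_precision(all_ranked_matches):
    """Mean of the per-query AP values."""
    return sum(average_precision(m) for m in all_ranked_matches) / len(all_ranked_matches)


def cmc_at_rank(all_ranked_matches, rank):
    """Fraction of queries with at least one correct match in the top `rank` results."""
    hits = sum(1 for m in all_ranked_matches if any(m[:rank]))
    return hits / len(all_ranked_matches)


# Hypothetical rankings for two queries over a gallery of four images.
rankings = [
    [True, False, True, False],
    [False, True, False, False],
]
print(mean_average_precision(rankings), cmc_at_rank(rankings, 1), cmc_at_rank(rankings, 5))
```

For the first toy query, the correct matches appear at positions 1 and 3, so its AP is (1/1 + 2/3) / 2.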

Implementation Details
The proposed method was implemented in PyTorch [44]. For training, the input images were normalized channel-wise to zero mean and unit standard deviation, with a spatial resolution of 256 × 256. Data augmentation was performed by resizing images to 105% of their width and height and randomly cropping them, as well as by random horizontal flipping with a probability of 0.5. Models were trained for 150 epochs on VeRi-776 and 260 epochs on VehicleID, with a batch size of 64. A batch consists of eight identities, each containing eight samples. The parameters were optimized using the Adam optimizer [45], with $\epsilon = 1 \times 10^{-8}$, $\lambda_1 = 0.9$, and $\lambda_2 = 0.999$. The backbones were pre-trained on ImageNet [33]. As shown by the experiments in Table 3, the best model results were obtained with the loss factors $\alpha = \beta = 0.5$. The Euclidean distance was utilized to compute the CMC. Query and gallery images were each resized to 256 × 256 pixels and normalized. For a fair comparison with other existing methods, the CMC rank-1 accuracy (r1) and the mAP are reported as the evaluation metrics.
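The batch composition described above (eight identities with eight samples each, for a batch size of 64) is commonly produced by an identity-balanced sampler. The sketch below is one possible way to build such batches and is not taken from the authors' code: the function name `pk_batches`, the fallback to sampling with replacement, and the toy label list are assumptions.

```python
import random
from collections import defaultdict


def pk_batches(labels, num_ids=8, num_instances=8, seed=0):
    """Yield lists of sample indices with num_ids identities x num_instances samples each."""
    rng = random.Random(seed)
    by_id = defaultdict(list)
    for index, label in enumerate(labels):
        by_id[label].append(index)
    identities = list(by_id)
    rng.shuffle(identities)
    # Leftover identities that cannot fill a whole batch are dropped.
    for start in range(0, len(identities) - num_ids + 1, num_ids):
        batch = []
        for identity in identities[start:start + num_ids]:
            pool = by_id[identity]
            if len(pool) >= num_instances:
                batch.extend(rng.sample(pool, num_instances))
            else:  # too few images for this identity: sample with replacement
                batch.extend(rng.choices(pool, k=num_instances))
        yield batch


# Hypothetical toy dataset: 16 identities with 10 images each.
labels = [identity for identity in range(16) for _ in range(10)]
batches = list(pk_batches(labels))
print(len(batches), len(batches[0]))
```

Guaranteeing several samples per identity in every batch is what makes it possible to form anchor, positive, and negative triplets inside the batch.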

Ablation Study
In this section, we evaluate the effects of some key parameters (such as the number of layers, different branches, etc.) on the performance of the model.
The effect of the number of transformer layers. The number of transformer layers (denoted LayerNum in this paper) may affect the performance of the feature extraction and hence the performance of the whole system. We evaluated the performance of the proposed method with different LayerNum values, i.e., 4, 6, and 8. The results are shown in Table 4. As seen in the results, the performance does not differ much across LayerNum values. Since the mAP was highest when LayerNum was 6, we set LayerNum to 6 in the following experiments.
The effect of each branch. When the neural network architecture uses a multi-branch structure, the influence between individual branches is substantially increased. Thus, any introduction of branches must be well justified. The experimental results are listed in Table 5, from which we can see that the attention module improves the performance significantly (from 74.7% to 79.5% on VeRi-776 and 76.4% to 87.0% on VehicleID 800) and the side module further improves the performance (from 79.5% to 80.5% on VeRi-776 and 76.4% to 88.2% on VehicleID 800), demonstrating the effectiveness of the proposed transformer-based attention module and the side module.
The effect of the image size. We also evaluated the effect of the image size, as shown in Table 6. Because vehicles are rigid bodies, we used input images with an aspect ratio of 1:1. It can be seen from the results that the best performance is achieved when the image size is set to 256 × 256. Hence, in the following experiments, we set the default image size to 256 × 256.

Cross-Domain Dataset Testing
To validate the robustness of the proposed TANet method, we also conducted experiments on the person ReID task using Market-1501. The characteristic attributes of PReID and VReID are different. The results are shown in Table 7. Compared with PCB [22], our proposed method outperforms it by 2% in mAP but is 1.3% less effective in CMC1, because our model tends to focus on the overall information and cannot accurately extract subtle local differences. OSNet [21] is a light-weight multi-scale network for PReID. Compared with OSNet, our method is 2.6% better in mAP but the CMC1 is lower by 1.1%. ABDNet [24] adopts a two-branch structure and introduces a variety of attention mechanisms to focus on more details and effectively integrate them, so the performance is excellent. CNet [25] has better spatial learning methods for PReID. The performance in mAP is similar to our method, which verifies the robustness of the proposed method.
In order to better prove that our attention branch is universal, we also integrated the proposed transformer-based attention branch into other popular PReID models such as MGN [20], using three PReID datasets. The results are shown in Table 8. We can see that both mAP and CMC1 increase by 1-2 percent in the three datasets, and the increase is more obvious for the CUHK03 [43] dataset. Moreover, the performance of MGN combined with the attention module of our model is close to the latest SOTA performance.
In addition, we conducted classification tests on the CIFAR-10 and CIFAR-100 datasets, fusing our attention module with ResNet50 as the backbone. The results are listed in Table 9. From the results, it can be observed that the classification performance is significantly improved after fusing our attention module, without using a pre-trained model. This demonstrates that our framework has good extensibility and can be easily integrated with different frameworks for different tasks.
In this section, we compare the proposed TANet with state-of-the-art approaches for the three benchmarks.
For the VeRi-776 dataset, the results are shown in Table 10. Our method achieved 80.5% and 95.4% in mAP and CMC1, respectively. The mAP is lower than that of MCRL + SL [26] by 0.64 percentage points but higher than those of HPGN [30] and PVEN [13] by 0.32 and 1 percentage points, respectively. Compared with SAVER [8], which learns instance-specific discriminative features but ignores extreme viewpoint changes, our proposed method achieved similar performance on VeRi-776. Our method outperforms VANet [19] by 14.2% on mAP and 5.6% on CMC1. This may stem from the fact that VANet [19] uses limited viewpoints on VeRi-776, and two instances with similar viewpoints near the viewpoint boundary may be wrongly assigned to different viewpoints. Compared with HPGN [30], which adopts the graph network structure for ReID, our method performs better on VeRi-776. The proposed TANet also outperforms some typical methods that use extra data annotation, such as PRND [3], PVEN [13], and HPGN [30]. Compared with these methods, our approach shows excellent performance without any information besides the identity information, which verifies the effectiveness of fusing distinguishing features and attention features using the proposed method.
For the VehicleID dataset, the results are shown in Table 11. When the test size is small, the performance of our model is not outstanding, but as the size increases its relative performance improves. A CMC1 accuracy of 82.9%, 81.5%, and 79.6% is achieved for the small, medium, and large sizes, respectively, and the mAP reached 88.2%, 87%, and 85.9%. Our method uses the transformer in a separate attention branch that copes better with changes in size. Hence, when the size is large, the transformer attention branch can capture more detailed features, effectively enhancing the performance of the model and achieving state-of-the-art performance.
Compared with MCRL + SL [26] and VANet [19], although our method did not perform as well for small and medium sizes, its performance on the large size was comparable. Our proposed TANet has better performance in handling more and larger datasets. Based on the performance on the above two datasets, the overall performance of our method was excellent.

Qualitative Analysis
In this section, we offer an insight into how the three modules improve ReID performance and visually compare the learning content of some branches, showing the impact of each module in the learning process.

Retrieval with Different Queries
In this section, we visualize the experimental results on VeRi-776, as can be seen in Figure 6. The left column shows the query images, while the images on the right-hand side are the top five results obtained by the proposed method. The global feature is good at retrieving images from the same viewpoint, while the proposed attention module enables the model to capture spatial attention features and hence retrieve the same object from different perspectives (e.g., the second row in Figure 6). In summary, our approach retrieves correct matches across different perspectives.

Visualization of the Rank List
To show the discriminatory ability of the model, we selected the top five images for visualization in descending order of similarity. In addition, the top five were adopted to obtain a more comprehensive performance of the model retrieval results.
Baseline and TANet retrieval results were compared on VeRi-776. In Figure 7, the images with a red border are errors, while the other images are correct results. Our proposed TANet can find images of the same vehicle from different angles and perspectives (e.g., the first and the second examples). Our method is not always effective when dealing with extreme situations, as shown in Figure 8, where the positive sample and the negative sample in the gallery have the same brand, the same model, the same color, and the same viewpoint. In general situations, the overall performance of our method is more reliable. Compared with the results of the baseline, the retrieval accuracy of our method is significantly improved.

Activation Map Visualization
In order to verify the effectiveness of the respective branches of the proposed method, we visualized the three branches separately. Examples are shown in Figure 9. The global branch can focus on more obvious features such as the rear of the car, the front face, and the lights, and the attention branch enhances the area that the global branch pays attention to. The heat-value distribution indicates the orientation of the car body in space, and it can be observed from the side branch. It is significant that our proposed method effectively combines the three branches to fuse distinguishing features and attention features for discrimination.
Figure 9. Visualization of the activation heat maps of each branch on the VeRi-776 dataset. The first column shows the query images, the second and third columns are the global branches, the fourth and fifth columns are the attention branches, and the last two columns are the side branches.

Conclusions
In this paper, we proposed a transformer-based attention method that uses attention to learn spatial features and enhance the discriminative ability of the model. Because of its ability to model attention and global spatial relationships, a transformer was used as our attention branch, and our method uses the transformer and a CNN in combination. Our method is composed of three branches: the global branch, the attention branch, and the side branch, and the experiments showed that each branch has a different effect on the results. The key network of TANet is the attention branch, which we also extracted and ported to other combined networks to demonstrate its efficiency and ability. The TANet method does not use additional data annotations, and in order to demonstrate the versatility of the method we conducted experiments on VReID and PReID datasets, achieving excellent results.
In the future, there is hope that re-identification techniques will evolve from 2D image searches into matching 2D targets from real-time surveillance videos or 3D models. However, much research is needed to realize this vision, especially to address the issues of retrieval speed and extreme spatial perspectives.