Attention-Aware Adversarial Network for Person Re-Identification

Person re-identification (re-ID) is a fundamental problem in the field of computer vision. The performance of deep learning-based person re-ID models suffers from a lack of training data. In this work, we introduce a novel image-specific data augmentation method on the feature map level to enforce feature diversity in the network. Furthermore, an attention assignment mechanism is proposed to ensure that the person re-ID classifier focuses on nearly all important regions of the input person image. To achieve this, a three-stage framework is proposed. First, a baseline classification network is trained for person re-ID. Second, an attention assignment network is built on the baseline network, in which the attention module learns to suppress the response of the currently detected regions and re-assign attention to other important locations. By this means, multiple important regions for classification are highlighted by the attention map. Finally, the attention map is integrated into the attention-aware adversarial network (AAA-Net), which generates high-performance classification results with an adversarial training strategy. We evaluate the proposed method on two large-scale benchmark datasets, namely Market1501 and DukeMTMC-reID. Experimental results show that our algorithm performs favorably against the state-of-the-art methods.


Introduction
Person re-identification (re-ID) is a fundamental problem in computer vision that aims to re-identify a person of interest in other cameras. In recent years, person re-ID has attracted increasing attention due to its wide applications and great potential, such as criminal investigation and security enhancement.
Although recent deep learning-based person re-ID networks achieve favorable performance, most of them still suffer from a lack of training samples. Training images in most datasets are collected from manually cropped person images, so creating a large-scale database is time consuming. To tackle this problem, the methods of [1][2][3][4] propose to learn robust and discriminative features from limited training samples in a deep neural network. The common data augmentation methods used by most approaches, e.g., random cropping, flipping, and color jittering, are not sufficient to achieve high-quality results. One approach to augmenting training samples is to generate multiple occluded images from each image by random erasing [5,6]. Although this method improves performance, it increases the memory and storage requirements. The method in [7] enriches the data on the feature map level. As a result, the learned object detector is robust to different conditions, e.g., occlusion, deformation, and illumination. Inspired by [7], we perform data augmentation on the level of feature maps by erasing (occluding) different spatial regions of the feature maps.
Existing person re-ID methods face various intrinsic difficulties, such as occlusion, illumination, image resolution, and noisy background, to name a few. Many methods [8][9][10][11] address these problems by representing the input person image with discriminative feature maps and matching them in a task-specific metric space. However, the performance of most of these methods may fall short due to limited training samples, e.g., the training images are not sufficient to cover the complex conditions of real scenes. As a result, the trained models easily over-fit the training datasets. The recently proposed image generation models, generative adversarial networks (GANs) [8], provide powerful tools to handle this problem by generating more realistic-style images. Many methods benefit from the high-performance GAN model [12,13]. The work in [5,6] generated adversarially occluded samples based on a proposed re-ID model and used these samples to further improve the performance of the model. The work in [10,11,14,15] extracted a number of discriminative characteristics, based on which end-to-end deep neural networks were trained. However, explicitly generating training samples and enlarging the datasets increases the burden on memory resources. It is feasible and more efficient to enrich features on the feature map level when training networks.
To this end, we propose a novel attention-aware adversarial network (AAA-Net) for person re-ID. Instead of explicitly enlarging the size of datasets, AAA-Net generates multiple occluded samples from each input on the feature map level. Specifically, the feature maps are blocked spatially, and a series of occluded feature maps is generated. This is effectively an image-specific data augmentation method, which diversifies the features and provides sufficient training samples according to the input. Furthermore, an attention assignment mechanism is proposed to enforce that attention is assigned to more important regions, as illustrated in Figure 1, and thus, these regions serve as convincing samples for the final classification results. Training the proposed framework consists of three stages. First, a baseline (BL) classifier for person re-ID is trained. Second, an attention assignment network is designed. According to the initial predicted attention map, as well as the set of occluded feature maps, attention is further re-assigned to the rest of the interesting locations, besides the originally focused area. This process is implemented with an adversarial strategy. As a result, an updated attention map is generated in the attention assignment network. Finally, an attention-aware adversarial network is proposed. Based on the attention map, an adversarial training strategy is employed to explore the decisive feature maps and classify the input person image accurately. The main contributions of this work are:

• We propose an attention-aware adversarial network for person re-ID, in which data augmentation is implemented on the feature map level.
• An attention assignment mechanism is proposed to re-assign attention to more important regions.
• The proposed method is evaluated on two large benchmark datasets and achieves promising results.

Person Re-ID
Two main technical components of person re-ID methods are feature extraction and metric learning. Existing methods focus on extracting robust and representative features of person images, based on which intra-class images are matched in a learned metric space. The work in [8,9,15,16] represented the input images with feature vectors that depict their global characteristics. However, these methods fail to capture local detailed information, which leads to inferior matching and detection performance. To solve this problem, many works proposed to explore local features. The work in [17] extracted the local features of key points. The work in [18] detected human pose in a local learning module for person re-ID. In [19], images were first divided into several patches. Then, all image patches were input into a long short-term memory (LSTM) network in sequence to generate ensembled features, which contain all local information.
Besides exploring discriminative feature representations, it is also crucial to learn a robust metric space, in which the distance between intra-class person images is small and the distance between inter-class person images is large. This is implemented by elaborately designed loss functions, such as the contrastive loss [20], the triplet loss [21][22][23], the quadruplet loss [24], etc.
In recent years, convolutional neural network (CNN)-based architectures have demonstrated powerful ability for person re-ID. The work in [25] captured the relationship of person images in multiple views with a novel cross-input neighborhood layer. The work in [17] utilized human pose estimation methods to facilitate the person re-ID task. The work in [26] combined a CNN and the dictionary learning technique for low-resolution person re-ID. The work in [27] integrated sparse reconstruction learning in a unified CNN framework to solve the partial person re-ID problem. The work in [22] proposed to detect discriminative regions in person images by training a comparative attention network (CAN), which is able to recognize which regions are decisive for identifying a person. Inspired by this work, our method employs an attention guidance module in the network to assign attention to the spatial regions of the feature maps that are crucial for recognizing a person.

Data Augmentation
The size of the database is crucial for training a robust model. However, it is time consuming and expensive to collect human-labeled samples and create a large-scale database. Most methods employ various data augmentation techniques to expand the dataset. Commonly used data augmentation methods include image resizing, color jittering, and horizontal or vertical flipping, to name a few. Different from these hard sample generation methods, the work in [28] proposed to occlude parts of the input image with a rectangular box. The occluded position and the size of the box are randomly selected from a range. The recent work [5] occluded a relatively complete part of the human body instead of small and scattered regions. All these methods generate multiple samples from one single image and expand the database to several times its original size. Although these data augmentation strategies help to train robust models and avoid over-fitting to some extent, they require heavy computation and storage resources.
Instead of explicitly expanding the size of the database, we propose a novel data augmentation method on the feature map level. Specifically, occlusions are applied to the extracted representative feature maps of the input person image. This strategy enforces feature diversity in an online manner. By this means, attention is assigned to different parts of the feature maps, and thus, the entire person is homogeneously highlighted in the adversarial network.

Method
In this work, a three-stage framework is proposed for person re-ID. First, a baseline classification network is trained. Second, an attention assignment network is proposed to predict an attention map, which forces the model to focus on more important target regions. Finally, an attention-aware adversarial network is designed to generate a high-performance classifier for person re-ID. In this section, each main component of the proposed framework is elaborated.

Baseline Network
As a foundation, we first trained a baseline network for person re-ID. The person re-ID task can be regarded as a classification problem, in which each person serves as a specific class. Denote the training set as $T = \{(I_i, y_i) \mid i \in \{1, 2, \dots, N\}\}$, where $I_i$ is the $i$-th person image and $y_i$ is its ground truth class label. $T$ contains $N$ labeled images of $C$ persons. The goal of the person re-ID task is to find a mapping function $F: I \to s$, which maps the input person image $I_i$ to its classification score vector $s_i$. As a result, the class with the highest probability is the predicted category of the input person image.
The baseline network is based on the classical ResNet-50 classification network [4]. Given the input person image, the convolutional modules of the ResNet-50 network extract the representative feature maps. Subsequently, the feature maps are fed to a fully-connected layer to generate the classification vector. Finally, a softmax classification loss function is employed to train the baseline classification network. The stochastic gradient descent (SGD) optimization method is utilized to minimize the loss function.
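As an illustration of the softmax classification loss described above, the following minimal sketch computes the class probabilities, the loss for one labeled image, and the predicted identity from a score vector $s_i$. The function names and the example scores are assumptions made for illustration; the sketch does not reproduce the ResNet-50 feature extractor or the authors' training code.

```python
import math


def softmax(scores):
    """Convert a classification score vector into class probabilities."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, label):
    """Softmax classification loss for one image with ground truth class `label`."""
    return -math.log(softmax(scores)[label])


def predict(scores):
    """The predicted identity is the class with the highest probability."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical scores for C = 3 identities (illustrative values only).
scores = [2.0, 0.5, -1.0]
loss_if_label_0 = cross_entropy(scores, 0)
loss_if_label_1 = cross_entropy(scores, 1)
```

The loss is small when the ground truth identity already receives the highest score and grows as its probability drops, which is the quantity SGD minimizes over the training set.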

Attention Assignment Network
Based on the baseline classification network, the proposed attention assignment network is trained. The procedure is shown in Algorithm 1.

Algorithm 1. The training strategy of the attention assignment network.
Input: N person images;
While N > 0 do
1: The feature maps of the middle layer are divided equally into 16 cells with a square grid;
2: Each grid cell in the feature maps is occluded in sequence, producing one occluded copy per cell;
3: These 16 groups of feature maps are entered together into the classification network;
4: Select the feature maps with the lowest classification probability to guide the generation of feature maps of the attention assignment mechanism;
5: Update the parameters of the attention assignment mechanism;
6: N ← N − 1;
end
Output: Retain the weights of the overall network structure.
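Steps 1 to 4 of Algorithm 1 can be sketched as follows. This is a simplified illustration under stated assumptions: the feature map is a single-channel 2-D list whose height and width are divisible by 4, occlusion is modelled by setting a cell to zero, and `toy_prob` is a hypothetical stand-in for the classification network. None of these names come from the original implementation.

```python
def occlude_cell(fmap, row, col, grid=4):
    """Return a copy of a 2-D feature map with one grid cell set to zero."""
    h, w = len(fmap), len(fmap[0])
    ch, cw = h // grid, w // grid  # height and width of one cell
    out = [list(r) for r in fmap]  # copy so the original stays intact
    for y in range(row * ch, (row + 1) * ch):
        for x in range(col * cw, (col + 1) * cw):
            out[y][x] = 0.0
    return out


def occluded_variants(fmap, grid=4):
    """Steps 1-2: one occluded copy per cell of a grid x grid partition."""
    return [occlude_cell(fmap, r, c, grid)
            for r in range(grid) for c in range(grid)]


def hardest_variant(fmap, true_class_prob, grid=4):
    """Steps 3-4: score every variant and keep the lowest-probability one."""
    variants = occluded_variants(fmap, grid)
    probs = [true_class_prob(v) for v in variants]
    idx = min(range(len(probs)), key=probs.__getitem__)
    return idx, variants[idx]


def toy_prob(fmap):
    """Assumed classifier stand-in: more remaining activation, higher probability."""
    total = sum(sum(r) for r in fmap)
    return total / (1.0 + total)


fmap = [[1.0] * 8 for _ in range(8)]
fmap[2][5] = 9.0  # strong response inside cell (row 1, column 2)
idx, hard = hardest_variant(fmap, toy_prob)
```

In this toy setting, the selected variant is the one that hides the strongest response, which matches the intent of using the hardest occluded sample to guide the attention map toward the remaining regions.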
The attention assignment network aims to guide attention to nearly all parts of the target object. Concretely, besides the regions that contribute most to the final result, other important regions are also attended to according to the attention assignment mechanism. This is implemented in an adversarial manner. In detail, the occluded sample with the lowest classification probability is selected as a template in each training iteration. This allows the attention map to fit the feature response range, which means that areas of the image that are important to the classification get high values in the attention map.
The framework of the attention assignment network is shown in Figure 2. The architecture can be regarded as two branches. Given an input person image, the convolutional modules in the baseline network are employed to extract representative feature maps, denoted as $f_r$. In one branch, the feature maps are input into several stacked convolutional layers to generate an attention map, which indicates the important regions in the person image for detection and recognition. In the other branch, the feature maps are occluded in turn by each of 16 non-overlapping rectangular boxes arranged on a regular grid. As a result, 16 groups of feature maps with occluded areas are generated. Each group of occluded feature maps is fed into a fully-connected layer, and the corresponding classification scores are output after a softmax operation. The occluded feature maps that obtain the lowest classification probability are picked out as the difficult sample. Subsequently, both the attention map $M_{att}$ and the difficult sample $M_{occ}$ are input to the adversarial loss,
$$L_{adversarial}(M_{att}, M_{occ}) = L_{att}(M_{att}, M_{occ}) + L_r(M_{att}). \quad (1)$$
The first term, $L_{att}$, is the data term, which encourages attention to be assigned to other important areas. The second term, $L_r$, is the regularization term, which prevents the network from over-fitting and also helps maintain the structure of the original feature maps.
During the training phase, the layers that the attention assignment network shares with the baseline network, including the convolution modules of the ResNet-50 network and the fully-connected layer, are initialized from the baseline network. These parameters are fixed in the backward stage. The rest of the parameters are initialized randomly. As a result, an attention map generation mechanism is learned in this network. The attention map is able to re-assign attention to areas other than the most important region that contributes most to the final result. Therefore, nearly all parts of the target person are highlighted under this attention assignment mechanism.

Attention-Aware Adversarial Network
Building on the attention assignment network, the generated attention map is further integrated with the representative feature maps of the input person image, aiming to assign more attention to the regions of interest besides the most important one. The framework of the attention-aware adversarial network for final person re-ID is illustrated in Figure 3. This is also a two-stream network, in which the shared convolution modules are initialized with the parameters of the attention assignment network pre-trained in Section 3.2. In one stream, the attention map is generated from the input person image. Owing to the pre-trained network in Section 3.2, instead of only highlighting the most attentive regions, the attention map is able to assign attention to nearly all target regions of the input person image. In the other stream, the attention map is integrated with the representative feature maps of the person image by an element-wise multiplication operation. The resulting feature maps, which are occluded by the attention mask, are referred to as attention-aware feature maps. Together with the original representative feature maps, the occluded attention-aware feature maps are entered into the subsequent classification network. We not only update the parameters of the attention mask again through the switchable gradient update mechanism, but also update the entire baseline synchronously. The switchable gradient update mechanism is controlled by the parameter ω, which is detailed in Section 4.3. Since we produce occlusions on the feature maps, we could simply follow the same strategy as before and occlude the feature maps with a square mask. However, there may be multiple important regions in the feature maps; for instance, both the satchel and the T-shirt may have equally significant effects when the classifier makes decisions. Consequently, we use a flexible policy to address this limitation. The implementation details are as follows.
We use the raw feature maps and the occluded feature maps as the inputs for the rest of the CNN network. These feature maps are used to calculate the predicted probability of classification. In this stage, the two parts of our network are jointly optimized. The loss function of the model on these samples is computed by the cross-entropy. The procedure is shown in Algorithm 2.

Algorithm 2. The training strategy of the attention-aware adversarial network.
Input: N person images;
While N > 0 do
1: The attention map obtained by the pre-trained attention assignment mechanism is used as a 0-1 attention mask;
2: The feature maps of the middle layer are multiplied by the 0-1 mask to obtain the occluded feature maps;
3: The occluded feature maps are entered into the classification network along with the original feature maps to obtain $p_{bf}$, the classification probability of the original feature maps, and $p_{af}$, the classification probability of the occluded feature maps;
4: Update the parameters of the entire network by combining $p_{bf}$ and $p_{af}$ with the adversarial loss;
5: N ← N − 1;
end
The entire adversarial loss function is formulated in terms of the following quantities. $\lambda_1$ and $\lambda_2$ are constants obtained by heuristic methods, and $L_1$ and $L_2$ are the same terms as in the pre-training adversarial loss function (1), in which $L_1$ is the regularization term and $L_2$ is $L_{att}$. $p_{bf}$, $p_{af}$, and $\omega$ are the classification probability before occluding the feature maps, the classification probability after occluding the feature maps, and an empirical constant, respectively. $p_{bf}$ and $p_{af}$ determine whether the parameters of the entire network are updated. We use a decentralized training method; more specifically, we take turns training on a number of occluded samples and a number of unoccluded samples. The optimization method is the same as in Sections 3.1 and 3.2.
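Steps 1 and 2 of Algorithm 2 can be illustrated with a small sketch. Two points are assumptions rather than details given in the text: the attention map is binarized with a fixed threshold (the value 0.5 is arbitrary), and strongly attended locations are the ones set to 0, so that they are occluded. The function names are hypothetical, and a single-channel 2-D list stands in for the feature maps.

```python
def binarize(attention, threshold=0.5):
    """Step 1: turn an attention map into a 0-1 mask.

    Assumption: attended locations (value >= threshold) become 0, so they
    are occluded; all other locations become 1 and pass through unchanged.
    """
    return [[0.0 if a >= threshold else 1.0 for a in row] for row in attention]


def apply_mask(fmap, mask):
    """Step 2: element-wise product of a feature map and the 0-1 mask."""
    return [[f * m for f, m in zip(frow, mrow)]
            for frow, mrow in zip(fmap, mask)]


# Illustrative values only.
attention = [[0.9, 0.1],
             [0.2, 0.8]]
fmap = [[3.0, 4.0],
        [5.0, 6.0]]

mask = binarize(attention)
occluded = apply_mask(fmap, mask)
```

Because the mask follows the attention map rather than a fixed square, several separate regions can be occluded at once, which is the flexibility motivated by the satchel and T-shirt example. Both `fmap` and `occluded` would then be passed to the classifier to obtain $p_{bf}$ and $p_{af}$.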

Experimental Results
In this section, we conduct experiments to analyze the main components of the proposed method and compare its performance with that of the state-of-the-art algorithms.

Datasets and Evaluation Metrics
We evaluated the proposed method and the state-of-the-art algorithms on two large-scale benchmarks, Market1501 [29] and DukeMTMC-reID [30,31].
Market1501 contains 12,936 training images and 19,732 test images, with 1501 identities captured by six cameras; its 32,668 bounding boxes were generated by the DPM detector. Of the identities, 751 were used for training and 750 for testing.
DukeMTMC-reID has the same format as Market1501, with 16,522 images for training and 19,889 images for testing. DukeMTMC-reID contains 1404 identities, of which 702 identities were used for training and the rest for testing. Manually annotated pedestrian bounding boxes were provided as the ground truth.
Two common evaluation metrics, Rank-1 and mean average precision (mAP), were employed to evaluate the performance of the re-ID models.
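For reference, the two metrics can be computed from ranked retrieval results as in the sketch below. It assumes that, for each query, the gallery has already been sorted by similarity and reduced to a list of booleans marking correct identities. The benchmark protocols also filter the gallery before ranking (for example, by discarding certain same-camera images), and that step is omitted here. The function names and the two example queries are illustrative.

```python
def rank1(ranked_matches):
    """Fraction of queries whose top-ranked gallery image is a correct match."""
    hits = sum(1 for matches in ranked_matches if matches and matches[0])
    return hits / len(ranked_matches)


def average_precision(matches):
    """Mean of the precision values taken at the rank of every correct match."""
    hits, precisions = 0, []
    for rank, is_match in enumerate(matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(ranked_matches):
    """mAP: the average precision averaged over all queries."""
    return sum(average_precision(m) for m in ranked_matches) / len(ranked_matches)


# Two hypothetical queries with three gallery images each.
queries = [
    [True, False, True],   # AP = (1/1 + 2/3) / 2 = 5/6
    [False, True, False],  # AP = 1/2
]
r1 = rank1(queries)                      # 1 of 2 queries is correct at rank 1
m_ap = mean_average_precision(queries)   # (5/6 + 1/2) / 2 = 2/3
```

Rank-1 only looks at the best match, whereas mAP rewards placing all images of the queried identity near the top of the ranking.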

Implementation Details
All input person images were uniformly resized to 224 × 224. The baseline network in Section 3.1 was based on the classical ResNet-50 classification network. The learning rate was set to 0.001 and was decayed by a factor of 0.1 every two epochs. The baseline network converged after 10 epochs. Then, the attention assignment network in Section 3.2 was fine-tuned from the parameters of the baseline network. The learning rate was 0.0001 and was decayed by a factor of 0.1 every three epochs. The attention assignment network converged after 10 epochs. Finally, the attention-aware adversarial network in Section 3.3 was fine-tuned from the parameters of the attention assignment network. The learning rate was 0.0001 and was decayed by a factor of 0.1 every three epochs. The network converged after 10 epochs. For all three networks, the SGD optimization method was employed to minimize the training losses.
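The step-wise learning rate decay described above can be written as a small helper. The sketch assumes that epochs are counted from 0 and that the rate is multiplied by 0.1 at the start of every epoch whose index is a multiple of the step size; the text does not state these conventions, and the function name is hypothetical.

```python
def step_lr(base_lr, epoch, step_size, gamma=0.1):
    """Learning rate at a given epoch under step decay.

    Assumption: epochs start at 0 and the rate is multiplied by `gamma`
    once every `step_size` epochs.
    """
    return base_lr * gamma ** (epoch // step_size)


# Baseline network: 0.001, decayed every two epochs, 10 epochs in total.
baseline_schedule = [step_lr(0.001, e, step_size=2) for e in range(10)]

# Attention assignment and attention-aware adversarial networks:
# 0.0001, decayed every three epochs, 10 epochs in total.
finetune_schedule = [step_lr(0.0001, e, step_size=3) for e in range(10)]
```

Under these assumptions the baseline rate is decayed four times within its 10 epochs, while the two fine-tuning stages start one order of magnitude lower and are decayed three times.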

Analysis of the Parameters
The hyper-parameters $\lambda_1$ and $\lambda_2$ were used to adjust the importance of the two terms in the loss function, Equation (4). According to the quantitative experiments in Table 1, we empirically imposed the constraint $\lambda_1 < \lambda_2$. In our work, $\lambda_1$ and $\lambda_2$ were set to 0.3 and 1.7, respectively.

Influence of the Attention Assignment Network
As elaborated in Section 3.2, the attention map predicted in this stage focused not only on the regions that contributed most to the final results, but also on the other important regions that indicated the unique features of the person. In the next stage, the parameters of the attention assignment network were employed to fine-tune the attention-aware adversarial network. As a result, the attention map in the attention-aware adversarial network inherited the ability to highlight nearly all attentive regions of the target person.
In this section, we experimentally analyze whether the attention map trained in the attention-aware adversarial network was able to capture as many important regions as possible. As a comparison, we did not fine-tune the attention-aware adversarial network with the parameters of the attention assignment network; instead, we simply occluded the one-third of the feature map regions with the highest response. As shown in Table 4, fine-tuning the attention-aware adversarial network with the pre-trained network achieved more favorable performance. The visual occlusion results of the attention map generated in the attention-aware adversarial network are illustrated in Figure 4. According to the results, the occlusion samples were diverse: multiple parts of the important regions were occluded for the adversarial process. Therefore, the attention-aware adversarial network with the fine-tuning operation outperformed the one without fine-tuning and improved the performance of the baseline network by a large margin. As shown in Table 5, the attention-aware adversarial network also outperformed the network with random erasing.

[Table: columns Method, GAN, OIM, ACRN, PAN, APR, Baseline, Ours; row Rank]

Conclusions
In this paper, we introduced an attention-aware adversarial network (AAA-Net) for person re-ID. A novel data augmentation method was proposed to enforce data diversity on the feature map level. Based on the augmented feature maps, an attention map was generated in an attention assignment network to assign attention to nearly all important regions of the person image. Subsequently, an attention-aware adversarial network was trained to classify the input hard samples accurately according to the attention map in an adversarial manner. Extensive experimental results on two large-scale benchmarks demonstrated the effectiveness of the proposed framework. In the future, we will enhance the representation of the mask feature or use a recurrent training framework.

Figure 1. Our result can focus on more important regions on feature maps.

Figure 2. The attention assignment network. fm1 indicates that the first part of the feature map is occluded, and s1 represents the classification probability of fm1. The occluded feature map with the lowest classification probability is selected to guide the generation of the attention map according to the adversarial loss.

Table 4.
Figure 4. Occlusion examples generated by the attention map in the attention-aware adversarial network with the fine-tuning operation. The black mask represents the occluded area.

Table 3. Performance of inserting the occlusion operation in different locations of the network. "Baseline" represents the baseline network.

Table 5. Performance of the attention-aware adversarial network and random erasing on the baseline network.