Parts Semantic Segmentation Aware Representation Learning for Person Re-Identification

Person re-identification is a typical computer vision problem which aims at matching pedestrians across disjoint camera views. It is challenging due to the misalignment of body parts caused by pose variations, background clutter, detection errors, camera viewpoint variations, different accessories and occlusion. In this paper, we propose a person re-identification network which fuses global and local features to deal with the part misalignment problem. The network is a four-branch convolutional neural network (CNN) which learns the global person appearance and local features of three human body parts respectively. Local patches, including the head, torso and lower body, are segmented by using a U_Net semantic segmentation CNN architecture. All four feature maps are then concatenated and fused to represent a person image. We propose a DropParts method to solve the part missing problem, with which the local features are weighted according to the number of parts found by semantic segmentation. Since the three body parts are well aligned, the approach significantly improves person re-identification. Experiments on standard benchmark datasets, such as the Market1501, CUHK03 and DukeMTMC-reID datasets, show the effectiveness of our proposed pipeline.


Introduction
Person re-identification is a typical computer vision problem which aims at matching pedestrians across disjoint camera views. It has attracted a lot of research interest due to its significant application potential in areas such as visual recognition and surveillance [1,2]. One of the most important tasks in person re-identification is to learn generic and robust feature representations of people.

To address this problem, many scholars have recently focused on person re-identification based on part alignment. Some methods divide the person image into many stripes or grids to reduce the effects of part misalignment [7,10]. The division into grids or stripes is predefined and heuristic, and cannot locate the parts precisely. Pose-based methods [5,11] employ a pose estimation model to infer corresponding bounding boxes. However, part missing is inevitable, and it prevents the convolutional neural network from working properly.
This paper focuses on the problem of body part misalignment. It proposes a human parts semantic segmentation aware representation learning method for person re-identification. We employ a semantic segmentation network to infer corresponding bounding boxes, and propose a DropParts method to solve the part missing problem. Experiments on standard benchmark datasets show the effectiveness of our proposed pipeline. The contributions of this paper are as follows: (1) We design a four-branch convolutional neural network to deal with the part misalignment problem. The four-branch CNN learns a person's appearance features both globally and from the features of three local body parts. The bounding boxes of the three body parts are inferred from human parts semantic segmentation results, which are learned with the popular U_Net [12] semantic segmentation network.
(2) We propose a DropParts method to solve the part missing problem, with which the local features are weighted according to the parts absence vector and fused with the global feature. The DropParts method makes the four-branch convolutional neural network work properly when part missing occurs. Moreover, it improves the performance of person re-identification.

Related Work
In this section, we present a brief review of works on feature extraction and part alignment for person re-identification.
Early on, hand-crafted feature extractors, such as color histograms [13], Scale-Invariant Feature Transform (SIFT) [14], Local Binary Patterns (LBP) [15], Bag of Words (BoW) [8] and Local Maximal Occurrence (LOMO) [16], were employed for person representations. Recently, deep learning based methods, which learn feature representations directly from the task, have shown significant improvement over hand-crafted feature extractors. Various popular CNN architectures, such as the Inception network [3-5] and ResNet [6,7], have been applied to learn feature representations for person re-identification. Additionally, different loss functions, such as the Softmax loss [17], Siamese loss [4,18], Cluster loss [19], Triplet loss [20] and their combinations [21], are used to improve discriminative feature learning in person re-identification tasks. The Softmax loss [17] is the common loss function used in recognition tasks.
Many scholars focus on person re-identification based on part alignment [7,10,17,22,23]. Early works divide the person image into many stripes or grids to reduce the effects of part misalignment. Article [10] divides the person image into three horizontal stripes and extracts CNN features of each stripe; the features are then concatenated and fused with a fully connected layer to represent a person image. Meanwhile, the DeepReID method [17] also divides the person image into horizontal stripes and carries out patch matching within each stripe. On the other hand, SpindleNet [22] incorporates human body structure information into the person re-identification pipeline to help align body part features across images. Features of different semantic levels are merged by a tree-structured fusion network based on human body regions, guided by multi-stage feature decomposition and tree-structured competitive feature fusion, to represent a person image. The IDLA method [23] captures local relationships between the two input images on the basis of mid-level features of each input image, computes a high-level summary of the outputs of this layer with a layer of patch summary features, and then spatially integrates them with subsequent layers. More stripe- and grid-based methods can be found in [7]. Although stripe- and grid-based methods reduce the risk of part misalignment, the division into grids or stripes is predefined and heuristic, and cannot locate parts precisely.
Pose-based person re-identification methods leverage external cues from human pose estimation. Article [11] incorporates a simple cue of the person's coarse pose (i.e., the captured view with respect to the camera) and the fine body pose (i.e., joint locations) to learn a discriminative representation of the person image. The PDC method [5] leverages human part cues to alleviate pose variations and learns feature representations from both the global image and different local parts. To match the features from the global human body and local body parts, a pose-driven feature weighting sub-network is further designed to learn adaptive feature fusions. Pose-based methods leverage human pose estimation to infer the location of body parts. However, part missing is inevitable, and it prevents the convolutional neural network from working properly. Furthermore, it is hard to find the right body part in a crowd, because there may be several parts with the same semantic label in an image.
The attention mechanism has a large impact on neural computation; it selects the most pertinent pieces of information and focuses on specific parts of the visual input to compute adequate responses [24-29]. Article [25] decomposes the human body into regions following learned person re-identification sensitive attention maps. Accordingly, it computes representations over the regions, and aggregates the similarities computed between the corresponding regions of a pair of probe and gallery images as the overall matching score. The PersonNet method [26] learns an attention map from different scales for each module and applies the attention map to different layers of the network. Finally, it learns features by fusing three attention modules with a Softmax loss. Moreover, the HydraPlus-Net method [27] has several local feature extraction branches which learn a set of complementary attention maps, in which hard attention is used for the local branches and soft attention for the global branch, respectively. More methods based on the attention mechanism can be found in [28,29]. Attention-based methods highlight the important region information of person images, but they also increase the number of feature maps several times over, and bring a risk of over-fitting.
We use a semantic segmentation network to infer human body parts in this paper. Due to the ensemble effect of labeling each pixel, bounding boxes inferred from the semantic segmentation map are stable. We propose a DropParts method to solve the part missing problem; the method makes the four-branch convolutional neural network work properly when part missing occurs.

Overview of the Proposed Method
Given a probe person image, person re-identification retrieves the most similar persons from gallery sets according to the distance between appearance representations. Our objective is to learn generic and robust feature representations of persons.
Figure 2 illustrates the architecture of the proposed parts aware person re-identification network, consisting of four CNN branches which learn the person appearance and three body part feature maps. The four feature maps are fused into an image descriptor. Three local patches, including the head patch, torso patch and lower-body patch, are inferred from a semantic segmentation map. Four image patches, including the whole person image and the three part patches, are resized to fixed sizes and then input into the proposed four-branch network. Each branch learns the representation of one part; the representations are finally fused by a concatenation layer and a fully connected layer. A softmax layer is used to classify person IDs.

INPUT: Given an image I ∈ R^(M×N) and its semantic segmentation map S ∈ {0,1,2,3}^(M×N), where semantic labels 0, 1, 2 and 3 represent background, head, torso and lower-body pixels of the person image respectively; M and N are the height and width of the person image, respectively.
NETWORK: The bounding boxes {BB_i}, i = 1, 2, 3, of the three local parts are fixed by the minimum enclosing rectangles of pixels with the same semantic label. The corresponding image patches are denoted as {P_i}, i = 1, 2, 3 (P_i is a null matrix if its corresponding part is missing). The person image I and the three local part patches {P_1, P_2, P_3} go through four network branches {CNN_i}, i = 1, 2, 3, 4, each image passing through one branch. The feature vectors of the four network branches are CNN_1(I), CNN_2(P_1), CNN_3(P_2) and CNN_4(P_3) respectively, CNN_i(·) ∈ R^d. This paper uses a 3-dimensional vector PA = [α_1, α_2, α_3] to represent the absence of the three parts, where α_i = 0 if size(P_i) = 0 and α_i = 1 if size(P_i) > 0.
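As an illustration of this INPUT/NETWORK step, a minimal numpy sketch (our own, with hypothetical helper names, not the authors' code) that derives the minimum enclosing rectangles and the parts absence vector PA from a segmentation map:

```python
import numpy as np

def part_bounding_boxes(seg_map):
    """Minimum enclosing rectangle BB_i for each part label 1 (head),
    2 (torso), 3 (lower body); a missing part gets no box."""
    boxes = {}
    for label in (1, 2, 3):
        ys, xs = np.nonzero(seg_map == label)
        if ys.size > 0:  # part present in the segmentation map
            boxes[label] = (int(ys.min()), int(xs.min()),
                            int(ys.max()), int(xs.max()))
    return boxes

def parts_absence_vector(seg_map):
    """PA = [a1, a2, a3]: a_i = 1 if part i is found, else 0."""
    return [1 if np.any(seg_map == label) else 0 for label in (1, 2, 3)]

# Toy 6x4 map S: head (1) and torso (2) present, lower body (3) missing.
S = np.zeros((6, 4), dtype=int)
S[0:2, 1:3] = 1
S[2:5, 0:4] = 2
print(part_bounding_boxes(S))   # {1: (0, 1, 1, 2), 2: (2, 0, 4, 3)}
print(parts_absence_vector(S))  # [1, 1, 0]
```

A part that never appears in the map simply yields a zero bin in PA, which is what the DropParts weighting below consumes.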
The proposed DropParts method (detailed in Section 3.2) maps the parts absence vector PA to another 3-dimensional vector P̃A. The part feature vectors are then scaled and concatenated with the whole-image feature vector to obtain a fusion vector F_concate(I, P_1, P_2, P_3), where Normalize(·) is a normalization operator; this paper uses the batch normalization method [27] to normalize the features of each part branch.
Then a fully connected layer, which functions as metric learning [10,30], is used to fuse the features of the whole person image and the three body part patches: F_fuse(I, P_1, P_2, P_3 | W, b) = W F_concate(I, P_1, P_2, P_3) + b, where W and b are the weights and bias of the fully connected layer. The objective of this paper is to learn a stable and discriminative person representation F_fuse(I, P_1, P_2, P_3 | W, b).
OUTPUT: Finally, a softmax classifier [17] is used to discriminate different person IDs according to their fused CNN features.

Person Parts Localization and Parts Alignment
Semantic segmentation associates each pixel of an image with a class label. Due to the ensemble effect of labeling each pixel, bounding boxes inferred from the semantic segmentation map are more stable and accurate than those from detection methods. This paper uses the semantic segmentation map to find the bounding boxes of human body parts.
U_Net [12] is a popular semantic segmentation method which excels at biomedical image segmentation. The U_Net architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We make three modifications to adapt it for person parts segmentation. First, we reduce the number of pooling operators due to the small size of person images. Second, we add two residual structures to compensate for the depth reduction. Third, we do not reduce the size of feature maps by 2 when passing through convolutional layers; as a result, the output segmentation maps have the same size as the input images. Figure 3 illustrates the U_Net structure we used. We use its segmentation maps to find the bounding boxes of human body parts. Person images are resized to 192 × 88 and passed through the U_Net network. The size of the output semantic segmentation maps is also 192 × 88, and the segmentation maps are then resized back to the size of the original person images. Figure 4 illustrates some examples of part segmentation by super-pixels.
Bounding boxes of person parts are fixed by the parts semantic segmentation map. For a stable feature extractor, two points need to be considered: (1) large scale differences make the extracted features unstable; (2) large aspect ratio changes lead to part misalignment. This paper discards two kinds of part regions: (1) a part region whose area is below 5‰ of its corresponding person image; (2) a part region whose aspect ratio is beyond a reasonable scope. We set the reasonable scopes to [0.75, 1.33] for the head region, [1,3] for the torso region and [1,3] for the lower-body region. We crop the person images with the minimum circumscribed rectangle of the corresponding parts if they are complete.
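The two rejection rules above can be sketched as follows (a sketch under our assumptions: the aspect ratio is taken as height/width, and the function and constant names are ours, not the paper's):

```python
# Allowed aspect-ratio scopes per part, as stated in the text.
ASPECT_SCOPES = {"head": (0.75, 1.33), "torso": (1.0, 3.0), "lower": (1.0, 3.0)}

def keep_part(box, image_shape, part):
    """box = (top, left, bottom, right) in pixels; image_shape = (H, W).
    Returns False if the region is too small or its aspect ratio is off."""
    height = box[2] - box[0] + 1
    width = box[3] - box[1] + 1
    area_ratio = (height * width) / float(image_shape[0] * image_shape[1])
    if area_ratio < 0.005:  # below 5 per mille of the person image area
        return False
    lo, hi = ASPECT_SCOPES[part]
    return lo <= height / width <= hi

# A 30x20 head box in a 192x88 image: area is fine, but aspect 1.5 > 1.33.
print(keep_part((0, 0, 29, 19), (192, 88), "head"))  # False
# A 24x20 head box: aspect 1.2 falls inside [0.75, 1.33].
print(keep_part((0, 0, 23, 19), (192, 88), "head"))  # True
```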

After parts localization, the person image and three local patches are propagated forward through the proposed four-branch network, which completes parts alignment. An example is illustrated in Figure 4. Figure 4a shows an example of part misalignment: the red rectangle covers the head region in the left image, while the same location in the right image is background. We locate the three body parts and combine them with the whole image as the input of the proposed four-branch CNN network. Figure 4b,c illustrate two inputs of the proposed network, corresponding to the two images of Figure 4a. As seen from Figure 4b,c, the input patches are well aligned.

Part Missing Representation and DropParts Method
Part missing is another problem of person re-identification in complex environments; it happens in the presence of occlusion or when a part region is too small. It degrades the performance of person re-identification. This paper proposes the DropParts method to solve the part missing problem.
A normal feature fusion and metric learning are formulated as follows:

F_concate(I, P_1, P_2, P_3) = [Normalize(CNN_1(I)), Normalize(CNN_2(P_1)), Normalize(CNN_3(P_2)), Normalize(CNN_4(P_3))] (5)

F_fuse(I, P_1, P_2, P_3 | W, b) = Ŵ F_concate(I, P_1, P_2, P_3) + b (6)

In Equation (5), both normalization and non-normalization of the whole person image and part patch vectors CNN_i(·) are feasible, because the subsequent metric learning layer of Equation (6) will reweigh them.

When part missing occurs, the usual method sets the corresponding patch or feature to a zero matrix or a zero vector. However, this risks unstable training when all the numbers in a large block are zero. The norms of the feature fusion vector F_concate(I, P_1, P_2, P_3) with and without zero blocks are quite different; as a result, the parameters Ŵ and b cannot satisfy both the part missing and part non-missing cases, and the compromise solution degrades performance.
The key is to make the norm of the feature fusion vector F_concate(I, P_1, P_2, P_3) stable when part missing happens. In this paper, inspired by Dropout [31], we propose a DropParts method to deal with the part missing problem.
Dropout [31] is a technique to deal with the over-fitting problem of deep neural networks with a large number of parameters. The (l+1)-th original hidden layer is formulated as:

z^(l+1) = w^(l+1) y^(l) + b^(l+1), y^(l+1) = f(z^(l+1))

The key idea of Dropout is to randomly drop units (along with their connections) with probability p from the neural network during training. During training, Dropout samples from an exponential number of different thinned networks. With Dropout, the (l+1)-th hidden layer is:

r_j^(l) ~ Bernoulli(p), ỹ^(l) = r^(l) ∗ y^(l), z^(l+1) = w^(l+1) ỹ^(l) + b^(l+1), y^(l+1) = f(z^(l+1))

where ∗ denotes the element-wise product. At test time, the effect of averaging the predictions of all these thinned networks is approximated by simply using a single un-thinned network with smaller weights.
Dropout significantly reduces risks of over-fitting and gives major improvements over other regularization methods.
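The training/test behavior of Dropout can be sketched in a few lines of numpy (our illustration, not the paper's code; here p is the keep probability, matching the Bernoulli formulation above):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(y, p, train=True):
    """Keep each unit of y with probability p during training;
    scale by p at test time to approximate the thinned-network average."""
    if train:
        r = rng.binomial(1, p, size=y.shape)  # r_j ~ Bernoulli(p)
        return y * r                          # thinned activations r * y
    return y * p                              # test-time weight scaling

y = np.ones(8)
print(dropout_forward(y, p=0.5, train=True))   # roughly half the units zeroed
print(dropout_forward(y, p=0.5, train=False))  # every unit scaled to 0.5
```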
In the proposed DropParts method, we formulate the fusion of the whole-image feature and the local part features as Equation (14):

F_concate(I, P_1, P_2, P_3) = [Normalize(CNN_1(I)), (α_1/|PA|) Normalize(CNN_2(P_1)), (α_2/|PA|) Normalize(CNN_3(P_2)), (α_3/|PA|) Normalize(CNN_4(P_3))] (14)

where |·| is the L1 norm operator and Normalize(·) is a normalization operator; this paper uses the batch normalization method [32] to normalize the features. Here, the normalization Normalize(·) is important, because it maintains the stability of the L2-norm of the feature vectors. The vector PA = [α_1, α_2, α_3] is normalized too, by dividing it by its L1 norm. After this, the norm of the feature fusion vector F_concate(I, P_1, P_2, P_3) is stable.
Then, the metric learning is:

F(I, P_1, P_2, P_3) = W F_concate(I, P_1, P_2, P_3) + b

Samples with missing parts are not frequent, which leads to an imbalanced sample problem. To solve this problem, during training we randomly drop bins of the absence vector PA and renormalize it. Part missing can then be regarded as an instance of DropParts during training. So, at test time, the fusion feature extractor uses the same parameters W and b:

F(I, P_1, P_2, P_3) = W F_concate(I, P_1, P_2, P_3) + b (19)
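A numpy sketch of this DropParts weighting (our reading of the equations above, not the released implementation; feature dimensions are toy values):

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_parts(pa, drop_prob=0.1, train=True):
    """Optionally drop bins of the absence vector during training,
    then divide by the L1 norm so the fused norm stays stable."""
    pa = np.asarray(pa, dtype=float)
    if train:
        pa = pa * rng.binomial(1, 1.0 - drop_prob, size=pa.shape)
    l1 = np.abs(pa).sum()
    return pa / l1 if l1 > 0 else pa

def fuse(global_feat, part_feats, pa):
    """Scale each normalized part feature by its weight and concatenate
    with the whole-image feature (test-time path: no random dropping)."""
    weights = drop_parts(pa, train=False)
    scaled = [w * f for w, f in zip(weights, part_feats)]
    return np.concatenate([global_feat] + scaled)

g = np.ones(4)                    # stands in for Normalize(CNN_1(I))
parts = [np.ones(2), np.ones(2), np.ones(2)]
print(fuse(g, parts, [1, 1, 0]))  # missing part 3 -> its block is zero,
                                  # the two present parts weighted by 1/2
```

Note how a missing part contributes an exactly-zero block while the remaining part blocks are rescaled, so the L1 mass of the part weights is always 1 whenever at least one part is present.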

Network Structure and Experiment Settings
Any network can be used as the baseline of our proposed network. Taking the 34-layer ResNet [33] as an example, the architecture of our four-branch network and its feature map sizes (on the Market-1501 dataset [8]) for the input, hidden and output layers are illustrated in Table 1.
The person image size of the input layer (Branch01) is fixed by the average aspect ratio of all images of the dataset, and the sizes of the input layers of Branch02, Branch03 and Branch04 are then fixed as width/2 × width/2, height/2 × width/2, and height/2 × width/2, respectively. The person image and part patches are resized to the input sizes of the corresponding CNN branches. In consideration of the small feature size of res4, we remove the res5 module in Branch02, Branch03, and Branch04. The pool5 layer is the result of global pooling of the previous feature map. We apply our DropParts method to the pool5 feature maps of Branch02, Branch03 and Branch04 to get their scaled_pool5 feature maps, then concatenate them with the pool5 of Branch01 to get the F_concate feature map. An inner product operator is used to map the 1280-dimensional F_concate layer to the 512-dimensional F_fuse layer. Finally, we use the Softmax loss function to train the model. When testing, we use the F_concate feature map, normalized by its L2-norm, as the feature of a person for the person re-identification experiments. Euclidean distance is employed to measure the differences between person features.
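The test-time matching step can be sketched as follows (our sketch, not the released code: random vectors stand in for the 1280-dimensional F_concate features, which are L2-normalized and compared with Euclidean distance):

```python
import numpy as np

def l2_normalize(x):
    """Scale a feature vector to unit L2 norm."""
    return x / np.linalg.norm(x)

def rank_gallery(probe_feat, gallery_feats):
    """Return gallery indices sorted by Euclidean distance, best match first."""
    probe = l2_normalize(probe_feat)
    dists = [np.linalg.norm(probe - l2_normalize(g)) for g in gallery_feats]
    return np.argsort(dists)

rng = np.random.default_rng(0)
probe = rng.normal(size=1280)
# Gallery: one unrelated feature and one near-duplicate of the probe.
gallery = [rng.normal(size=1280), probe + 0.01 * rng.normal(size=1280)]
print(rank_gallery(probe, gallery))  # the near-duplicate (index 1) ranks first
```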
Our CNN networks are trained with the Caffe framework [34] on a TITAN X GPU. We use stochastic gradient descent (SGD) [35] to perform weight updates. We start with a base learning rate of η_0 = 0.01 and gradually decrease it as training progresses using a step policy with γ = 0.0001 and step_size = 10,000, where i is the current mini-batch iteration. We use a momentum of µ = 0.1 and a weight decay of λ = 0.0005.
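The schedule can be sketched as below; note that the exact formula is our assumption (the equation was lost from the text), reconstructed as Caffe's standard "step" policy η_i = η_0 · γ^⌊i/step_size⌋ from the quoted hyper-parameters:

```python
def step_lr(i, eta0=0.01, gamma=0.0001, step_size=10000):
    """Assumed Caffe step policy: multiply the base rate by gamma
    each time iteration i crosses a step_size boundary."""
    return eta0 * gamma ** (i // step_size)

print(step_lr(0))      # 0.01 for the first 10,000 iterations
print(step_lr(10000))  # drops by a factor of gamma afterwards
```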
Augmenting the training data often leads to better generalization. We carry out several primary kinds of data augmentation when training our networks: rotation, shifting, blurring, color jittering and flipping. For rotation, we rotate the image by a random angle between −30° and 30°. For shifting, we shift the image to the left, right, top or bottom by at most 5% of its width or height. For blurring, we blur the image with a 3 × 3, 5 × 5 or 7 × 7 Gaussian kernel. For color jittering, we change the brightness, saturation and contrast by at most 5% of their original values. For flipping, we flip the images horizontally with probability 0.5.
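These choices can be collected into one sampling helper (a sketch with our own function name, not the training code; applying the sampled parameters to pixels would use any image-processing library):

```python
import random

def sample_augmentation(width, height, rng=random):
    """Draw one set of augmentation parameters per the ranges in the text."""
    return {
        "rotation_deg": rng.uniform(-30, 30),          # random rotation angle
        "shift_x": rng.uniform(-0.05, 0.05) * width,   # at most 5% of width
        "shift_y": rng.uniform(-0.05, 0.05) * height,  # at most 5% of height
        "blur_kernel": rng.choice([3, 5, 7]),          # Gaussian kernel size
        "color_jitter": rng.uniform(-0.05, 0.05),      # +/-5% brightness etc.
        "flip": rng.random() < 0.5,                    # horizontal flip, p = 0.5
    }

params = sample_augmentation(88, 192)
print(sorted(params))  # the six augmentation knobs
```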

Modified U_Net Performance
First, we perform experiments on the public LIP dataset [36]. There are 20 semantic labels in the LIP dataset: background, hat, hair, glove, sunglasses, upper clothes, dress, coat, socks, pants, jumpsuits, scarf, skirt, face, left-arm, right-arm, left-leg, right-leg, left-shoe, and right-shoe. We change the number of outputs of the last layer in the U_Net architecture (Figure 3) from 4 to 20 to adapt it to the semantic segmentation task on the LIP dataset. Our proposed method is compared with current state-of-the-art methods, including SegNet [37], FCN-8s [38], DeepLabV2 [39], Attention [40], DeepLabV2 + SSL [36], Attention + SSL [36] and the standard U_Net [12]. From Table 2 it can be observed that the standard U_Net network [12] outperforms the state-of-the-art networks on the human semantic segmentation dataset LIP [36], and our modified U_Net network outperforms the standard U_Net network by 0.23% in overall accuracy, 0.26% in mean accuracy and 0.35% in mean IoU.
We group the 19 foreground semantic labels of the LIP dataset into 3 labels: head (hat, hair, sunglasses, scarf, face), torso (glove, upper clothes, dress, coat, left-arm, right-arm) and lower-body (socks, pants, jumpsuits, skirt, left-leg, right-leg, left-shoe, right-shoe), and first train the modified U_Net network on the LIP dataset with the grouped labels. We then randomly chose 300 images of people from the trainsets of Market-1501 [8], CUHK03 [17], and DukeMTMC-reID [9], and labeled them with the 4 semantic labels. Finally, we fine-tuned the modified U_Net network model, pre-trained on the LIP dataset, with the labeled data. We use the fine-tuned model for part segmentation in the proposed person re-identification method.
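The grouping can be written down directly (label names taken from the list above; the mapping helper itself is ours):

```python
# 19 foreground LIP label names grouped into the three part labels.
HEAD = {"hat", "hair", "sunglasses", "scarf", "face"}
TORSO = {"glove", "upper clothes", "dress", "coat", "left-arm", "right-arm"}
LOWER = {"socks", "pants", "jumpsuits", "skirt", "left-leg", "right-leg",
         "left-shoe", "right-shoe"}

def group_label(name):
    """Map a LIP label name to 0 (background), 1 (head), 2 (torso), 3 (lower-body)."""
    if name in HEAD:
        return 1
    if name in TORSO:
        return 2
    if name in LOWER:
        return 3
    return 0  # anything else, e.g. "background"

print(group_label("coat"))   # 2
print(group_label("skirt"))  # 3
```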

Table 2. Semantic segmentation performance on the LIP dataset.

Method                 Overall Accuracy   Mean Accuracy   Mean IoU
SegNet [37]            0.6904             0.2400          0.1817
FCN-8s [38]            0.7606             0.3675          0.2829
DeepLabV2 [39]         0.8266             0.5167          0.4164
Attention [40]         0.8343             0.5255          0.4244
DeepLabV2 + SSL [36]   0.8316             0.5255          0.4244
Attention + SSL [36]   0.8436             0.5494          0.4473
U_Net [12]             0.8499             0.5625          0.4677
U_Net (ours)           0.8522             0.5651          0.4712

Figure 5 illustrates some examples of parts semantic segmentation maps and the corresponding bounding boxes of human parts. The top row shows person images, the bottom row illustrates their part segmentation by super-pixels, and the bounding boxes of person parts are demonstrated with red rectangles. The figure illustrates the results of parts localization in different situations, including the normal situation (1st column), leg occlusion (2nd and 3rd columns), head occlusion (4th and 5th columns), detection mistakes (6th to 9th columns) and crowds (10th and 11th columns). As seen from the localization results in different situations, semantic segmentation-based part localization is stable and accurate. There are also some mistakes. As seen in the 6th and 10th columns, there are some segmentation mistakes in the torso part, which reduce the width of the bounding box of the torso part by 7.14% in the 6th column and increase the height of the bounding box of the lower-body part by 8.26% in the 10th column. We then randomly chose another set of images of people from the trainsets of Market-1501 [8], CUHK03 [17], and DukeMTMC-reID [9], and labeled their part bounding boxes to evaluate the performance of part localization with the modified U_Net. The mean IoU between the labeled bounding boxes and the inferred ones is 69.15% for the head, 82.57% for the torso and 76.78% for the lower body, respectively. This is acceptable for part localization and can be handled with data augmentation.
The Market-1501 dataset [8] consists of 32,668 images of 1,501 persons, cropped with bounding boxes predicted by the DPM detector [41]. These images are captured from 6 different cameras.

Discussion
To better understand the proposed method, we analyze it from two aspects: the effect of part alignment and the effect of DropParts. We compare three network architectures:
(1) BASE network architecture: the baseline network without part branches (the whole-image branch only).
(2) ALIGN network architecture: the same as the architecture in Table 1 but without the scaled_pool5 layer; when part occlusion occurs, a zero patch is used as a replacement.
(3) ALIGN+DROP network architecture: the same as the architecture in Table 1.
In this experiment, we report the loss curves during training and the CMC curves to evaluate the performance of the three networks above.
From Figure 6a, we can see that after 13,000 iterations of training, the losses of the ALIGN and ALIGN+DROP networks reach a very low level (<0.02), while the loss of the BASE network stays above 0.12, which signifies under-fitting of the BASE network. As a result, as seen in Figure 6b, the rank-1 accuracy of the BASE network is lower than that of ALIGN and ALIGN+DROP by almost 19%, and the CMC curves of the ALIGN and ALIGN+DROP networks stay above the CMC curve of the BASE network throughout. As seen from the loss curves and CMC curves, the addition of part alignment and feature fusion results in a significant improvement in person re-identification performance.

Effect of DropParts
We analyze the role of DropParts by comparing the ALIGN network with the ALIGN+DROP network. In Figure 6a, the loss of the ALIGN network reaches a very low level, but the loss of the ALIGN+DROP network is even lower, i.e., below 0.01. Another point is that the losses of the ALIGN+DROP network during training are more stable than those of the ALIGN network, which oscillate even at the end of training. These two points signify the good fit and easy training of the ALIGN+DROP network. This validates the first function of the DropParts method: it makes the four-branch convolutional neural network work properly when part missing occurs.
In Figure 6b, the CMC curve of the ALIGN+DROP network is always above the CMC curve of the ALIGN network. The addition of DropParts results in an improvement in the training and recognition performance of person re-identification, by 1.37% at rank-1.

In order to further analyze the role of DropParts, we investigate the performance on samples with missing parts. Table 6 gives statistics of part missing on the Market-1501 [8], CUHK03 [17] and DukeMTMC-reID [9] datasets. As seen from Table 6, part missing does not always occur. We evaluate the performance of the proposed method with and without DropParts on samples with missing parts. The results are summarized in Table 7. From Table 7, it can be seen that the proposed algorithm outperforms the method without DropParts by 7.32% at rank-1, 3.79% at rank-5 and 4.93% at mAP on average. On the Market-1501 dataset, our method outperforms the variant without DropParts by 11.86% at rank-1. This validates that the DropParts method can improve the performance of person re-identification.

Conclusions
In this paper, we presented a new deep architecture to deal with the part misalignment problem, and proposed, for the first time, a DropParts method to solve the part missing problem. Experiments on standard pedestrian datasets show the effectiveness of the proposed method.
For future work, we will continue to improve the models of part localization and matching by: (1) dividing person images into more parts, and improving the performance of part localization;
(2) designing an end-to-end model that includes both the part segmentation and re-identification tasks.

Figure 1 .
Figure 1. Examples of part misalignment caused by pose variation, background clutter, detection errors, camera point of view variation, different accessories and occlusion. Images in the top row are from the Market1501 dataset [8], and images in the bottom row are from the DukeMTMC-reID dataset [9].

Figure 2
Figure 2 illustrates the architecture of the proposed parts-aware person re-identification network, consisting of four CNN branches which learn the person appearance and three body part feature maps. The four feature maps are fused into an image descriptor. Three local patches, including the head patch, torso patch and lower-body patch, are inferred from a semantic segmentation map. Four image patches, including the whole person image and the three part patches, are resized to a fixed size and then input into the proposed four-branch network. Each branch learns the representation of one part, and the branches are finally fused by a concatenation layer and a fully connected layer. A softmax layer is used to classify the person ID.
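The data flow of the four-branch fusion can be sketched as follows. The per-branch feature extractor here is only a stand-in (simple global average pooling); in the paper each branch is a full convolutional stack, and the fully connected weights `W_fc` and all dimensions are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def branch_features(patch):
    """Placeholder for one CNN branch: channel-wise global average
    pooling of an (H, W, C) patch.  Shows only the data flow, not the
    paper's actual convolutional layers."""
    return patch.mean(axis=(0, 1))            # -> (C,) feature vector

def four_branch_forward(whole, head, torso, lower, W_fc):
    """Fuse global and three part branches: concatenate the four branch
    features, apply a fully connected layer, then softmax over person IDs."""
    feats = [branch_features(p) for p in (whole, head, torso, lower)]
    descriptor = np.concatenate(feats)        # fused image descriptor
    logits = W_fc @ descriptor                # fully connected layer
    return softmax(logits)                    # probability over person IDs
```

The key design point this illustrates is that concatenation before the fully connected layer lets the classifier weigh global appearance against each aligned part.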

Figure 2 .
Figure 2. The architecture of the proposed parts-aware person re-identification network. The network consists of four convolutional neural network branches which learn the person appearance and three body part feature maps respectively, and then fuses the four feature maps into an image descriptor. Its input images include a whole person image, a head part patch, a torso part patch and a lower-body part patch. Each branch learns the representation of the whole person image or a part patch, and the branches are finally fused by a concatenation layer and a fully connected layer.


Figure 3 .
Figure 3. U-net architecture used for part segmentation. Each box corresponds to a multi-channel feature map. The number of channels is denoted on top of the box. The height and width are provided at the lower left edge of the box. The arrows denote the different operations.

human body parts. Person images are resized to 192 × 88 and passed through the U_Net network. The size of the output semantic segmentation maps is also 192 × 88, and the segmentation maps are then resized to the same size as the original person images. Figure 4 illustrates some examples of part segmentation by super-pixels.
Figure 4b,c illustrate two inputs of the proposed network, which correspond to the two images in Figure 4a. As seen from Figure 4b,c, the input patches are well aligned.
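Deriving part patches from the segmentation map reduces to finding the bounding box of each part label; a part absent from the map is exactly the "part missing" case that DropParts handles. A minimal sketch, in which the label values assigned to head, torso and lower body are hypothetical:

```python
import numpy as np

# Hypothetical label values; the actual U_Net output classes may differ.
PART_LABELS = {"head": 1, "torso": 2, "lower_body": 3}

def part_bounding_boxes(seg_map):
    """Derive a bounding box (x0, y0, x1, y1) for each body part from a
    2-D semantic segmentation label map.  A part with no pixels in the
    map is reported as None (the 'part missing' case)."""
    boxes = {}
    for name, label in PART_LABELS.items():
        ys, xs = np.nonzero(seg_map == label)
        if ys.size == 0:
            boxes[name] = None                 # part not found
        else:
            boxes[name] = (xs.min(), ys.min(), xs.max(), ys.max())
    return boxes
```

Each non-empty box would then be cropped from the original image and resized to the fixed input size of its branch.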

Figure 4 .
Figure 4. Example of body part alignment based on part segmentation. (a) An example of part misalignment; (b) aligned image and part patches of the left image in (a); (c) aligned image and part patches of the right image in (a).

$$y_i^{(l+1)} = f\Big(\sum_j W_{ij}^{(l+1)} y_j^{(l)} + b_i^{(l+1)}\Big),$$ where $y_i^{(l+1)}$ denotes the $i$th activation value of layer $l+1$, $W^{(l+1)}$ and $b^{(l+1)}$ denote the weights and biases of layer $l+1$ respectively, and $f(\cdot)$ is the activation function.
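The fully connected activation above can be checked numerically with a few lines; the choice of tanh as the activation function here is an assumption for the example only.

```python
import numpy as np

def fc_layer(y_prev, W, b, f=np.tanh):
    """One fully connected layer, y^{(l+1)} = f(W y^{(l)} + b),
    matching the activation formula in the text."""
    return f(W @ y_prev + b)
```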

Figure 5 .
Figure 5. Examples of part semantic segmentation maps and corresponding bounding boxes of parts. The top row shows RGB images of persons, and the bottom row illustrates their part segmentation by super-pixels; bounding boxes of human parts are shown with red rectangles.


Figure 6 .
Figure 6. Loss curves and CMC curves. (a) Loss curves of the BASE, ALIGN and ALIGN+DROP networks during training; (b) CMC curves of the BASE, ALIGN and ALIGN+DROP networks during testing.

Table 1 .
The architecture of our four-branch network and its feature map size (Market-1501).

Table 2 .
Performance of human semantic segmentation on the validation split of LIP.
The Market-1501 dataset [8] was captured by six cameras, including five high-resolution cameras and one low-resolution camera. Overlap exists among different cameras. The whole dataset is divided into a training set with 12,936 images of 751 persons and a testing set with 3,368 query images and 19,732 gallery images of 750 persons. The CUHK03 dataset [17] includes 13,164 images of 1,360 people captured by six cameras. Each identity appears in two disjoint camera views (i.e., 4.8 images per view on average). The dataset is partitioned into a training set (1,160 persons), a validation set (100 persons) and a test set (100 persons). The DukeMTMC-reID dataset [9] consists of 1,404 identities appearing in more than two cameras and 408 identities (distractor IDs) who appear in only one camera. There are 16,522 training images of 702 identities, 2,228 query images of the other 702 identities, and 17,661 gallery images (702 IDs + 408 distractor IDs).

Table 7 .
Performance of ALIGN and ALIGN +DROP methods on part missing samples.