PARNet: A Joint Loss Function and Dynamic Weights Network for Pedestrian Semantic Attributes Recognition of Smart Surveillance Image

Abstract: The capability to recognize pedestrian semantic attributes, such as gender and clothes color, is of practical significance in bank smart surveillance, intelligent transportation and so on. In order to recognize the key attributes of pedestrians in indoor and outdoor scenes, this paper proposes a deep network with dynamic weights and a joint loss function for pedestrian key attribute recognition. First, a new multi-label and multi-attribute pedestrian dataset, named NEU-dataset, is built. Second, we propose a new deep model based on the DeepMAR model. The new network develops a loss function that joins the Sigmoid and Softmax losses to solve the multi-label and multi-attribute problem. Furthermore, a dynamic weight in the loss function is adopted to solve the unbalanced samples problem. The experimental results show that the new attribute recognition method has good generalization performance.


Introduction
In recent years, video surveillance has been widely used in various fields, bringing convenience and protection to many aspects of human life. However, the massive volume of surveillance video hinders the rapid search for useful information, and processing these data requires a large amount of manpower and material resources. For image or video analysis, the recognition of semantic information is a key step in the intelligent processing and analysis of big data. Pedestrians are an important target in images and video, and pedestrian attributes are the semantic information of pedestrians. The recognition of pedestrians' semantic attributes has important application value in many fields. Effective recognition of pedestrians and their semantic attributes can not only improve the working efficiency of video surveillance staff, but also play an important role in video retrieval [1], pedestrian behavior analysis, identity recognition, scene analysis, and pedestrian re-identification [2]. In addition, pedestrian semantic attribute recognition has been widely used in intelligent transportation, banking, safe city, public safety and so on [3,4].
At present, there is no complete definition of pedestrian attributes (semantic information) in the surveillance scene. In studies of pedestrian semantic attribute recognition, gender, appearance, action, etc. are usually identified as semantic attributes [2][3][4]. In recent years, many researchers have proposed effective methods for recognizing basic pedestrian semantic attributes in videos acquired by a camera sensor. These methods fall mainly into three kinds: the pedestrian parts-based method, the whole pedestrian-based method, and the global and local fusion method [4]. The pedestrian parts-based method first detects the positions of the pedestrian's head, upper body, lower body, feet, hat, bag and other sub-components and appendages, and then identifies attributes according to the detected parts. The whole pedestrian-based method performs semantic analysis of the whole pedestrian image: the outlines of the sub-components and attachments are segmented, and the segmented outlines are then identified. The global and local fusion method combines local features with global features to identify the attributes. All three approaches can be implemented with either traditional machine learning methods or deep learning methods.
Pedestrian attribute recognition using machine learning methods mainly extracts features from the pedestrian region and then uses a classifier to identify the attributes. For example, Layne et al. [5,6] used pedestrian features and Support Vector Machines (SVM) to identify pedestrian attributes. Deng et al. [4] adopted an intersection-kernel support vector machine model and a Markov Random Field (MRF) for pedestrian attribute recognition. Those methods detect the pedestrians and appendages and extract traditional features (such as grayscale, texture, SIFT, HOG, LBP, etc.), or directly extract the pedestrian characteristics; a classifier is then used for classification. Therefore, those methods rely on the design and extraction of efficient feature descriptors. In particular, the appearance of pedestrians in real scenes can change dramatically under different camera conditions, such as changes in viewing angle, illumination, scale, occlusion, and pose, which weakens the expressive ability of the feature descriptors and results in decreased search accuracy.
In recent years, deep learning has also been widely used in pedestrian semantic attribute recognition. The common approach based on the whole pedestrian image treats the pedestrian region as a whole and identifies the pedestrian attributes with a deep network such as a CNN or RNN. For example, Wang et al. [2] proposed the JRL model, based on an RNN, to study the correlation of attributes in a pedestrian image, that is, the correlation of attribute prediction sequences. The JRL model mines the attribute context information and the relationships between attributes to improve the recognition accuracy. Sudowe et al. proposed the Attribute Convolutional Net (ACN), which learns different attributes by jointly training the whole CNN model [7]. Tian et al. proposed the TA-CNN network, which uses a variety of databases to learn many types of attributes [8]. The model combines pedestrian attributes with scene attributes and pedestrian detection for the whole pedestrian image. Li et al. [9] proposed the DeepMAR network model (a deep learning-based multi-attribute joint recognition model), which uses prior knowledge in the objective function to identify the attributes. This kind of method usually crops out pedestrian samples, inputs them into the CNN classifier, and outputs multiple pedestrian attribute labels. In addition, there are other pedestrian attribute recognition methods that apply CNNs to part information. For example, Gkioxari et al. [10] used a CNN for human body part detection and then classified human attributes and actions with a CNN. Yu et al. [11] designed a weakly supervised pedestrian attribute localization network based on GoogLeNet, in which the attribute labels are predicted from the detection results of mid-level attribute-related features instead of directly from the whole human sample. In addition, Li et al. integrated parts detected by poselets with the whole pedestrian, used human-centric and scene-centric context information, and extracted deep features to identify pedestrian attributes [12]. However, this contextual information cannot always be used in monitoring scenarios.
In the above research, the recognition of pedestrian attributes in the monitoring scene still has some problems. For example, the quality of the images in the datasets is poor, appearance varies and attributes may appear at different spatial positions, few training samples are labeled, the attributes usually do not have the same distribution, and the samples are imbalanced [13][14][15]. These problems affect the training of the network model and the accuracy of pedestrian attribute recognition. Therefore, we present a new pedestrian attribute recognition algorithm for important attributes in video surveillance, as shown in Figure 1. This method improves the loss function to reduce the impact of sample imbalance and to achieve multi-attribute and multi-label pedestrian attribute recognition.
Appl. Sci. 2019, 9, x FOR PEER REVIEW
The main contributions of this paper are as follows:
(1) We construct a new pedestrian attribute dataset with multiple attributes and with multiple labels within the same attribute. The dataset contains not only indoor images but also many outdoor pedestrian images. The dataset is much richer and includes both binary-class and multi-class labels.
(2) We combine multi-task learning and multi-label learning. This differs from existing methods, which express and identify pedestrian attributes in a binary classification way. The proposed method covers both the binary classification problem and the multi-class classification problem within the same attribute. We propose a loss function based on the combination of the Sigmoid and Softmax losses, which solves the problem of multiple labels within the same attribute and the multi-attribute identification problem at the same time.
(3) To address the imbalance of data samples, we propose a dynamic weight in the loss function, which adaptively adjusts the weight proportion of positive and negative samples.

Pedestrian Attributes Dataset Construction
Common pedestrian attribute recognition datasets include PRID (400 images) [16], GRID (500 images) [17], the APiS dataset (3661 images) [18], the VIPeR dataset (1264 images) [19] (annotated by Layne et al. [6]), the PETA dataset (19,000 images), the RAP dataset (41,585 images) and so on. Among them, the PRID, GRID and APiS datasets contain outdoor scenes; the PETA dataset mixes indoor and outdoor scenes, includes 8705 persons, and is composed of 10 datasets such as VIPeR and 3DPeS. The RAP dataset contains indoor scenes and is the largest pedestrian attribute dataset, with 72 attributes, different perspectives, different lighting, and information on different body parts. Several pedestrian samples are shown in Figure 1; several corresponding pedestrian attributes are presented in Table 1. Each pedestrian attribute represents pedestrian semantic information. To serve applications in complex scenes, the dataset needs both indoor and outdoor scenes. Because the labels of the existing datasets differ, we cannot directly merge existing indoor and outdoor pedestrian attribute datasets. Besides, the labels of existing datasets have only two values, that is, the datasets are used for binary classification, whereas a multi-class classification problem is more challenging and practical. Moreover, the number of images in the PETA dataset is not large enough. Thus, starting from the RAP dataset, we add a large number of samples selected from the Internet and from video surveillance images. In these selected images, the spatial positions of the pedestrians differ, and so do the resolutions of the pedestrian images. Then we select eleven attribute labels that receive a high degree of attention (each label corresponds to a semantic attribute), and a new dataset with more information is created for pedestrian attribute recognition.
Table 1. Several examples of pedestrian attribute labels (the label corresponding to YES is 1 and the label corresponding to NO is 0).
Columns: Number, Gender, Hat, Upcolor White, Upcolor Black, Lowercolor White, Lowercolor Black.
In this paper, we use the ground truth creation tool interface shown in Figure 2 to make the labels. First, we input a sample pedestrian image and crop out the pedestrian part. Then, we mark each attribute of the pedestrian part separately. The attributes, the values of each attribute label, and the number of samples of each attribute in the dataset are shown in Table 2. Different from the attribute labels shown in Table 1, the dataset labels made in this paper contain multi-label and multi-attribute pedestrian attributes. For example, 11 kinds of color are included in the upper and lower clothing color attributes, and each label value from 0 to 10 represents a different color. In order to describe the attributes of a coat better, the coat color attribute is divided into Upcolor 1 and Upcolor 2 in our dataset. For those color attributes, positive and negative samples are not distinguished, and the number of samples of each color is assumed to be similar. We name our dataset NEU-dataset. The dataset contains a total of 25,893 pedestrian images with large variation in background, illumination and viewpoint, and a total of 11 attributes, three of which have more than two labels.

Proposed Method
In the more common networks, each attribute is usually considered independent. In fact, there is a certain correlation between attributes. As shown in Figure 1 and Table 1, there are several related attributes in an image, such as gender, the color of the clothes, and the type of backpack. In order to exploit the relevance of the attributes in the same image, Ref. [9] proposed the DeepMAR network model, which learns all the attributes in the same image at the same time and makes full use of the correlation between attributes. Instead of classifying each attribute separately, we improve on the DeepMAR network model and propose a multi-attribute and multi-label network model. The pedestrian attribute recognition network model is shown in Figure 3. The proposed network model includes convolutional layers, max pooling layers, norm layers, a fully connected layer, and the ReLU activation function. For the input image of the network, the image features are extracted by a ConvNet. The ConvNet is mainly composed of convolution and pooling operations, e.g., AlexNet [20], VGG-16 [21] and ResNet [22]. Then the feature maps extracted by the ConvNet are connected using Fc7 (a fully connected layer) to generate a high-dimensional feature vector. Next, low-dimensional feature vectors for two-class and multi-class classification are generated by 1 × 1 convolution. In order to prevent over-fitting, the Dropout method is used in the fully connected layer. Finally, the two-class attributes and the multi-class attributes are predicted using the Sigmoid function and the Softmax function, respectively. The losses of the Sigmoid function and the Softmax function are combined for training the network model.
In the proposed network, we use the convolutional network model based on the VGG-16 model [21]; the ConvNet in Figure 3 is the VGG-16 model.
For data containing N pedestrian images, each image has K pedestrian attributes, and the maximum value of a single attribute label is L. For a pedestrian image $x_i$ ($i = 1, 2, \ldots, N$) in the data, the label corresponding to the l-th attribute is $y_{il}$ ($l = 1, 2, \ldots, K$), and the value of the label is $y_{il} \in \{0, 1, \ldots, L\}$.
In the whole network, m represents the index of the layer. The output of $x_i$ at the m-th layer is as follows:
$$x_i^{(m)} = \varnothing\left(W^{(m)} x_i^{(m-1)} + b^{(m)}\right) \quad (1)$$
where $W^{(m)}$ and $b^{(m)}$ are the weight matrix and bias of the m-th layer, $x_i^{(0)} = x_i$ is the input image, and $\varnothing(\cdot)$ is the activation function.
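The layer-by-layer computation described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: every layer is treated as a dense (fully connected) map with a ReLU activation, and the names `layer_forward` and `network_forward` are hypothetical rather than taken from the paper or from Caffe.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Sequence[float]) -> Vector:
    """Element-wise ReLU activation."""
    return [max(0.0, a) for a in v]


def layer_forward(W: Matrix, b: Sequence[float], x: Sequence[float],
                  phi: Callable[[Sequence[float]], Vector] = relu) -> Vector:
    """Output of one layer: phi(W x + b)."""
    pre = [sum(w * a for w, a in zip(row, x)) + bias
           for row, bias in zip(W, b)]
    return phi(pre)


def network_forward(layers: Sequence[Tuple[Matrix, Sequence[float]]],
                    x: Sequence[float]) -> Vector:
    """Feed x through the layers in order; layer m consumes the output of layer m-1."""
    out = list(x)
    for W, b in layers:
        out = layer_forward(W, b, out)
    return out


# Tiny two-layer example with made-up weights.
toy_layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),
    ([[2.0, 1.0]], [0.5]),
]
toy_output = network_forward(toy_layers, [2.0, 1.0])
```

In the actual network the early layers are convolutions rather than dense maps, but the recursion over the layer index m has the same shape.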

Joint Loss Function
In the DeepMAR network model, the loss function considers all the attributes at the same time, and the Sigmoid cross-entropy loss is used. The function expression is:
$$Loss = -\frac{1}{N}\sum_{i=1}^{N}\sum_{l=1}^{K}\left[y_{il}\log(p_{il}) + (1 - y_{il})\log(1 - p_{il})\right] \quad (2)$$
where $p_{il}$ is the output probability of the l-th attribute of sample $x_i$. Generally, the color attributes of clothes have many categories; if only two categories are used, conflicts in clothes color recognition will arise. Because the attributes in our dataset are not all labeled with just the two values 0 and 1, some attributes pose a multi-label problem, which the Softmax loss handles well. Therefore, we improve the above loss function and propose a cross-entropy loss function that combines Sigmoid and Softmax. In its Softmax term, $-\log\left(e^{z_k} / \sum_{i} e^{z_i}\right)$, k is the neuron corresponding to the input image label, and $z_i$ is the input of the i-th neuron.
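To make the idea of joining the two losses concrete, the sketch below computes a Sigmoid cross-entropy over the two-class attributes and a Softmax cross-entropy over each multi-class attribute for a single image. It assumes the two parts are simply added with equal weight, which is an assumption and not necessarily how the paper combines them; the function names and the toy logits are hypothetical.

```python
import math
from typing import Dict, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_ce(logits: Sequence[float], labels: Sequence[int]) -> float:
    """Sigmoid cross-entropy summed over the two-class attributes of one image."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def softmax_ce(logits: Sequence[float], k: int) -> float:
    """Softmax cross-entropy -log(exp(z_k) / sum_i exp(z_i)) for one multi-class attribute."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[k]


def joint_loss(binary_logits: Sequence[float], binary_labels: Sequence[int],
               multi_logits: Dict[str, Sequence[float]],
               multi_labels: Dict[str, int]) -> float:
    """Assumed combination: Sigmoid part plus one Softmax part per multi-class attribute."""
    loss = sigmoid_ce(binary_logits, binary_labels)
    for name, logits in multi_logits.items():
        loss += softmax_ce(logits, multi_labels[name])
    return loss


# One image: a two-class attribute (e.g. Hat) and a 3-way toy color attribute.
example_loss = joint_loss([0.0], [1], {"Upcolor 1": [0.0, 0.0, 0.0]}, {"Upcolor 1": 1})
```

In the dataset described here the color attributes would have 11 outputs (labels 0 to 10) rather than the 3 used in this toy call.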

Dynamic Weights for Joint Loss Function
Because all the attributes of the same image are used together, the dataset poses the problem of positive and negative sample imbalance, as shown in Table 2. For example, the numbers of positive and negative samples of the Hat attribute and the Calling attribute differ greatly. The imbalance of positive and negative samples skews the training results toward the class with the larger number of samples, which has a large impact on recognition performance. To address sample imbalance, the DeepMAR model computes statistics over the whole sample set and weights the samples to balance them. In order to make the weights suitable for different datasets, we set the positive and negative sample weights as dynamic parameters, which replace the fixed parameters. Based on formulas (2) and (4) and the DeepMAR model, we improve the loss function and propose a dynamic sample weight method. In the forward pass, the numbers of positive and negative samples of the attributes that have only two categories are counted in each batch. The predicted values of the samples are calculated according to formula (9). When j = b, that is, after all the batches have completed the forward pass, the positive sample ratio of the entire sample set is calculated and put into the Gaussian function to obtain the sample weight. In the improved loss function, B is the number of samples in the j-th batch, $p_{lj}$ refers to the positive samples of the l-th attribute in the j-th batch, and the indicator $\mathbb{1}(y_{il} = 1)$ counts 1 when the l-th attribute label of the i-th sample is 1. The weight is set to a Gaussian function, where σ is the adjustment parameter; in the experiments of this paper, σ = 0.01. We use the modified loss function as the final loss function. By combining (5) and (7), training the proposed network is formulated as an optimization problem over the network parameters.
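The counting-then-weighting procedure can be illustrated with a small sketch. It accumulates the positive counts of one two-class attribute over all batches, computes the overall positive ratio, and maps that ratio through a Gaussian-shaped function to obtain positive and negative weights. The exact Gaussian expression is not reproduced in the text above, so the form used in `gaussian_weights`, the value `sigma=0.5`, and all function names are assumptions chosen only for illustration (the paper reports σ = 0.01 for its own formula).

```python
import math
from typing import Sequence, Tuple


def positive_ratio(batches: Sequence[Sequence[int]]) -> float:
    """Share of positive labels (value 1) for one attribute, accumulated over all batches."""
    positives = 0
    total = 0
    for labels in batches:
        positives += sum(1 for y in labels if y == 1)  # indicator 1(y == 1)
        total += len(labels)
    return positives / total


def gaussian_weights(ratio: float, sigma: float) -> Tuple[float, float]:
    """Assumed Gaussian-shaped weights: the rarer class receives the larger weight."""
    w_pos = math.exp(-(ratio ** 2) / (2.0 * sigma ** 2))
    w_neg = math.exp(-((1.0 - ratio) ** 2) / (2.0 * sigma ** 2))
    return w_pos, w_neg


def weighted_sigmoid_ce(p: float, y: int, w_pos: float, w_neg: float) -> float:
    """Sigmoid cross-entropy for one prediction p with class-dependent weights."""
    return -(w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1.0 - p))


# Two toy batches of a rare attribute (2 positives out of 8 samples).
toy_batches = [[1, 0, 0, 0], [0, 0, 1, 0]]
toy_ratio = positive_ratio(toy_batches)
toy_w_pos, toy_w_neg = gaussian_weights(toy_ratio, sigma=0.5)
```

Because the weights are recomputed from the observed label counts, the same code adapts to a different dataset without hand-tuned per-attribute constants, which is the motivation given for making the weights dynamic.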

Optimization
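To optimize (11), we employ the stochastic sub-gradient descent method to obtain the parameters $W^{(m)}$ and $b^{(m)}$. They are updated with the gradient descent algorithm as follows until convergence:
$$W^{(m)} \leftarrow W^{(m)} - \lambda \frac{\partial Loss}{\partial W^{(m)}}, \qquad b^{(m)} \leftarrow b^{(m)} - \lambda \frac{\partial Loss}{\partial b^{(m)}}$$
where λ is the learning rate.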
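As a minimal illustration of the update rule, the sketch below applies repeated gradient steps with learning rate λ to a toy quadratic objective until the parameters stop changing. The objective, the stopping tolerance, and the names `gradient_step` and `minimize` are hypothetical; in the real network the gradients come from backpropagation through the joint loss.

```python
from typing import Callable, List, Sequence


def gradient_step(params: Sequence[float], grads: Sequence[float], lam: float) -> List[float]:
    """One update: theta <- theta - lam * dLoss/dtheta for every parameter."""
    return [p - lam * g for p, g in zip(params, grads)]


def minimize(grad_fn: Callable[[Sequence[float]], List[float]],
             params: Sequence[float], lam: float,
             tol: float = 1e-8, max_iter: int = 10000) -> List[float]:
    """Repeat gradient steps until the largest parameter change falls below tol."""
    current = list(params)
    for _ in range(max_iter):
        updated = gradient_step(current, grad_fn(current), lam)
        if max(abs(a - b) for a, b in zip(updated, current)) < tol:
            return updated
        current = updated
    return current


def toy_grad(theta: Sequence[float]) -> List[float]:
    """Gradient of the toy loss (w - 3)^2 + (b + 1)^2."""
    w, b = theta
    return [2.0 * (w - 3.0), 2.0 * (b + 1.0)]


w_opt, b_opt = minimize(toy_grad, [0.0, 0.0], lam=0.1)
```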
To optimize all the network parameters, the Adam algorithm is selected in this paper. Moreover, in order to normalize the local input regions, speed up supervised learning, and increase network performance, we add LRN (Local Response Normalization) after each convolution layer in the network.
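For readers unfamiliar with LRN, the following sketch shows the across-channel variant at a single spatial position: each activation is divided by a power of a term that grows with the squared activations of neighboring channels. The hyperparameter values (`local_size`, `alpha`, `beta`, `k`) are placeholder assumptions, since the text does not state the ones used, and the function name is hypothetical.

```python
from typing import List, Sequence


def lrn_across_channels(a: Sequence[float], local_size: int = 5,
                        alpha: float = 1e-4, beta: float = 0.75,
                        k: float = 1.0) -> List[float]:
    """Normalize the channel activations a[c] at one spatial position.

    Each value is divided by (k + alpha / local_size * sum of squares over a
    window of neighboring channels) ** beta.
    """
    n_channels = len(a)
    half = local_size // 2
    out = []
    for c in range(n_channels):
        lo = max(0, c - half)
        hi = min(n_channels - 1, c + half)
        sq_sum = sum(a[j] ** 2 for j in range(lo, hi + 1))
        scale = (k + alpha / local_size * sq_sum) ** beta
        out.append(a[c] / scale)
    return out


normalized = lrn_across_channels([1.0, 2.0, 3.0, 4.0], local_size=3, alpha=1.0)
```

Large activations in a neighborhood of channels suppress one another, which is the sense in which the layer normalizes local input regions.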

Results
In this section, the experiments on two datasets are implemented for attribute recognition.We briefly introduce the datasets, experimental environment, metrics, and results of the proposed method and comparison methods.The experimental results empirically validate the effectiveness of the proposed method.

Datasets
In order to verify the effectiveness of the proposed algorithm, we experiment on two datasets, i.e., the proposed NEU-dataset and the RAP dataset. The NEU-dataset is divided into three parts: 20,000 randomly selected images are used as training samples, 2707 images are used for validation, and the remaining 3086 images are used as testing samples. In the RAP dataset, 75% of the images are randomly selected as training samples and the rest are test samples. For training and testing, we resize each image to 256 × 256 and then crop a 227 × 227 region as the network input.
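The split and crop steps can be sketched as follows. The text does not say whether the 227 × 227 crop is taken at the center or at a random position, so `crop_box` supports both as an assumption; the helper names and the fixed seed are hypothetical.

```python
import random
from typing import List, Optional, Tuple


def split_indices(n_total: int, n_train: int, n_val: int,
                  seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Randomly partition image indices into training, validation and test sets."""
    idx = list(range(n_total))
    random.Random(seed).shuffle(idx)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def crop_box(resized: int = 256, crop: int = 227,
             rng: Optional[random.Random] = None) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of a crop inside a resized square image.

    With rng=None the crop is centered; otherwise the offset is drawn at random.
    """
    max_offset = resized - crop
    if rng is None:
        left = top = max_offset // 2
    else:
        left = rng.randint(0, max_offset)
        top = rng.randint(0, max_offset)
    return left, top, left + crop, top + crop


train_idx, val_idx, test_idx = split_indices(10, 6, 2)
center_box = crop_box()
```

Random offsets at training time and a centered crop at test time is a common convention, but that choice is not confirmed by the text.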

Implementation
Our experiments are implemented in the Caffe framework. The experimental environment is a 64-bit Windows 7 operating system with an Intel(TM) i5-4660 CPU @ 3.20 GHz, 8 GB of memory, and an NVIDIA GeForce GTX TITAN X GPU.
To speed up training and obtain a better approximation of the network parameters, all network layers in the proposed algorithm are fine-tuned from the VGG-16 Caffe model. Fine-tuning is widely used with deep networks and can shorten the learning time. The network is optimized by the Adam algorithm. In this experiment, the initial learning rate is 0.0001, the weight decay is 0.005, and the dropout ratio is 0.5.

Evaluation Metrics
The NEU-dataset is first evaluated with basic measures, mainly the average accuracy, the positive sample recognition rate (PRR), and the negative sample recognition rate (NRR) of each attribute. In addition, we use one label-based metric (mA) and four example-based metrics: Acc (Accuracy), Prec (Precision), Rec (Recall), and F1. These five indicators evaluate the attribute recognition algorithm [4,23,24] and are calculated as follows:
$$mA = \frac{1}{2K}\sum_{i=1}^{K}\left(\frac{|TP_i|}{|P_i|} + \frac{|TN_i|}{|N_i|}\right)$$
$$Acc = \frac{1}{N}\sum_{i=1}^{N}\frac{|Y_i \cap f(x_i)|}{|Y_i \cup f(x_i)|}$$
$$Prec = \frac{1}{N}\sum_{i=1}^{N}\frac{|Y_i \cap f(x_i)|}{|f(x_i)|}$$
$$Rec = \frac{1}{N}\sum_{i=1}^{N}\frac{|Y_i \cap f(x_i)|}{|Y_i|}$$
$$F1 = \frac{2 \cdot Prec \cdot Rec}{Prec + Rec}$$
where K is the number of attributes, and $|TP_i|$ and $|TN_i|$ are the numbers of correctly predicted positive samples and negative samples, respectively, for the i-th attribute. $|P_i|$ and $|N_i|$ are the numbers of positive samples and negative samples of the i-th attribute in the ground truth. N is the number of samples. In the example-based metrics, $Y_i$ is the set of positive labels of the i-th sample in the ground truth, and $f(x_i)$ is the set of positive labels predicted for the i-th sample.
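The five indicators can be computed directly from binary label matrices, as the sketch below shows. It assumes every attribute is encoded as 0/1 (so a multi-class attribute would first have to be expanded into one column per class), and it returns 0 for any ratio with an empty denominator; both choices, like the function names, are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def _ratio(num: float, den: float) -> float:
    """Safe division; an empty denominator is treated as 0 (an assumption)."""
    return num / den if den else 0.0


def mean_accuracy(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Label-based mA: average of positive and negative recall over the K attributes."""
    n_attr = len(y_true[0])
    total = 0.0
    for k in range(n_attr):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 1)
        tn = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 0 and p[k] == 0)
        pos = sum(1 for t in y_true if t[k] == 1)
        neg = len(y_true) - pos
        total += _ratio(tp, pos) + _ratio(tn, neg)
    return total / (2 * n_attr)


def example_based(y_true: Sequence[Sequence[int]],
                  y_pred: Sequence[Sequence[int]]) -> Tuple[float, float, float, float]:
    """Example-based Acc, Prec, Rec and F1 over the N samples."""
    n = len(y_true)
    acc = prec = rec = 0.0
    for t, p in zip(y_true, y_pred):
        truth = {k for k, v in enumerate(t) if v == 1}   # Y_i
        pred = {k for k, v in enumerate(p) if v == 1}    # f(x_i)
        inter = len(truth & pred)
        acc += _ratio(inter, len(truth | pred))
        prec += _ratio(inter, len(pred))
        rec += _ratio(inter, len(truth))
    acc, prec, rec = acc / n, prec / n, rec / n
    return acc, prec, rec, _ratio(2 * prec * rec, prec + rec)


# Two samples, three binary attributes.
truth: List[List[int]] = [[1, 0, 1], [0, 1, 0]]
preds: List[List[int]] = [[1, 0, 0], [0, 1, 1]]
m_a = mean_accuracy(truth, preds)
acc, prec, rec, f1 = example_based(truth, preds)
```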

Figure 2. Ground truth production tools interface map.

Figure 3. The proposed attributes recognition network model.