Applied Sciences
  • Article
  • Open Access

25 December 2019

Person Search via Deep Integrated Networks

Department of Computer Science and Information Engineering, National Kaohsiung University of Science and Technology, Kaohsiung City 8078, Taiwan
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Actionable Pattern-Driven Analytics and Prediction

Abstract

This study proposes an integrated deep network consisting of a detection module and an identification module for person search. Person search is very challenging because of the large appearance variations caused by occlusion, background clutter, pose changes, etc., and it remains an active research issue in both academia and industry. Following the protocols of the person re-identification (ReID) benchmarks, most existing works take cropped pedestrian images obtained either from manual labelling or under a perfect detection assumption. However, manual processing is unavailable in practical person search applications, causing a gap between the ReID problem setting and practice; these works also ignore the fact that imperfect auto-detected bounding boxes and misalignment are inevitable. We design herein a framework for practical surveillance scenarios in which whole scene images are captured. For person search, detection is a necessary step before ReID, and previous studies have shown that the precision of the detection results influences person ReID. A detection module based on the Faster R-CNN is used to detect persons in a scene image. To extract discriminative features for identification, a multi-class CNN is trained with the auto-detected bounding boxes from the detection module instead of manually cropped data. A distance metric is then learned from the discriminative features output by the identification module. According to the experimental results on scene images, the multi-class CNN for the identification module provides a 62.7% accuracy rate, which is higher than that of the two-class CNN.

1. Introduction

With the rapid development of information technology, security and surveillance systems have been installed in many places, including schools, department stores, train stations, airports and office buildings, for public safety. Cameras are usually installed with non-overlapping views to enlarge the secured area and strike a balance between equipment cost and security coverage. From the data mining perspective, surveillance intelligence can be derived from sequential data (videos) [1,2,3], in which abnormal or specific patterns (e.g., vehicles, pedestrians) can be found. Several subjects (e.g., abnormal event detection, object detection, tracking, face recognition, and person re-identification) have been explored in computer vision [4,5,6]. Person search, first introduced in [7], plays an important role in intelligent surveillance systems. Xu et al. [7] proposed a sliding-window search strategy based on person detection and person matching scores to find a person’s image captured by one camera (i.e., the probe image) in a gallery of scene images [8,9], where the probe and gallery images are captured from different viewpoints.
Person search is an extended form of person re-identification (person ReID) and is designed to find a probe person in a gallery of scene images. Large visual variations caused by changes in illumination, occlusion, background clutter, different viewpoints and human poses make it a challenging problem [8,9]. Large intra-class variations, which include different appearances of the same person, increase the difficulty of feature representation. Besides this, person search suffers from the class imbalance problem known in machine learning, in which the amount of data for one class is far less than that for another [10]: there is a large set of “different” pairs (i.e., combinations of two images from different persons) but only a limited set of “same” pairs (i.e., combinations of two images from the same person). The similarity measurement is therefore difficult to learn, and data sampling techniques are often utilized during training [11,12]. Existing research on person re-identification can be roughly divided into two categories. One extracts discriminative features from a person’s image [13,14], while the other learns a matrix-form metric to measure the similarity between persons’ images [15]. With the great success achieved by deep learning in computer vision [16,17,18], some recent works address person ReID with deep networks in an end-to-end architecture, meaning that, given the input data and the desired output labels, the system is trained automatically as a whole [19]. Three kinds of deep models are used in person ReID: Siamese networks [13,15,20], classification networks [21,22,23] and triplet networks [24,25]. Additionally, many works have focused on specific issues such as occlusion [26], misalignment [27,28,29] and over-fitting [19,23,30,31].
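For concreteness, the matrix-form metric mentioned above is typically a Mahalanobis-style distance d(x, y) = (x − y)ᵀ M (x − y), where M is a positive semi-definite matrix learned from data. The following is a minimal NumPy sketch of this form; the feature dimension and the construction of M are illustrative assumptions, not details taken from any of the cited methods.

```python
import numpy as np

def mahalanobis_distance(x, y, M):
    """Squared Mahalanobis-style distance (x - y)^T M (x - y).

    M must be positive semi-definite so the distance is non-negative.
    """
    d = x - y
    return float(d @ M @ d)

# Toy example: build a PSD matrix M = L^T L from a random L.
rng = np.random.default_rng(0)
dim = 8                       # illustrative feature dimension
L = rng.normal(size=(dim, dim))
M = L.T @ L                   # M = L^T L is positive semi-definite

x, y = rng.normal(size=dim), rng.normal(size=dim)
print(mahalanobis_distance(x, y, M))
```

Metric learning methods such as the equivalence-constraint approach of [12] differ mainly in how M is estimated from the “same” and “different” training pairs.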
Although numerous works have been proposed for person ReID, the gallery or probe pedestrian images are manually cropped in most benchmarks [13,32,33,34]. Following the protocols of these benchmarks, most existing works have taken cropped pedestrian images obtained either from manual labelling or under a perfect detection assumption [9]. However, in practical applications, manual processing is time-consuming and unavailable, thereby causing a gap between person ReID and practical applications [9]. Although pedestrian detection by deep learning models such as the Faster R-CNN [35], the Single Shot MultiBox Detector (SSD) [36] and Person of Interest [37] performs significantly better than traditional methods [38], imperfect bounding boxes, background clutter, misdetections, misalignment and false alarms are inevitable. In other words, the detection results affect person ReID; however, most studies treat pedestrian detection and person ReID as separate issues [9]. The person search task has recently been introduced to close this gap. Person search is more challenging than person ReID because of imprecise detection cropping and misalignment [8,9]. In 2017, Xiao et al. [9] proposed a deep learning framework that jointly optimizes pedestrian detection and person ReID in a single convolutional neural network (CNN), with an online instance matching loss function to train the network. Liu et al. [39] proposed a neural search model that recursively refines the location of the target person in the scene. In 2018, Lan et al. [8] observed that auto-detected bounding boxes often differ from those of traditional benchmarks in scale (resolution); accordingly, they proposed a cross-level semantic alignment to address the multi-scale matching problem in person search.
In this study, we designed a framework for practical surveillance scenarios in which the scene images are captured from different viewpoints and regions, and pedestrian matching is performed automatically on these images. Pedestrian detection is a necessary step before person matching. Computational efficiency could be improved if the features extracted in the detection step could be reused in the subsequent ReID process. Moreover, image variances are taken into account by training the ReID model on the auto-detected bounding boxes, which increases the model’s tolerance at test time. We propose an integrated deep network consisting of a detection module and an identification module. Two scene images are input to the detection network. The detection module is based on the Faster R-CNN [35], which provides precise bounding boxes; the person regions are detected and cropped, and each cropped region is then input to the identification module. In previous studies [13,15,20,40,41], person ReID was cast as a two-class classification problem, with each training pair labelled “same” or “different” according to whether the two images came from the same person. Instead of training the identification module as a two-class classifier, we apply a multi-class convolutional neural network (CNN), in which each person is considered a class and assigned an identity (ID); the amount of training data for each class is then almost the same. The discriminative features are extracted by the identification module [11]. Finally, person matching is performed by computing the similarity between the probe image and the gallery images based on the learned distance metric; after sorting the similarities, the top-k results (k is a user-defined value) are shown. Note that the proposed system not only adds a detection module before identification, but also considers the use of shared features in the system design to improve computational efficiency. Moreover, in a practical scenario it would not be feasible to retrain the network whenever a newcomer’s images are captured; hence, a flexible framework was designed by training a multi-class CNN for discriminative feature extraction and learning a distance metric for similarity measurement.
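To make the overall flow concrete, the following is a schematic Python sketch of the search procedure described above. The functions detect_persons, extract_feature and metric are hypothetical placeholders standing in for the Faster R-CNN detector, the multi-class CNN feature extractor and the learned distance metric; none of these names comes from the paper’s implementation.

```python
def person_search(probe_img, scene_imgs, detect_persons, extract_feature, metric, k=10):
    """Rank auto-detected gallery crops by their learned distance to the probe.

    detect_persons(img)  -> list of cropped person regions (detection module).
    extract_feature(img) -> discriminative feature vector (identification module).
    metric(a, b)         -> learned distance between two feature vectors.
    """
    probe_feat = extract_feature(probe_img)
    candidates = []
    for scene in scene_imgs:
        for crop in detect_persons(scene):       # auto-detected bounding boxes
            candidates.append((crop, extract_feature(crop)))
    # A smaller learned distance means a more similar pair; keep the k best.
    ranked = sorted(candidates, key=lambda c: metric(probe_feat, c[1]))
    return [crop for crop, _ in ranked[:k]]
```

Because the gallery features depend only on the gallery images, they can be computed once and cached, which is also why a newcomer can be matched without retraining the network.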
The remainder of this paper is organised as follows: Section 2 reviews the issues of person re-identification; Section 3 presents the proposed system consisting of two modules, person detection and person re-identification, with the details of each module also introduced; Section 4 presents the network parameters and experimental results of the proposed system and discusses the performance with various feature vectors; and finally, Section 5 concludes the paper.

4. Experimental Results

We evaluated the performance of the proposed identification system with different feature combinations. The network parameters and experimental setup are first introduced, including the dataset and the computing environment. The feature combinations and experiments are then introduced. Finally, the performance measurement and quantitative results are shown.

4.1. Experimental Setup and Dataset

In the pedestrian detection network, an input frame of arbitrary size is first resized to 563 × 1000 pixels. The shared convolutional layers were configured as follows: the first convolutional layer used 96 kernels of size 7 × 7 pixels with a stride of two pixels and a padding of three pixels, producing feature maps of size 282 × 500 × 96. The feature map size was unchanged after the LRN layer and downsampled to 142 × 251 × 96 after a max pooling layer with a 3 × 3 kernel, a stride of two pixels and a padding of one pixel. The feature maps output by the fifth convolutional layer had a size of 36 × 64 × 256. In the ROI pooling layer, the parameters pool_w and pool_h were set to 6, so a fixed 6 × 6 output was produced for every proposal regardless of its spatial scale. Besides this, in the identification network, the input size was 227 × 227 × 3, and the output sizes of FC6, FC7 and FC8 were 4096, 4096 and 2048, respectively.
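These feature-map sizes follow standard convolution output arithmetic; the small helper below reproduces them. The Caffe-style ceil rounding for the pooling layer is an assumption made here to match the reported 142 × 251 size, not a detail stated in the paper.

```python
import math

def conv_out(size, kernel, stride, pad, ceil_mode=False):
    """Spatial output size of a convolution or pooling layer."""
    n = (size + 2 * pad - kernel) / stride + 1
    return math.ceil(n) if ceil_mode else math.floor(n)

# First convolutional layer: 7 x 7 kernels, stride 2, padding 3.
print(conv_out(563, 7, 2, 3), conv_out(1000, 7, 2, 3))      # 282 500
# Max pooling: 3 x 3 kernel, stride 2, padding 1 (ceil rounding assumed).
print(conv_out(282, 3, 2, 1, ceil_mode=True),
      conv_out(500, 3, 2, 1, ceil_mode=True))               # 142 251
```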
In order to evaluate the system performance, several benchmarks have been used for person re-identification. Among them, PRW [70] is a challenging dataset that includes images captured by six cameras with different scene views (Figure 5); hence, the poses of the pedestrian images vary. The images captured by the sixth camera have a resolution of 720 × 576 pixels, and those from the other five cameras 1920 × 1080 pixels. PRW comprises 11,816 RGB images with 43,110 manually labelled bounding boxes. Among them, 34,304 bounding boxes were assigned one of 932 person IDs, and the remaining boxes, for which the person’s ID could not be determined, were assigned an ID of −2. Each bounding box contains five pieces of information [i.e., x, y, w, h and s], where (x, y) is the top-left corner coordinate of the bounding box, (w, h) denotes the width and height of the bounding box, and s is the person ID (a minimal sketch of this record format follows Figure 5).
Figure 5. Example images captured by six cameras in the PRW dataset [70].
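As an illustration of the annotation format, the sketch below shows one way such [x, y, w, h, s] records could be represented; the class name and the sample values are hypothetical, not actual PRW annotations.

```python
from dataclasses import dataclass

@dataclass
class PRWBox:
    """One PRW annotation: top-left corner, box size and person ID (-2 = unknown)."""
    x: float
    y: float
    w: float
    h: float
    person_id: int

def parse_boxes(rows):
    """Convert raw [x, y, w, h, s] rows into annotation records."""
    return [PRWBox(x, y, w, h, int(s)) for x, y, w, h, s in rows]

# Illustrative rows only; the second box has the "unsure identity" ID of -2.
boxes = parse_boxes([[10.0, 20.0, 64.0, 128.0, 101], [5.0, 8.0, 60.0, 120.0, -2]])
labelled = [b for b in boxes if b.person_id != -2]   # keep only ID-labelled boxes
```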

4.2. Experiments with Various Features

All the pedestrian images in the PRW dataset were automatically cropped by the detection module. Three kinds of experiments were conducted to investigate the identification module configuration; Table 2 lists the experimental settings in detail. Note that the training images in all the experiments were obtained from cameras 1 to 6, the probe images in the test process were from camera 2, and the gallery images were from camera 3. We divided the dataset into training and test sets with disjoint person IDs. To simulate the real scenario, in which several images of a person are often captured by the cameras, and to understand how the probe and gallery sizes affect the ReID performance, the scene images captured by cameras 2 and 3 were used for the test set, because the number of matching subjects in the combination of cameras 2 and 3 is larger than in the other camera combinations in the PRW dataset.
Table 2. List of experimental settings for the identification module evaluation. CNN: convolutional neural network.
As stated in Section 3, the identification configuration was implemented in Experiment 1, and the network was trained on 306 pedestrian IDs with 7508 RGB images; Figure 6a shows example images. Experiment 2 evaluated the two-class CNN configuration, in which the network was trained with only two classes: same and different. Hence, the training images used in Experiment 1 were sampled and vertically concatenated to form the training pairs (a minimal sketch of this pair construction follows Figure 6): 87,363 pairs of the different class (different IDs) and 29,226 pairs of the same class (same ID). The test images were concatenated to form a test set of 28,224 pairs; Figure 6b shows example images used in Experiment 2. Experiment 3 was likewise designed to evaluate the two-class CNN configuration; the only difference was that, instead of RGB images, the feature maps output for each image by the fifth shared convolutional layer were vertically concatenated. Hence, the numbers of training and test pairs were the same as in Experiment 2. Figure 6c illustrates example images used in Experiment 3.
Figure 6. Training and test examples in (a) Experiment 1, (b) Experiment 2 and (c) Experiment 3.
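The following is a minimal sketch of the vertical-concatenation pair construction used in Experiments 2 and 3. It enumerates all pairs for clarity, whereas the experiments sampled the pairs; the function name and the exhaustive enumeration are illustrative choices, not the paper’s implementation.

```python
import itertools
import numpy as np

def make_pairs(images, ids):
    """Vertically concatenate image pairs and label them same (1) / different (0).

    `images` is a list of equally sized H x W x 3 arrays. In practice the
    same-ID pairs are far fewer, so the different-ID pairs are sampled.
    """
    pairs, labels = [], []
    for (img_a, id_a), (img_b, id_b) in itertools.combinations(zip(images, ids), 2):
        pairs.append(np.concatenate([img_a, img_b], axis=0))  # stack vertically
        labels.append(1 if id_a == id_b else 0)
    return pairs, labels
```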

4.3. Performance Metrics and Experimental Results

The pedestrian detection performance was measured by the detection rate and the false alarm rate. For the detection rate, a bounding box was counted as correctly detected if the overlap between the detected box and the ground truth was larger than a pre-set threshold $\lambda$:

$$\mathrm{Detection\ rate} = \frac{\sum_{t=1}^{N} \sum_{k=1}^{N_G^t} I\!\left[ \frac{|G_k^t \cap D_k^t|}{|G_k^t \cup D_k^t|} > \lambda \right]}{\sum_{i=1}^{N} N_G^i}$$

where $I[\cdot]$ is an indicator function; $N$ is the number of test images; $N_G^t$ is the number of ground-truth boxes in the $t$-th image; $G_k^t$ and $D_k^t$ are the $k$-th ground-truth box and its detected box in the $t$-th image, respectively; and $|G_k^t \cap D_k^t| / |G_k^t \cup D_k^t|$ is the IOU. If the IOU is smaller than $\lambda$, the bounding box is incorrect and is counted as a false alarm (a minimal sketch of this IoU computation follows Table 3). The higher the detection rate and the lower the false alarm rate, the better the detection performance. Table 3 shows the detection results using the images captured by cameras 2 and 3, with the threshold set to $\lambda$ = 0.5 and 0.7. The detection performance was better when $\lambda$ was set to 0.5.
Table 3. Detection results with different IOU thresholds.
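The IoU test in the equation above can be computed directly from the (x, y, w, h) boxes; the sketch below is a minimal version that assumes each ground-truth box has already been matched to one detected box, which is a simplification of the full evaluation.

```python
def iou(gt, det):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    gx, gy, gw, gh = gt
    dx, dy, dw, dh = det
    ix = max(0.0, min(gx + gw, dx + dw) - max(gx, dx))
    iy = max(0.0, min(gy + gh, dy + dh) - max(gy, dy))
    inter = ix * iy
    union = gw * gh + dw * dh - inter
    return inter / union if union > 0 else 0.0

def detection_rate(matched_pairs, total_gt, lam=0.5):
    """Fraction of ground-truth boxes whose matched detection has IoU > lambda."""
    hits = sum(1 for gt, det in matched_pairs if iou(gt, det) > lam)
    return hits / total_gt
```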
The ReID performance was evaluated with the cumulative matching characteristic (CMC, top-k), in which a match is counted as correct if at least one of the top-k gallery images has the same identity as the probe image (a minimal sketch of this metric follows Table 4). In Experiment 1, three test sets (i.e., TSet 1, TSet 2 and TSet 3) containing different numbers of images per test subject were used. In TSet 1, one image per subject (168 subjects are in Experiment 1, as shown in Table 2) was randomly selected from the scene images captured by camera 3 to form the gallery set; in TSet 2, two images per subject were randomly selected; and in TSet 3, all images of each of the 168 subjects captured by camera 3 were used. Hence, each test subject has more gallery images in the later sets, and a higher identification rate can be expected. Note that, in our work, the pedestrian images in the probe and gallery sets were detected and cropped automatically by the detection module. Table 4 lists the top-1, top-5, top-10 and top-20 identification rates with different feature vectors and distance metrics. As shown, TSet 3 performed better than TSet 1 and TSet 2: more gallery images per subject improved the performance. To investigate the detection effects in a real-world application, a multi-class CNN was also trained with the manually labelled ground-truth (GT) bounding boxes instead of the auto-detected ones. In the test process, all the probe and gallery pedestrian images used the ground-truth bounding boxes, which were input to the newly trained multi-class CNN to extract feature vectors from the FC6 layer. For TSet 3, the identification rates using the GT bounding boxes are presented in the last two rows of Table 4. A top-1 identification rate difference of approximately 11.1% between the auto-detected results and the manual labelling was observed. Hence, in practical applications, the pedestrian detection results indeed influence the identification results, and closing this gap is an important issue.
Table 4. Identification rate with different feature vectors and distance metric.
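A minimal sketch of the top-k CMC computation follows; plain Euclidean distance is used here as a stand-in for the learned metric, so the function is illustrative rather than the paper’s evaluation code.

```python
import numpy as np

def cmc_top_k(probe_feats, probe_ids, gallery_feats, gallery_ids, k=1):
    """Fraction of probes with at least one same-ID gallery image in the top k."""
    hits = 0
    for feat, pid in zip(probe_feats, probe_ids):
        # Euclidean distance stands in for the learned distance metric.
        dists = np.linalg.norm(np.asarray(gallery_feats) - feat, axis=1)
        top_k_ids = np.asarray(gallery_ids)[np.argsort(dists)[:k]]
        hits += int(pid in top_k_ids)
    return hits / len(probe_ids)
```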
Experiments 2 and 3 were used to investigate the network configuration. The input to the network was the concatenated RGB image pair in Experiment 2 and the concatenated feature maps in Experiment 3. The softmax layer output the probability of the pair belonging to the same class; the higher the output value, the more likely the input pair was formed from the same identity. For Experiments 2 and 3, a two-class CNN was trained individually to determine whether the input belonged to the same or the different class. The hyperparameters specifying the network configuration, such as the number of layers, kernel size and number of kernels, were the same as those used in Experiment 1, except for the number of output nodes in the softmax layer. In the test process, each probe image was concatenated with the 168 gallery images, and the resulting 168 probability values were sorted (a minimal sketch of this protocol follows Table 5). The probe image was correctly verified if the concatenation of the probe and gallery images with the same ID was retrieved in the top-k results. Table 5 lists the top-1 identification rates in Experiments 2 and 3, together with the best result from Experiment 1 for comparing the feature extraction performances. The results show that Experiment 3 performed worst, with a very low top-1 identification rate of only 4.76%. We believe this is because the input in Experiment 3 was the concatenated feature maps from the fifth shared convolutional layer of the detection network, which contain too little subject information for identification, as shown in Figure 6c. For Experiments 2 and 3, the amount of data obtained by combining images from either the same or different classes is large and the contents of the combinations are diverse, which made the two-class CNN models difficult to train. In analysing the network configuration for the identification module, the performance of the multi-class identification network (Experiment 1) was significantly higher than that of the two-class CNN (Experiment 2). The experimental results with different training/test settings from other studies [16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66] are listed below Table 5.
Table 5. List of the identification rate in Experiment 1, Experiment 2 and Experiment 3 for the PRW dataset.
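A minimal sketch of the two-class verification protocol follows. The function same_prob is a hypothetical wrapper around the trained two-class CNN: it takes a vertically concatenated probe/gallery pair and returns the softmax probability of the “same” class.

```python
import numpy as np

def verify_probe(probe_img, gallery_imgs, gallery_ids, same_prob, k=1):
    """Rank gallery images by the two-class CNN's 'same' probability."""
    pairs = [np.concatenate([probe_img, g], axis=0) for g in gallery_imgs]
    probs = np.array([same_prob(p) for p in pairs])
    order = np.argsort(-probs)                  # highest probability first
    return [gallery_ids[i] for i in order[:k]]  # IDs of the top-k matches
```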
Figure 7 and Figure 8 show the successful and failure cases in Experiment 1 and Experiment 2, respectively. In the failure case in Figure 7, no instance of the same identity is retrieved in the top-k results, while in Figure 8, the same pair is not retrieved in the top-k results.
Figure 7. Examples of the identification results in Experiment 1 (identification network): (a) successful and (b) failure cases.
Figure 8. Examples of the identification results in Experiment 2 (identification network): (a) successful and (b) failure cases.

5. Conclusions

Person search is a practical application comprising two kernel parts, person detection and re-identification, and it is more challenging than person ReID because most existing ReID studies only process and classify pedestrian images enclosed within given bounding boxes, ignoring the fact that imperfect bounding boxes and background clutter are inevitable. For practical surveillance scenarios, where scene images are captured by two or more cameras, this study proposed an integrated network consisting of a detection module and an identification module to implicitly consider the effect of the detection results. Two images captured by non-overlapping cameras were input to the Faster R-CNN detection network. A multi-class CNN was trained with the auto-detected bounding boxes from the detection module, instead of manually cropped data, to extract discriminative features. The outputs of the FC layers of the multi-class CNN serve as the discriminative features, and the similarity of two feature vectors is calculated via a learned distance metric. In addition, the proposed framework is flexible: when a new person is detected, the feature vector is extracted by the multi-class CNN and compared with the other images in the dataset without retraining the network. The experimental results show that the multi-class CNN for the identification module provides a 62.7% accuracy, which is significantly higher than that of the two-class CNN. Moreover, further experiments were designed to apply shared features to improve computational efficiency; however, the identification results were not satisfactory because the feature maps of the detection network contain too little and too unclear information about the subjects. In the future, we would like to design a search strategy for real-world applications and re-design the network configuration so that the detection and identification modules can share convolutional feature maps for computational efficiency.

Author Contributions

Conceptualization, C.-F.W., J.-C.C. and C.-H.C.; Methodology, C.-F.W. and J.-C.C.; Software, C.-H.C. and C.-R.L.; Writing—original draft, C.-F.W.; Writing—review & editing, J.-C.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Ministry of Science and Technology of Taiwan, R.O.C., under grant No. 108-2221-E-992-032-.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Gan, W.; Lin, J.C.W.; Fournier-Viger, P.; Chao, H.C.; Yu, P.S. HUOPM: High-utility occupancy pattern mining. IEEE Trans. Cybern. 2019, 1–14. [Google Scholar] [CrossRef] [PubMed]
  2. Lin, J.C.W.; Yang, L.; Fournier-Viger, P.; Hong, T.P. Mining of skyline patterns by considering both frequent and utility constraints. Eng. Appl. Artif. Intell. 2019, 77, 229–238. [Google Scholar] [CrossRef]
  3. Gan, W.; Lin, J.C.W.; Fournier-Viger, P.; Chao, H.C.; Yu, P.S. A survey of parallel sequential pattern mining. ACM Trans. Knowl. Discov. Data (TKDD) 2019, 13, 1–34. [Google Scholar] [CrossRef]
  4. Zou, Z.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. arXiv 2019, arXiv:1905.05055. [Google Scholar]
  5. Bouindour, S.; Snoussi, H.; Hittawe, M.M.; Tazi, N.; Wang, T. An on-line and adaptive method for detecting abnormal events in videos using spatio-temporal ConvNet. Appl. Sci. 2019, 9, 757. [Google Scholar] [CrossRef]
  6. Wang, M.; Deng, W. Deep face recognition: A survey. arXiv 2019, arXiv:1804.06655. [Google Scholar]
  7. Xu, Y.; Ma, B.; Huang, R.; Lin, L. Person search in a scene by jointly modeling people commonness and person uniqueness. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014. [Google Scholar]
  8. Lan, X.; Zhu, X.; Gong, S. Person search by multi-scale matching. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2018; pp. 536–552. [Google Scholar]
  9. Xiao, T.; Li, S.; Wang, B.; Lin, L.; Wang, X. Joint detection and identification feature learning for person search. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3415–3424. [Google Scholar]
  10. Buda, M.; Maki, A.; Mazurowski, M.A. A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 2018, 106, 249–259. [Google Scholar] [CrossRef]
  11. Liao, S.; Hu, Y.; Zhu, X.; Li, S.Z. Person re-identification by local maximal occurrence representation and metric learning. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 2197–2206. [Google Scholar]
  12. Koestinger, M.; Hirzer, M.; Wohlhart, P.; Roth, P.M.; Bischof, H. Large scale metric learning from equivalence constraints. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 2288–2295. [Google Scholar]
  13. Li, W.; Zhao, R.; Xiao, T.; Wang, X. Deepreid: Deep filter pairing neural network for person re-identification. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 152–159. [Google Scholar]
  14. Ahmed, E.; Jones, M.; Marks, T.K. An improved deep learning architecture for person re-identification. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3908–3916. [Google Scholar]
  15. Yi, D.; Lei, Z.; Liao, S.; Li, S.Z. Deep metric learning for person re-identification. In Proceedings of the 2014 22nd International Conference on Pattern Recognition, Stockholm, Sweden, 24–28 August 2014; pp. 34–39. [Google Scholar]
  16. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Neural Inf. Process. Syst. 2012. [Google Scholar] [CrossRef]
  17. Hoang, T.; Do, T.; Tan, D.; Cheung, N. Selective deep convolutional features for image retrieval. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017. [Google Scholar]
  18. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  19. Glasmachers, T. Limits of end-to-end learning. Proc. Mach. Learn. Res. 2017, 77, 17–32. [Google Scholar]
  20. Varior, R.R.; Haloi, M.; Wang, G. Gated Siamese convolutional neural network architecture for human reidentification. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016. [Google Scholar]
  21. Xiao, T.; Li, H.; Ouyang, W.; Wang, X. Learning deep feature representations with domain guided dropout for person re-identification. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  22. Zheng, L.; Yang, Y.; Hauptmann, A.G. Person reidentification: Past, present and future. arXiv 2016, arXiv:1610.02984. [Google Scholar]
  23. Zheng, Z.; Zheng, L.; Yang, Y. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  24. Zhuo, J.; Chen, Z.; Lai, J.; Wang, G. Occluded person reidentification. arXiv 2018, arXiv:1804.02792. [Google Scholar]
  25. Wang, Y.; Wang, L.; You, Y.; Zou, X.; Chen, V.; Li, S.; Huang, G.; Hariharan, B.; Weinberger, K.Q. Resource aware person re-identification across multiple resolutions. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8042–8051. [Google Scholar]
  26. Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; Yang, Y. Random erasing data augmentation. arXiv 2017, arXiv:1708.04896. [Google Scholar]
  27. Li, D.; Chen, X.; Zhang, Z.; Huang, K. Learning deep context-aware features over body and latent parts for person re-identification. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  28. Su, C.; Li, J.; Zhang, S.; Xing, J.; Gao, W.; Tian, Q. Pose-driven deep convolutional model for person re-identification. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  29. Zhao, L.; Li, X.; Wang, J.; Zhuang, Y. Deeply-learned part-aligned representations for person re-identification. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  30. McLaughlin, N.; del Rincon, J.M.; Miller, P. Data augmentation for reducing dataset bias in person reidentification. In Proceedings of the 2015 12th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Karlsruhe, Germany, 25–28 August 2015. [Google Scholar]
  31. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. In Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  32. Gray, D.; Brennan, S.; Tao, H. Evaluating appearance models for recognition, reacquisition, and tracking. Int. Workshop Perform. Eval. Track. Surveill. 2007, 3, 1–7. [Google Scholar]
  33. Hirzer, M.; Beleznai, C.; Roth, P.M.; Bischof, H. Person re-identification by descriptive and discriminative classification. In Image Analysis; Springer: Berlin/Heidelberg, Germany, 2011. [Google Scholar]
  34. Li, W.; Wang, X. Locally aligned feature transforms across views. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013. [Google Scholar]
  35. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  36. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  37. Yu, F.; Li, W.; Li, Q.; Liu, Y.; Shi, X.; Yan, J. Poi: Multiple object tracking with high performance detection and appearance feature. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 36–42. [Google Scholar]
  38. Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1627–1645. [Google Scholar] [CrossRef] [PubMed]
  39. Liu, H.; Feng, J.; Jie, Z.; Jayashree, K.; Zhao, B.; Qi, M.; Jiang, J.; Yan, S. Neural person search machines. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  40. Zheng, W.S.; Gong, S.; Xiang, T. Re-identification by relative distance comparison. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 653–668. [Google Scholar] [CrossRef] [PubMed]
  41. Davis, J.V.; Kulis, B.; Jain, P.; Sra, S.; Dhillon, I.S. Information-theoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, USA, 20–24 June 2007; pp. 209–216. [Google Scholar]
  42. Gray, D.; Tao, H. Viewpoint invariant pedestrian recognition with an ensemble of localized features. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2008; Volume 5302, pp. 262–275. [Google Scholar]
  43. Farenzena, M.; Bazzani, L.; Perina, A.; Murino, V.; Cristani, M. Person re-identification by symmetry-driven accumulation of local features. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 2360–2367. [Google Scholar]
  44. Yang, Y.; Yang, J.; Yan, J.; Liao, S.; Yi, D.; Li, S.Z. Salient color names for person re-identification. Eur. Conf. Comput. Vis. 2014, 8689, 536–551. [Google Scholar]
  45. Kviatkovsky, I.; Adam, A.; Rivlin, E. Color invariants for person reidentification. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1622–1634. [Google Scholar] [CrossRef]
  46. Liu, Y.; Zhang, D.; Lu, G.; Ma, W.Y. Region-based image retrieval with high-level semantic color names. In Proceedings of the 11th International Multimedia Modelling Conference, Melbourne, Australia, 12–14 January 2005; pp. 180–187. [Google Scholar]
  47. Kuo, C.H.; Khamis, S.; Shet, V. Person re-identification using semantic color names and rankboost. In Proceedings of the 2013 IEEE Workshop on Applications of Computer Vision (WACV), Tampa, FL, USA, 15–17 January 2013; pp. 281–287. [Google Scholar]
  48. Weinberger, K.Q.; Saul, L.K. Fast solvers and efficient implementations for distance metric learning. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008; pp. 1160–1167. [Google Scholar]
  49. Zhong, Z.; Zheng, L.; Zheng, Z.; Li, S.; Yang, Y. Camera style adaptation for person re-identification. arXiv 2017, arXiv:1711.10295. [Google Scholar]
  50. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  51. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005. [Google Scholar]
  52. Varior, R.R.; Shuai, B.; Lu, J.; Xu, D.; Wang, G. A siamese long short-term memory architecture for human reidentification. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016. [Google Scholar]
  53. Cheng, D.; Gong, Y.; Zhou, S.; Wang, J.; Zheng, N. Person re-identification by multi-channel parts-based CNN with improved triplet loss function. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1335–1344. [Google Scholar]
  54. Hermans, A.; Beyer, L.; Leibe, B. In defense of the triplet loss for person re-identification. arXiv 2017, arXiv:1703.07737. [Google Scholar]
  55. Wang, G.C.; Lai, J.H.; Xie, X.H. P2snet: Can an image match a video for person re-identification in an end-to-end way? IEEE Trans. Circuits Syst. Video Technol. 2017, 28, 2777–2787. [Google Scholar] [CrossRef]
  56. Wu, S.; Chen, Y.-C.; Li, X.; Wu, A.C.; You, J.J.; Zheng, W.S. An enhanced deep feature representation for person re-identification. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–10 March 2016. [Google Scholar]
  57. Shen, Y.; Lin, W.; Yan, J.; Xu, M.; Wu, J.; Wang, J. Person re-identification with correspondence structure learning. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
  58. Zheng, W.S.; Li, X.; Xiang, T.; Liao, S.; Lai, J.; Gong, S. Partial person re-identification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 4678–4686. [Google Scholar]
  59. Zhao, R.; Ouyang, W.; Wang, X. Unsupervised salience learning for person re-identification. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013. [Google Scholar]
  60. Wei, L.; Zhang, S.; Yao, H.; Gao, W.; Tian, Q. Glad: Global-local-alignment descriptor for pedestrian retrieval. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017. [Google Scholar]
  61. Li, W.; Zhu, X.; Gong, S. Harmonious attention network for person re-identification. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; p. 2285. [Google Scholar]
  62. Girshick, R. Fast R-CNN. In International Conference on Computer Vision; Springer: Cham, Switzerland, 2015. [Google Scholar]
  63. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  64. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  65. Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2014; pp. 818–833. [Google Scholar]
  66. Wu, J. Introduction to Convolutional Neural Networks; National Key Lab for Novel Software Technology: Nanjing, China, 2017. [Google Scholar]
  67. Weber, B. Generic Object Detection Using Adaboost; Department of Computer Science University of California: Santa Cruz, CA, USA, 2008. [Google Scholar]
  68. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  69. Uijlings, J.R.; van de Sande, K.E.; Gevers, T.; Smeulders, A.W. Selective search for object recognition. Int. Conf. Comput. Vis. 2013, 104, 154–171. [Google Scholar] [CrossRef]
  70. Zheng, L.; Zhang, H.; Sun, S.; Chandraker, M.; Yang, Y.; Tian, Q. Person re-identification in the wild. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  71. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
  72. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
