Abstract
Pedestrian detection in complex scenes suffers from occlusion, such as occlusions between pedestrians. Compared with the high variability of the human body, the shape of the head and shoulders changes little and is highly stable, so head detection has become an important research direction within pedestrian detection. The translational invariance of convolutional neural networks allows a deep network to recognize a target even when its appearance and location change; however, the lack of scale invariance and the high miss rate for small targets remain problems. In this paper, a feature extraction network, DR-Net, based on Darknet-53 is proposed to improve the information transmission rate between convolutional layers and to extract richer semantic information. In addition, an MDC (mixed dilated convolution) module combining dilated convolutions with different sampling rates is embedded to improve the detection rate of small targets. We evaluated our method on three publicly available datasets and achieved excellent results: the AP (Average Precision) reached 92.1% on the Brainwash dataset, 84.8% on the HollywoodHeads dataset, and 90% on the SCUT-HEAD dataset.
1. Introduction
Pedestrian detection is an important research area in computer vision and underpins many related applications, including person re-identification [1], pedestrian tracking [2], and autonomous driving [3,4]. Generic pedestrian detection performs poorly in crowded and heavily occluded scenes, so head detection methods have emerged; crowd counting in surveillance video [5,6] is considered one of the most important applications of head detection. Object detection has made great progress with the development of CNNs and deep learning, but head detection is still a daunting task in complex crowd scenes [7], where the objects are highly diverse, strongly occluded, blurred by motion, low in resolution, and poor in distinctive features.
In recent years, many head detectors based on deep learning have emerged [8,9], and head detection is now treated by most approaches as a special type of object detection. Since the scale and appearance of heads vary, extracting features effectively to localize a head and to separate it from the background is still a challenge. One of the better performers is the RCNN [10]-based head detector of [9], which uses two models for head detection: a global model that produces a multi-scale heat map of the probability of head presence, and a local model that uses selective search (SS) proposals to limit the set of target hypotheses; the RCNN head detection framework is completed by combining cues from both models. Hariharan et al. [11] used hypercolumns, the stacked activations of all network layers corresponding to a pixel, as features for fine-grained localization of targets. SSD [12] and YOLO [13] use multi-scale features to predict class probabilities and bounding box coordinates; YOLO uses a residual network (ResNet) [14]-style backbone for feature extraction together with a spatial feature pyramid for contextual feature fusion, and both detectors are much faster than Faster RCNN [15]. However, their accuracy on small objects is still poor.
In general, head detection in complex scenes still faces many challenges because the scale, appearance, and pose of human heads vary widely. Therefore, we propose an improved feature extraction network, DR-Net, based on Darknet-53 [13]. Its principle is that DenseNet [16] blocks are embedded in Darknet-53: identity mappings pass the outputs of earlier layers forward, and all preceding layers are concatenated along the channel dimension so that features can be reused. Compared with ResNet, this improves the backpropagation of gradients, makes better use of feature information, and improves the rate of information transmission between layers.
2. Related Works
2.1. Image Feature Extraction Network
For object detection, the extraction of image features is crucial for model training. Feature extraction has evolved from early hand-crafted descriptors, such as the scale-invariant feature transform (SIFT) [17] and the histogram of oriented gradients (HOG), to convolutional neural networks. In 1998, LeCun [18] proposed LeNet-5, an epoch-making and far-reaching convolutional neural network structure that combined convolution with the neural network and introduced the two new concepts of convolution and pooling. Later, with the development of deep learning theory, Alex Krizhevsky proposed the heavyweight network AlexNet [19], an eight-layer deep convolutional neural network that substantially improved classification accuracy on the ImageNet dataset. A key breakthrough was that the network was implemented on GPUs, which greatly shortened training time. Since AlexNet, researchers have proposed various convolutional neural network models with increasingly better performance and different network structures; the most famous are VGGNet [20], GoogLeNet [21], and ResNet. Darknet-53 is a fully convolutional network that borrows the concept of ResNet, using a large number of residual modules with skip connections and strided convolutions instead of pooling for downsampling, so that features are extracted fully and the negative effect of pooling on gradients is reduced.
DenseNet is a densely connected network that builds on the earlier idea of shortcut-connected networks. DenseNet connects all layers, i.e., the outputs of all preceding layers are concatenated to form the current layer's input, significantly improving the transmission of information through the network. In this paper, we combine DenseNet and ResNet to create a novel feature extraction network that increases the depth of the network model and improves the flow of information.
2.2. Head Detection
Head detection is one of the important research directions in computer vision, and most related works treat it as a specific type of object detection. Traditional head detection methods use hand-designed, complex feature operators to extract features and then use Support Vector Machines (SVM) [22], Adaptive Boosting (AdaBoost) [23], Deformable Part Models (DPM) [24], and other algorithms to classify the extracted features. The DPM proposed in [24] utilizes the HOG feature [25] and has been used extensively in the field of object detection. However, traditional methods have poor portability and unsatisfactory performance and often fail to meet detection requirements in realistic scenarios.
In recent years, convolutional neural networks have made great progress in object detection, and deep neural networks have become the preferred method for object detection tasks. Current mainstream CNN object detection methods can be divided into two categories: single-stage and two-stage frameworks. Single-stage frameworks such as YOLO and SSD use regression to generate bounding boxes and classify them by scoring the bounding box categories. Two-stage frameworks include RCNN and its improved versions Fast RCNN [26], Faster RCNN, and SPPNet [27]. RCNN uses candidate region proposals to create regions of interest (ROI) for object detection; the proposals generated by the selective search (SS) [28] method are then sent to a feedforward network. Shaoqing Ren et al. continued to innovate on Fast RCNN and proposed the Faster RCNN model, which replaces the SS algorithm for extracting candidate regions with a region proposal network (RPN). YOLO outperforms Faster RCNN in detection speed, but at the cost of accuracy.
Although the abovementioned models have achieved significant performance in classifying multiple objects in images, they struggle to identify tiny targets because most of them use only the features of the last convolutional layer for detection, and the final convolutional layer contains insufficient information about tiny targets. Since heads are usually small (often only 10 to 20 pixels) in the head detection problem, such methods in their current form are not well suited to detecting them.
3. Proposed Methodology
3.1. DR-Net: ResNet Combined with DenseNet for Feature Extraction Network
The image feature extraction network of YOLOv3 is primarily based on the concept of the residual network. Darknet-53 uses multiple residual modules to form a ResNet with a large number of parameters, which consumes most of YOLOv3's computing resources. Using ResNet for image feature extraction is a common approach, but the deep networks it builds are computationally intensive, which reduces computational speed. DenseNet passes the outputs of earlier layers forward through identity mappings and concatenates all preceding layers along the channel dimension so that features are reused. Compared with ResNet, DenseNet improves information efficiency and gradient transmission: each layer obtains a gradient directly from the loss function and receives the input signal directly, so a deeper network can be trained. Moreover, this connectivity also has a regularizing effect. Other networks aim to improve performance through depth and width, while DenseNet improves performance from the perspective of feature reuse. The structure of DenseNet is shown in Figure 1.
Figure 1.
The structure of DenseNet.
An obvious difference between DenseNet and ResNet is that ResNet combines features by element-wise addition through shortcuts, while DenseNet combines them by concatenation. Each layer in DenseNet receives the outputs of all preceding layers as its input. In Figure 1, $x_0$, $x_1$, $x_2$, and $x_3$ represent the output feature maps, while $H_1$, $H_2$, $H_3$, and $H_4$ are nonlinear transformations. A common convolutional network with $L$ layers has $L$ connections, while an $L$-layer DenseNet contains $L(L+1)/2$ connections, with each layer linked to all subsequent layers. Thus, each layer receives the feature maps of all preceding layers. The relationship between the feature maps of DenseNet's layers is shown in Equation (1):

$$x_{l} = H_{l}\big([x_{0}, x_{1}, \ldots, x_{l-1}]\big) \tag{1}$$

where $[x_{0}, x_{1}, \ldots, x_{l-1}]$ denotes the channel-wise concatenation of the feature maps produced in layers $0$ to $l-1$.
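To make the dense connection in Equation (1) concrete, the following PyTorch sketch contrasts residual addition with dense concatenation (this is our illustration, not the authors' released code; the channel sizes are placeholder assumptions):

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One dense layer H_l: takes the concatenation of all earlier
    feature maps and produces `growth` new channels (Equation (1))."""
    def __init__(self, in_channels, growth):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, growth, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(growth),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, features):            # features: list [x_0, ..., x_{l-1}]
        x = torch.cat(features, dim=1)      # channel-wise concatenation
        return self.conv(x)

# ResNet-style fusion adds feature maps; DenseNet-style fusion concatenates them.
x0 = torch.randn(1, 64, 52, 52)
x1 = DenseLayer(in_channels=64, growth=32)([x0])       # x_1 = H_1([x_0])
x2 = DenseLayer(in_channels=64 + 32, growth=32)([x0, x1])  # x_2 = H_2([x_0, x_1])
print(x2.shape)                                        # torch.Size([1, 32, 52, 52])
```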
The DR-Net proposed in this paper is an image feature extraction network based on the combination of DenseNet and ResNet. The DBL (Conv2d-BN-LeakyReLU) module is composed of a convolution, Batch Normalization, and Leaky ReLU. Two DBL modules constitute the Double-DBL (D-DBL) module, which is used as the transport layer. The detailed composition of the transport layer is shown in Table 1.
Table 1.
The internal channel information from the transport layer.
We use 1 × 1 convolutions to compress the number of channels and 3 × 3 convolutions to expand it. However, the original, unpruned DenseBlocks contain 6, 12, 24, and 32 convolutional layers, respectively; such a deep DenseNet produces redundant feature maps and slows down detection. Considering the combination with the residual blocks, we instead set up four-layer DenseBlocks. Figure 2 shows the structures of DenseBlock1 and DenseBlock2, where DenseBlock1 adds 128 feature maps and DenseBlock2 adds 256 feature maps per layer. After the identity mappings concatenate all of the channels, a 1 × 1 convolution is used for dimensionality reduction, decreasing the number of feature maps passed to the subsequent layers and reducing computational cost. The specific structure of DR-Net is shown in Figure 3. The proposed DR-Net reduces the network's reliance on residual modules and obtains more semantic information. To better illustrate the internal workings of DR-Net, Table 2 records the specific parameter changes applied to the original image during feature extraction.
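As a rough illustration of these building blocks, the sketch below assembles a DBL module, a D-DBL transport layer, and a four-layer DenseBlock; the channel widths and growth rates shown are assumptions for illustration, not the exact values of Table 1 and Table 2:

```python
import torch
import torch.nn as nn

def dbl(in_ch, out_ch, k):
    """DBL module: convolution + Batch Normalization + Leaky ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

def d_dbl(in_ch, growth):
    """Double-DBL transport layer: 1x1 DBL to compress, 3x3 DBL to expand."""
    return nn.Sequential(dbl(in_ch, growth // 2, 1), dbl(growth // 2, growth, 3))

class DenseBlock(nn.Module):
    """Four-layer DenseBlock: each D-DBL sees the concatenation of all
    previous outputs; a final 1x1 DBL reduces the dimensionality."""
    def __init__(self, in_ch, growth, out_ch):
        super().__init__()
        self.layers = nn.ModuleList(
            d_dbl(in_ch + i * growth, growth) for i in range(4)
        )
        self.reduce = dbl(in_ch + 4 * growth, out_ch, 1)

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return self.reduce(torch.cat(feats, dim=1))

block = DenseBlock(in_ch=256, growth=128, out_ch=256)     # DenseBlock1-like sizes
print(block(torch.randn(1, 256, 52, 52)).shape)           # torch.Size([1, 256, 52, 52])
```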
Figure 2.
The internal parameters and structure of DenseBlock.
Figure 3.
The structure of the DR-Net feature extraction network.
Table 2.
Parameters of the DR-Net feature extraction network.
Some randomly selected feature maps are shown in Figure 4. Deep feature maps have lower resolution and more abstract semantic information, which makes them difficult for humans to compare and analyze; therefore, shallow feature maps, which retain more of the original features, are used for visualization and analysis. The left side shows the original input image, the middle shows shallow feature maps extracted by our DR-Net, and the right side shows shallow feature maps extracted by Darknet-53. We use pseudo-color to visualize the feature maps: the brighter the color, the more the network attends to that region. The visualization shows that the feature maps extracted by DR-Net are brighter in the classroom area where the students are located, while those of Darknet-53 are somewhat darker, indicating that more detailed information is available in our DR-Net.
Figure 4.
Comparison of DR-Net and Darknet-53 partial shallow feature map visualization.
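A minimal sketch of this kind of pseudo-color feature map visualization is given below (our illustration; `backbone` stands for any feature extractor, such as DR-Net or Darknet-53 truncated at a shallow layer, which is an assumption about how the maps in Figure 4 were produced):

```python
import torch
import matplotlib.pyplot as plt

def show_shallow_features(backbone, image, n_maps=8, cmap="jet"):
    """Run the image through the backbone, take the returned shallow feature
    map, and display a few channels in pseudo-color (brighter = stronger response)."""
    backbone.eval()
    with torch.no_grad():
        feats = backbone(image.unsqueeze(0))      # assumed output shape: (1, C, H, W)
    fig, axes = plt.subplots(1, n_maps, figsize=(2 * n_maps, 2))
    for i, ax in enumerate(axes):
        ax.imshow(feats[0, i].cpu(), cmap=cmap)   # one channel per panel
        ax.axis("off")
    plt.show()
```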
3.2. Mixed Dilated Convolution
Although the human head-and-shoulder region has minimal shape variation and high stability, the head is a small target compared with other objects in an image and occupies a small proportion of the pixels. Feature Pyramid Networks (FPN) [29] mainly address the model's deficiency in handling multi-scale variation in object detection tasks, and they do so with a very small computational increase. In addition, dilated convolution [30] can extract feature maps with different semantic information according to different dilated rates. After the feature fusion in the FPN part of YOLOv3, mixed dilated convolution is used to further fuse context information and to expand the receptive field of the feature maps for better detection of small targets. From [31], it is known that stacking dilated convolutions can produce a gridding effect, which skips some pixels and breaks the continuity of information. The stacked dilated rates therefore need to satisfy Equation (2):

$$M_{i} = \max\left[M_{i+1} - 2r_{i},\; 2r_{i} - M_{i+1},\; r_{i}\right] \tag{2}$$

where $r_i$ is the dilated rate of layer $i$ and $M_i$ is the maximum dilated rate usable at layer $i$ (the maximum distance between two nonzero values). Assuming that there are $n$ layers in total, $M_n = r_n$, and the design goal is to satisfy $M_2 \le K$, where $K$ is the kernel size; a simple example from [31] for 3 × 3 kernels is the rate sequence $[1, 2, 5]$. Therefore, in this section, we propose an MDC structure composed of dilated convolutions whose dilated rates are chosen to satisfy this constraint. The MDC structure shown in Figure 5 is the module placed after the first FPN fusion. In this module, we use three dilated convolutions with different dilated rates in parallel, and the dilated rate determines the size of the corresponding receptive field. First, we compress the number of channels of the fused feature map $F$ with a 1 × 1 convolution to obtain the feature map $F_0$. Dilated convolutions with the different dilated rates are then applied to $F_0$ to obtain $F_1$, $F_2$, and $F_3$. Finally, we concatenate $F_0$, $F_1$, $F_2$, and $F_3$ to obtain the output feature map $F_{out}$. The model thus receives information from receptive fields of several sizes, which is fused to extract richer semantic information, in particular for small targets.
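One possible PyTorch realization of the MDC module described above is sketched here (our illustration; the channel counts and the default rates (1, 2, 5) are assumptions, and `hdc_ok` implements the gridding constraint of Equation (2) from [31]):

```python
import torch
import torch.nn as nn

def hdc_ok(rates, kernel=3):
    """Check the gridding constraint of Equation (2): M_2 <= kernel size."""
    M = rates[-1]                          # M_n = r_n
    for r in reversed(rates[1:-1]):        # compute M_{n-1}, ..., M_2
        M = max(M - 2 * r, 2 * r - M, r)
    return M <= kernel

class MDC(nn.Module):
    """Mixed dilated convolution: compress with a 1x1 convolution, sample with
    three parallel dilated 3x3 convolutions, then concatenate all four branches."""
    def __init__(self, in_ch, mid_ch, rates=(1, 2, 5)):
        super().__init__()
        assert hdc_ok(rates), "dilated rates would cause a gridding effect"
        self.compress = nn.Conv2d(in_ch, mid_ch, kernel_size=1)
        self.branches = nn.ModuleList(
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        )

    def forward(self, x):
        f0 = self.compress(x)                          # F_0
        outs = [f0] + [b(f0) for b in self.branches]   # F_0, F_1, F_2, F_3
        return torch.cat(outs, dim=1)                  # F_out

mdc = MDC(in_ch=512, mid_ch=128)
print(mdc(torch.randn(1, 512, 52, 52)).shape)          # torch.Size([1, 512, 52, 52])
```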
Figure 5.
The structure of MDC.
We integrate two MDC modules in front of the YOLO heads. When the connections of the spatial feature pyramid are complete, the four context-aware feature maps are fused by MDC and then sent to the YOLO head for detection. Our proposed MDC-based spatial feature pyramid structure is shown in Figure 6. Extracting image features with DR-Net yields feature maps with rich semantic information, but target localization from deep layers alone is not accurate, so deep semantic information is fused with shallow layers by upsampling to obtain both detailed information and rich semantic information. The MDC module then samples and fuses the feature maps with dilated convolutions of different dilated rates to expand the receptive field and exploit more fine-grained feature information.
Figure 6.
The complete network structure diagram. The features are first extracted by DR-Net; the FPN is then completed by a series of upsampling and fusion operations, and the fused features are sent to MDC to obtain a larger receptive field.
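As a rough sketch of the upsample-and-concatenate fusion in Figure 6 (shapes assume a 416 × 416 input; the channel counts are illustrative assumptions, not the authors' exact configuration):

```python
import torch
import torch.nn as nn

# Three feature scales from the backbone for a 416x416 input.
c3 = torch.randn(1, 256, 52, 52)    # shallow, fine detail
c4 = torch.randn(1, 512, 26, 26)    # middle scale
c5 = torch.randn(1, 1024, 13, 13)   # deep, rich semantics

up = nn.Upsample(scale_factor=2, mode="nearest")

# Deep semantics are upsampled and concatenated with shallower maps, so each
# fused map carries both detailed and semantic information before MDC and the YOLO head.
p4 = torch.cat([up(c5), c4], dim=1)   # (1, 1536, 26, 26) -> MDC -> YOLO head
p3 = torch.cat([up(c4), c3], dim=1)   # (1, 768, 52, 52)  -> MDC -> YOLO head
print(p4.shape, p3.shape)
```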
3.3. K-Means for Anchor Boxes
YOLOv3 adopts the anchor box idea of Faster RCNN: a set of initial candidate boxes with fixed widths and heights. The selection of the initial anchor boxes directly affects both accuracy and detection speed. Instead of selecting anchor boxes manually, YOLOv3 runs K-means clustering on the dataset to find prior boxes automatically. K-means is the most commonly used clustering algorithm, and its main idea is as follows: given the value of K and K initial cluster centroids, each point is allocated to the nearest cluster centroid. After all points are allocated, the cluster centroids are recalculated from the points within each cluster. The assignment and update steps are then repeated until the change in the cluster centroids is minimal or the specified number of iterations is reached. The clusters generated by K-means reflect the sample distribution of each dataset, making good predictions easier for the network. Since the target we detect is the human head, which is small relative to the other objects and the background in an image, the original anchor boxes of YOLOv3 do not suit the current scene, so we re-cluster the anchor boxes.
In this paper, we use IOU instead of Euclidean distance as the clustering metric, so that ground-truth boxes close to a centroid have higher IOU values. The distance used to cluster the anchor boxes is calculated using Equation (3):

$$d(\text{box}, \text{centroid}) = 1 - \text{IOU}(\text{box}, \text{centroid}) \tag{3}$$
We calculate the IOU of two boxes, i.e., their similarity: the smaller $d$ is, the more similar the box is to the centroid. Each box is assigned to its nearest centroid, the centroids are then updated, and the process is repeated until the change in the cluster centroids is slight or the specified number of iterations is reached. IOU is the intersection over union, defined in Equation (4):
$$\text{IOU} = \frac{\text{area}(B_{p} \cap B_{gt})}{\text{area}(B_{p} \cup B_{gt})} \tag{4}$$

where $\text{area}(B_{p} \cap B_{gt})$ is the area of overlap between the predicted bounding box and the ground truth and $\text{area}(B_{p} \cup B_{gt})$ is the area of their union. Algorithm 1 shows the K-means pseudocode used in this paper.
Algorithm 1 Pseudocode for the algorithm to generate the new anchor boxes.
1. Randomly create K cluster centers (w_i, h_i), i = 1, ..., K, where (w_i, h_i) refers to the width and height of an anchor box.
2. For each ground-truth box, compute the distance d = 1 − IOU(box, centroid) to every cluster center and assign the box to the nearest cluster.
3. Regenerate the cluster centers as the mean width and height of the boxes assigned to each cluster.
4. Repeat step 2 and step 3 until the clusters converge.
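A runnable NumPy version of Algorithm 1 is sketched below (our illustration; `boxes` is assumed to be an N × 2 array of ground-truth widths and heights, and the IOU is computed as if boxes shared the same top-left corner, as is usual for anchor clustering):

```python
import numpy as np

def wh_iou(boxes, centroids):
    """IOU between (w, h) pairs, assuming co-located top-left corners."""
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + \
            centroids[None, :, 0] * centroids[None, :, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=300, seed=0):
    """K-means with d = 1 - IOU as the distance (Equation (3))."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]   # step 1
    assign = np.zeros(len(boxes), dtype=int)
    for _ in range(iters):
        d = 1.0 - wh_iou(boxes, centroids)          # distance to each centroid
        new_assign = d.argmin(axis=1)               # step 2: assign boxes
        if np.array_equal(new_assign, assign):      # step 4: clusters converged
            break
        assign = new_assign
        for j in range(k):                          # step 3: update centroids
            if np.any(assign == j):
                centroids[j] = boxes[assign == j].mean(axis=0)
    return centroids[np.argsort(centroids.prod(axis=1))]   # sorted by box area

# Example with random box sizes standing in for a real label file.
anchors = kmeans_anchors(np.abs(np.random.randn(1000, 2)) * 40 + 5)
print(np.round(anchors, 1))
```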
We applied K-means clustering to the Brainwash, HollywoodHeads, and SCUT-HEAD datasets. The average IOU obtained for various values of k is shown in Figure 7; as k increases, the change in the objective function becomes more and more stable. Since our method has three detection layers, we chose nine anchor boxes. Table 3 shows the widths and heights of the corresponding clusters for the Brainwash, HollywoodHeads, and SCUT-HEAD datasets.
Figure 7.
K-means clustering analysis results on the Brainwash, Hollywoodheads, and SCUT-HEAD datasets.
Table 3.
The corresponding clusters on the Brainwash, Hollywoodheads, and SCUT-HEAD datasets.
4. Experiment
4.1. Datasets
The HollywoodHeads dataset was proposed by Tuan-Hung Vu et al. [9]. It was collected from scenes of 21 Hollywood movies and contains 224,740 images with a total of 369,846 head annotations; each image contains relatively few heads. The training set consists of 216,719 image frames from 15 movies, the validation set consists of 6719 image frames from three movies, and the test set consists of 1032 image frames from three other movies. We used the same dataset partitioning when evaluating our proposed method.
The Brainwash dataset [8] is a large collection of images of crowded scenes captured by public cameras. It consists of 11,917 images sampled from coffee shop surveillance videos at a fixed interval of 100 s, with a high average number of heads per image (7.89) and 91,146 annotated heads in total. The training set contains 10,917 images with 82,906 annotations, the validation set contains 500 images with 3318 annotations, and the test set contains 500 images with 4992 annotations. Compared with the HollywoodHeads dataset, each Brainwash image contains a larger number of heads.
The SCUT-HEAD dataset [32] is a large-scale head detection dataset containing 4405 images with 111,251 labeled heads. Its scenes are extremely dense, with an average of 25.2 heads per image, which makes the dataset very challenging. Part A includes 2000 images with 67,321 head annotations sampled from classroom surveillance videos. Part B includes 2405 images with 43,930 head annotations obtained by Internet crawling. Both parts of the SCUT-HEAD dataset are divided into training and test sets. Table 4 shows a detailed comparison of the three datasets.
Table 4.
Comparison of the three head datasets.
4.2. Evaluation Metrics
For a fair comparison, our experimental results were evaluated using average precision (AP). We used an IOU threshold of 0.5 as the benchmark for plotting precision–recall (PR) curves, and the area under the curve gives the AP value. With this threshold, a head is considered correctly detected when the IOU between the predicted box and the ground truth is ≥0.5 [33]. Precision, recall, and F1-score (the latter applied to the SCUT-HEAD dataset) are computed under this criterion, where precision is the fraction of detections that are true positives and recall is the fraction of ground-truth heads that are detected. In general, to evaluate our approach, we used three standard metrics: precision (P), recall (R), and AP at the given IOU threshold. All experiments were performed on an NVIDIA Tesla V100 GPU with 32 GB of memory. The evaluation metrics are calculated as follows:

$$P = \frac{TP}{TP + FP} \tag{5}$$

$$R = \frac{TP}{TP + FN} \tag{6}$$

$$AP = \int_{0}^{1} P(R)\, dR \tag{7}$$

where $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, respectively.
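For reference, here is a small NumPy sketch of how AP can be computed as the area under the PR curve using the standard all-point interpolation (our illustration; `scores` and `matches` are hypothetical arrays of detection confidences and 0/1 flags marking whether each detection matched a ground-truth head at IOU ≥ 0.5):

```python
import numpy as np

def average_precision(scores, matches, n_gt):
    """Area under the precision-recall curve (all-point interpolation)."""
    order = np.argsort(-scores)                 # rank detections by confidence
    tp = np.cumsum(matches[order])
    fp = np.cumsum(1 - matches[order])
    precision = tp / (tp + fp)
    recall = tp / n_gt
    # Envelope: make precision non-increasing, then integrate over recall.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)
        prev_r = r
    return ap

# Toy example: 6 detections, 5 ground-truth heads.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
matches = np.array([1, 1, 0, 1, 0, 1])          # 1 = IOU >= 0.5 with a GT head
print(round(average_precision(scores, matches, n_gt=5), 3))
```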
4.3. Ablation Study
As mentioned above, based on the combination of ResNet and DenseNet, we propose a DR-Net feature extraction network to achieve feature reuse, to reduce the number of parameters, and to improve the rate of transmission of information between layers. Furthermore, to extract more information about small targets and to improve the accuracy of head detection, we introduce the MDC spatial feature pyramid feature fusion, which is a structure located in front of the YOLO head. Compared with Darknet-53, our method improved AP by 0.086, 0.107, and 0.118 in the HollywoodHeads, Brainwash, and SCUT-HEAD test sets, respectively.
Moreover, we considered using additional object detection scales. YOLOv3 has three detection layers, corresponding to large, medium, and small objects; with the default input size of 416 × 416, the detection layer sensitive to small objects is 52 × 52. We expanded detection to a larger 104 × 104 scale (a fourth detection layer) to detect heads and found that the AP values improved by 0.003, 0.004, and 0.01, respectively. However, the extra detection layer increases computation, and the FPS dropped from 15 (DR-Net+MDC) to 9. Considering both speed and accuracy, we finally adopted the DR-Net+MDC structure. A detailed comparison of our tests with the Darknet-53 baseline is shown in Figure 8 and Table 5, which demonstrates the method's validity.
Figure 8.
The PR curves of the ablation experiments on three datasets.
Table 5.
Comparison of the ablation experiment results on three datasets.
5. Results
5.1. Results on the Brainwash Dataset
After the ablation studies described above, we conducted experiments with DR-Net+MDC as the backbone. The Brainwash dataset represents real scenes: all images are taken from a coffee shop surveillance camera, and the distribution of heads is dense. We compare the proposed method with several representative detectors, such as FCHD [34], E2PD [8], SSD [12], HeadNet [35], and FRCN [15]. Our proposed method exceeds the other baselines; compared with the current best, HeadNet, it improves the AP value by 1.1% on the Brainwash test set at an IOU threshold of 0.5. The comparison of the different methods is shown in Table 6, and the PR curves are shown in Figure 9.
Table 6.
Results of our method compared with other methods on the Brainwash dataset.
Figure 9.
The PR curves of our method versus other methods from the experiments on the Brainwash dataset.
5.2. Results on the HollywoodHeads Dataset
Several methods were tested on the HollywoodHeads dataset. First, we used the general CNN-based object detection methods FRCN and SSD; then, considering that head detection is similar to face detection, we selected the CNN-based face detector DPM-Face [36]. Second, we compared our model with the novel TKD method [37], which uses LSTM [38]; the experimental results show that our method is superior to TKD. Moreover, two recent head detectors, FCHD and HeadNet, were selected for comparison. Since DPM-Face is designed for face detection, the large number of profile faces and backs of heads in this dataset leads to a low detection rate for it. Our method achieves an AP of 84.8% on the HollywoodHeads dataset, which is better than the other methods (Table 7 and Figure 10).
Table 7.
Results of our method compared with other methods on the HollywoodHeads dataset.
Figure 10.
The PR curves of our method versus other methods from the experiments on the HollywoodHeads dataset.
5.3. Results on the SCUT-HEAD Dataset
We evaluated our model using precision, recall, and F1-score on the SCUT-HEAD dataset. The F1-score is defined in Equation (8):

$$F_{1} = \frac{2 \times P \times R}{P + R} \tag{8}$$

We conducted comparison experiments with FRCN, YOLOv3, SSD, SMD [39], and R-FCN+FRN [40]. Our precision is slightly lower than that of SMD, probably because we keep YOLOv3's three-scale detection, whereas SMD uses a scale map to estimate the proportion of each head in the image and separates heads into candidate regions, focusing on head size and precision. In contrast, our method performs regression and prediction in a single pass without candidate regions, which slightly lowers the precision. The detailed comparison is shown in Table 8.
Table 8.
Results of our method compared with other methods on the SCUT-HEAD dataset.
5.4. Qualitative Results on Three Datasets
We display some visualization results of our method and other methods on the test sets of the three public datasets. As shown in Figure 11, Figure 12 and Figure 13, the yellow bounding boxes are the predictions of YOLOv3, FRCN, and our proposed method; the green bounding boxes are the ground-truth labels of the test images; the red boxes indicate missed or false detections of heads; and the blue boxes indicate heads that are not labeled in the dataset due to annotation errors but were detected by our method. The comparison shows that our method has lower missed-detection and false-detection rates than the other methods, demonstrating that the proposed DR-Net+MDC detects small targets better.
Figure 11.
Qualitative results on the Brainwash dataset. The green boxes are the ground truth, the yellow boxes are the detection results, the red boxes mark missed and false detections, and the blue boxes mark heads detected by our method that are not labeled in the picture. Different rows show the experimental results of different methods on the Brainwash dataset.
Figure 12.
Qualitative results with different methods on the HollywoodHeads dataset.
Figure 13.
Qualitative results with different methods on the SCUT-HEAD dataset.
6. Conclusions
In this paper, we proposed a feature extraction network, DR-Net, based on the combination of DenseNet and ResNet, which reduces the number of parameters and improves the information transmission rate between layers to obtain more semantic information. At present, most neural networks still suffer from missed detections of small objects. We therefore proposed embedding the MDC structure after the FPN, fusing dilated convolution modules with different dilated rates to improve sensitivity to heads. Extensive experiments were conducted on three challenging datasets, and the results show that the proposed method significantly improves the accuracy of head detection.
Author Contributions
This paper proposes an image feature extraction network based on DR-Net and an MDC module that uses fine-grained feature information to improve the detection accuracy of small targets. Conceptualization and methodology, J.L. and Y.Z.; formal analysis and writing—original draft preparation, J.L.; dataset collection and analysis, Z.W. and M.N.; supervision and writing—review and editing, Y.Z., Y.W., and J.X.; All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the Research Foundation for Advanced Talents of Guizhou University under grant (2016) No. 49, Key Disciplines of Guizhou Province-Computer Science and Technology (ZDXK[2018] 007) and Key Supported Disciplines of Guizhou Province-Computer Application Technology (No.Qian Xue WeiHeZi ZDXK [2016]20), and the work was also supported by the National Natural Science Foundation of China (61462013 and 61661010).
Data Availability Statement
The HollywoodHeads dataset: https://www.di.ens.fr/willow/research/headdetection/; the SCUT-HEAD dataset: https://github.com/wanjinchang/SCUT-HEAD-Dataset-Release, and the Brainwash dataset: http://datasets.d2.mpiinf.mpg.de/brainwash/brainwash.tar.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Zheng, L.; Huang, Y.; Lu, H.; Yang, Y. Pose-invariant embedding for deep person re-identification. IEEE Trans. Image Process. 2019, 28, 4500–4509.
- Tian, Y.; Dehghan, A.; Shah, M. On detection, data association and segmentation for multi-target tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2146–2160.
- Yao, H.; Zhang, S.; Hong, R.; Zhang, Y.; Xu, C.; Tian, Q. Deep representation learning with part loss for person re-identification. IEEE Trans. Image Process. 2019, 28, 2860–2871.
- Kondo, Y. Automatic Drive Assist System, Automatic Drive Assist Method, and Computer Program. U.S. Patent App. 15/324,582, 20 July 2017.
- Basalamah, S.; Khan, S.D.; Ullah, H. Scale driven convolutional neural network model for people counting and localization in crowd scenes. IEEE Access 2019, 7, 71576–71584.
- Choi, J.W.; Yim, D.H.; Cho, S.H. People counting based on an IR-UWB radar sensor. IEEE Sens. J. 2017, 17, 5717–5727.
- Ballotta, D.; Borghi, G.; Vezzani, R.; Cucchiara, R. Fully convolutional network for head detection with depth images. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 752–757.
- Stewart, R.; Andriluka, M.; Ng, A.Y. End-to-end people detection in crowded scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2325–2333.
- Vu, T.H.; Osokin, A.; Laptev, I. Context-aware CNNs for person head detection. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2893–2901.
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
- Hariharan, B.; Arbeláez, P.; Girshick, R.; Malik, J. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 447–456.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
- Shalev-Shwartz, S.; Singer, Y.; Srebro, N.; Cotter, A. Pegasos: Primal estimated sub-gradient solver for SVM. Math. Program. 2011, 127, 3–30.
- Margineantu, D.D.; Dietterich, T.G. Pruning adaptive boosting. In ICML; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1997; Volume 97, pp. 211–218.
- Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 32, 1627–1645.
- Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–26 June 2005; Volume 1, pp. 886–893.
- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
- Uijlings, J.R.; Van De Sande, K.E.; Gevers, T.; Smeulders, A.W. Selective search for object recognition. Int. J. Comput. Vis. 2013, 104, 154–171.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122.
- Wang, P.; Chen, P.; Yuan, Y.; Liu, D.; Huang, Z.; Hou, X.; Cottrell, G. Understanding convolution for semantic segmentation. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1451–1460.
- Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object detection via region-based fully convolutional networks. arXiv 2016, arXiv:1605.06409.
- Everingham, M.; Eslami, S.A.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes challenge: A retrospective. Int. J. Comput. Vis. 2015, 111, 98–136.
- Vora, A.; Chilaka, V. FCHD: Fast and accurate head detection in crowded scenes. arXiv 2018, arXiv:1809.08766.
- Li, W.; Li, H.; Wu, Q.; Meng, F.; Xu, L.; Ngan, K.N. HeadNet: An end-to-end adaptive relational network for head detection. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 482–494.
- Ranjan, R.; Patel, V.M.; Chellappa, R. A deep pyramid deformable part model for face detection. In Proceedings of the 2015 IEEE 7th International Conference on Biometrics Theory, Applications and Systems (BTAS), Arlington, VA, USA, 8–11 September 2015; pp. 1–8.
- Bajestani, M.F.; Yang, Y. TKD: Temporal knowledge distillation for active perception. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Village, CO, USA, 1–5 March 2020; pp. 953–962.
- Si, C.; Chen, W.; Wang, W.; Wang, L.; Tan, T. An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1227–1236.
- Sun, Z.; Peng, D.; Cai, Z.; Chen, Z.; Jin, L. Scale mapping and dynamic re-detecting in dense head detection. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 1902–1906.
- Peng, D.; Sun, Z.; Chen, Z.; Cai, Z.; Xie, L.; Jin, L. Detecting heads using feature refine net and cascaded multi-scale architecture. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 2528–2533.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).