Abstract
Pedestrian detection in complex scenes suffers from occlusion, such as occlusions between pedestrians. Compared with the high variability of the human body, the shape of the head and shoulders changes little and is highly stable, so head detection has become an important research direction within pedestrian detection. The translational invariance of convolutional neural networks allows a deep network to recognize a target even when its appearance and location change; however, the lack of scale invariance and the high miss rate for small targets remain problems. In this paper, a feature extraction network, DR-Net, based on Darknet-53 is proposed to improve the information transmission rate between convolutional layers and to extract richer semantic information. In addition, an MDC (mixed dilated convolution) module combining dilated convolutions with different sampling rates is embedded to improve the detection rate of small targets. We evaluated our method on three publicly available datasets and achieved excellent results: the AP (Average Precision) reached 92.1% on the Brainwash dataset, 84.8% on the HollywoodHeads dataset, and 90% on the SCUT-HEAD dataset.
1. Introduction
Pedestrian detection is an important research area in computer vision and underpins many related applications, including person re-identification [1], pedestrian tracking [2], and autonomous driving [3,4]. Generic pedestrian detection performs poorly in crowded and heavily occluded scenes, so head detection methods have emerged; crowd counting in surveillance video [5,6] is considered one of the most important applications of head detection. Object detection has made great progress with the development of CNNs and deep learning, but head detection is still a daunting task in complex crowd scenes [7], where the objects are highly diverse, strongly occluded, blurred by motion, low in resolution, and poor in distinctive features.
In recent years, many head detectors based on deep learning have emerged [8,9], and head detection is now treated by most approaches as a special type of object detection. Since the scale and appearance of heads vary, extracting features effectively to localize a head and to separate it from the background is still a challenge. One of the better performers is the RCNN [10]-based head detector of [9], which uses two models for head detection: a global model that produces a multi-scale heat map of the probability of head presence, and a local model that uses selective search (SS) proposals to limit the set of target hypotheses; the RCNN head detection framework is completed by combining cues from both models. Hariharan et al. [11] used hypercolumns, the stacked activations of all network layers corresponding to a pixel, as features for fine-grained localization of targets. SSD [12] and YOLO [13] use multi-scale features to predict class probabilities and bounding box coordinates; YOLO uses a residual network (ResNet) [14]-style backbone for feature extraction together with a spatial feature pyramid for contextual feature fusion, and both detectors are much faster than Faster RCNN [15]. However, their accuracy on small objects is still poor.
In general, head detection in complex scenes still faces many challenges because the scale, appearance, and pose of human heads vary widely. Therefore, we propose an improved feature extraction network, DR-Net, based on Darknet-53 [13]. Its principle is that DenseNet [16] blocks are embedded in Darknet-53: identity mappings pass the outputs of earlier layers forward, and all preceding layers are concatenated along the channel dimension so that features can be reused. Compared with ResNet, this improves the backpropagation of gradients, makes better use of feature information, and improves the rate of information transmission between layers.
2. Related Works
2.1. Image Feature Extraction Network
For object detection, the extraction of image features is crucial for model training. Feature extraction has evolved from early hand-crafted descriptors, such as the scale-invariant feature transform (SIFT) [17] and the histogram of oriented gradients (HOG), to convolutional neural networks. In 1998, LeCun [18] proposed LeNet-5, an epoch-making and far-reaching convolutional neural network structure that combined convolution with the neural network and introduced the two new concepts of convolution and pooling. Later, with the development of deep learning theory, Alex Krizhevsky proposed the heavyweight network AlexNet [19], an eight-layer deep convolutional neural network that substantially improved classification accuracy on the ImageNet dataset. A key breakthrough was that the network was implemented on GPUs, which greatly shortened training time. Since AlexNet, researchers have proposed various convolutional neural network models with increasingly better performance and different network structures; the most famous are VGGNet [20], GoogLeNet [21], and ResNet. Darknet-53 is a fully convolutional network that borrows the concept of ResNet, using a large number of residual modules with skip connections and strided convolutions instead of pooling for downsampling, so that features are extracted fully and the negative effect of pooling on gradients is reduced.
DenseNet is a densely connected network that builds on the earlier idea of shortcut-connected networks. DenseNet connects all layers, i.e., the outputs of all preceding layers are concatenated to form the current layer's input, significantly improving the transmission of information through the network. In this paper, we combine DenseNet and ResNet to create a novel feature extraction network that increases the depth of the network model and improves the flow of information.
2.2. Head Detection
Head detection is one of the important research directions in computer vision, and most related works treat it as a specific type of object detection. Traditional head detection methods use hand-designed, complex feature operators to extract features and then use Support Vector Machines (SVM) [22], Adaptive Boosting (AdaBoost) [23], Deformable Part Models (DPM) [24], and other algorithms to classify the extracted features. The DPM proposed in [24] utilizes the HOG feature [25] and has been used extensively in the field of object detection. However, traditional methods have poor portability and unsatisfactory performance and often fail to meet detection requirements in realistic scenarios.
In recent years, convolutional neural networks have made great progress in object detection, and deep neural networks have become the preferred method for object detection tasks. Current mainstream CNN object detection methods can be divided into two categories: single-stage and two-stage frameworks. Single-stage frameworks such as YOLO and SSD use regression to generate bounding boxes and classify them by scoring the bounding box categories. Two-stage frameworks include RCNN and its improved versions Fast RCNN [26], Faster RCNN, and SPPNet [27]. RCNN uses candidate region proposals to create regions of interest (ROI) for object detection; the proposals generated by the selective search (SS) [28] method are then sent to a feedforward network. Shaoqing Ren et al. continued to innovate on Fast RCNN and proposed the Faster RCNN model, which replaces the SS algorithm for extracting candidate regions with a region proposal network (RPN). YOLO outperforms Faster RCNN in detection speed, but at the cost of accuracy.
Although the abovementioned models have achieved significant performance in classifying multiple objects in images, they struggle to identify tiny targets because most of them use only the features of the last convolutional layer for detection, and the final convolutional layer contains insufficient information about tiny targets. Since heads are usually small (often only 10 to 20 pixels) in the head detection problem, such methods in their current form are not well suited to detecting them.
3. Proposed Methodology
3.1. DR-Net: ResNet Combined with DenseNet for Feature Extraction Network
The image feature extraction network of YOLOv3 is primarily based on the concept of the residual network. Darknet-53 uses multiple residual modules to form a ResNet with a large number of parameters, which consumes most of YOLOv3's computing resources. Using ResNet for image feature extraction is a common approach, but the deep networks it builds are computationally intensive, which reduces computational speed. DenseNet passes the outputs of earlier layers forward through identity mappings and concatenates all preceding layers along the channel dimension so that features are reused. Compared with ResNet, DenseNet improves information efficiency and gradient transmission: each layer obtains a gradient directly from the loss function and receives the input signal directly, so a deeper network can be trained. Moreover, this connectivity also has a regularizing effect. Other networks aim to improve performance through depth and width, while DenseNet improves performance from the perspective of feature reuse. The structure of DenseNet is shown in Figure 1.
Figure 1.
The structure of DenseNet.
An obvious difference between DenseNet and ResNet is that ResNet combines features by element-wise addition through shortcuts, while DenseNet combines them by concatenation. Each layer in DenseNet receives the outputs of all preceding layers as its input. In Figure 1, $x_0$, $x_1$, $x_2$, and $x_3$ represent the output feature maps, while $H_1$, $H_2$, $H_3$, and $H_4$ are nonlinear transformations. A common convolutional network with $L$ layers has $L$ connections, while an $L$-layer DenseNet contains $L(L+1)/2$ connections, with each layer linked to all subsequent layers. Thus, each layer receives the feature maps of all preceding layers. The relationship between the feature maps of DenseNet's layers is shown in Equation (1):

$$x_{l} = H_{l}\big([x_{0}, x_{1}, \ldots, x_{l-1}]\big) \tag{1}$$

where $[x_{0}, x_{1}, \ldots, x_{l-1}]$ denotes the channel-wise concatenation of the feature maps produced in layers $0$ to $l-1$.
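To make the dense connection in Equation (1) concrete, the following PyTorch sketch contrasts residual addition with dense concatenation (this is our illustration, not the authors' released code; the channel sizes are placeholder assumptions):

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One dense layer H_l: takes the concatenation of all earlier
    feature maps and produces `growth` new channels (Equation (1))."""
    def __init__(self, in_channels, growth):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, growth, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(growth),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, features):            # features: list [x_0, ..., x_{l-1}]
        x = torch.cat(features, dim=1)      # channel-wise concatenation
        return self.conv(x)

# ResNet-style fusion adds feature maps; DenseNet-style fusion concatenates them.
x0 = torch.randn(1, 64, 52, 52)
x1 = DenseLayer(in_channels=64, growth=32)([x0])       # x_1 = H_1([x_0])
x2 = DenseLayer(in_channels=64 + 32, growth=32)([x0, x1])  # x_2 = H_2([x_0, x_1])
print(x2.shape)                                        # torch.Size([1, 32, 52, 52])
```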
The DR-Net proposed in this paper is an image feature extraction network based on the combination of DenseNet and ResNet. The DBL (Conv2d-BN-LeakyReLU) module is composed of a convolution, Batch Normalization, and Leaky ReLU. Two DBL modules constitute the Double-DBL (D-DBL) module, which is used as the transport layer. The detailed composition of the transport layer is shown in Table 1.
Table 1.
The internal channel information from the transport layer.
We use 1 × 1 convolutions to compress the number of channels and 3 × 3 convolutions to expand it. However, the original, unpruned DenseBlocks contain 6, 12, 24, and 32 convolutional layers, respectively; such a deep DenseNet produces redundant feature maps and slows down detection. Considering the combination with the residual blocks, we instead set up four-layer DenseBlocks. Figure 2 shows the structures of DenseBlock1 and DenseBlock2, where DenseBlock1 adds 128 feature maps and DenseBlock2 adds 256 feature maps per layer. After the identity mappings concatenate all of the channels, a 1 × 1 convolution is used for dimensionality reduction, decreasing the number of feature maps passed to the subsequent layers and reducing computational cost. The specific structure of DR-Net is shown in Figure 3. The proposed DR-Net reduces the network's reliance on residual modules and obtains more semantic information. To better illustrate the internal workings of DR-Net, Table 2 records the specific parameter changes applied to the original image during feature extraction.
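As a rough illustration of these building blocks, the sketch below assembles a DBL module, a D-DBL transport layer, and a four-layer DenseBlock; the channel widths and growth rates shown are assumptions for illustration, not the exact values of Table 1 and Table 2:

```python
import torch
import torch.nn as nn

def dbl(in_ch, out_ch, k):
    """DBL module: convolution + Batch Normalization + Leaky ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

def d_dbl(in_ch, growth):
    """Double-DBL transport layer: 1x1 DBL to compress, 3x3 DBL to expand."""
    return nn.Sequential(dbl(in_ch, growth // 2, 1), dbl(growth // 2, growth, 3))

class DenseBlock(nn.Module):
    """Four-layer DenseBlock: each D-DBL sees the concatenation of all
    previous outputs; a final 1x1 DBL reduces the dimensionality."""
    def __init__(self, in_ch, growth, out_ch):
        super().__init__()
        self.layers = nn.ModuleList(
            d_dbl(in_ch + i * growth, growth) for i in range(4)
        )
        self.reduce = dbl(in_ch + 4 * growth, out_ch, 1)

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return self.reduce(torch.cat(feats, dim=1))

block = DenseBlock(in_ch=256, growth=128, out_ch=256)     # DenseBlock1-like sizes
print(block(torch.randn(1, 256, 52, 52)).shape)           # torch.Size([1, 256, 52, 52])
```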
Figure 2.
The internal parameters and structure of DenseBlock.
Figure 3.
The structure of the DR-Net feature extraction network.
Table 2.
Parameters of the DR-Net feature extraction network.
Some randomly selected feature maps are shown in Figure 4. Deep feature maps have lower resolution and more abstract semantic information, which makes them difficult for humans to compare and analyze; therefore, shallow feature maps, which retain more of the original features, are used for visualization and analysis. The left side shows the original input image, the middle shows shallow feature maps extracted by our DR-Net, and the right side shows shallow feature maps extracted by Darknet-53. We use pseudo-color to visualize the feature maps: the brighter the color, the more the network attends to that region. The visualization shows that the feature maps extracted by DR-Net are brighter in the classroom area where the students are located, while those of Darknet-53 are somewhat darker, indicating that more detailed information is available in our DR-Net.
Figure 4.
Comparison of DR-Net and Darknet-53 partial shallow feature map visualization.
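A minimal sketch of this kind of pseudo-color feature map visualization is given below (our illustration; `backbone` stands for any feature extractor, such as DR-Net or Darknet-53 truncated at a shallow layer, which is an assumption about how the maps in Figure 4 were produced):

```python
import torch
import matplotlib.pyplot as plt

def show_shallow_features(backbone, image, n_maps=8, cmap="jet"):
    """Run the image through the backbone, take the returned shallow feature
    map, and display a few channels in pseudo-color (brighter = stronger response)."""
    backbone.eval()
    with torch.no_grad():
        feats = backbone(image.unsqueeze(0))      # assumed output shape: (1, C, H, W)
    fig, axes = plt.subplots(1, n_maps, figsize=(2 * n_maps, 2))
    for i, ax in enumerate(axes):
        ax.imshow(feats[0, i].cpu(), cmap=cmap)   # one channel per panel
        ax.axis("off")
    plt.show()
```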
3.2. Mixed Dilated Convolution
Although the human head-and-shoulder region has minimal shape variation and high stability, the head is a small target compared with other objects in an image and occupies a small proportion of the pixels. Feature Pyramid Networks (FPN) [29] mainly address the model's deficiency in handling multi-scale variation in object detection tasks, and they do so with a very small computational increase. In addition, dilated convolution [30] can extract feature maps with different semantic information according to different dilated rates. After the feature fusion in the FPN part of YOLOv3, mixed dilated convolution is used to further fuse context information and to expand the receptive field of the feature maps for better detection of small targets. From [31], it is known that stacking dilated convolutions can produce a gridding effect, which skips some pixels and breaks the continuity of information. The stacked dilated rates therefore need to satisfy Equation (2):

$$M_{i} = \max\left[M_{i+1} - 2r_{i},\; 2r_{i} - M_{i+1},\; r_{i}\right] \tag{2}$$

where $r_i$ is the dilated rate of layer $i$ and $M_i$ is the maximum dilated rate usable at layer $i$ (the maximum distance between two nonzero values). Assuming that there are $n$ layers in total, $M_n = r_n$, and the design goal is to satisfy $M_2 \le K$, where $K$ is the kernel size; a simple example from [31] for 3 × 3 kernels is the rate sequence $[1, 2, 5]$. Therefore, in this section, we propose an MDC structure composed of dilated convolutions whose dilated rates are chosen to satisfy this constraint. The MDC structure shown in Figure 5 is the module placed after the first FPN fusion. In this module, we use three dilated convolutions with different dilated rates in parallel, and the dilated rate determines the size of the corresponding receptive field. First, we compress the number of channels of the fused feature map $F$ with a 1 × 1 convolution to obtain the feature map $F_0$. Dilated convolutions with the different dilated rates are then applied to $F_0$ to obtain $F_1$, $F_2$, and $F_3$. Finally, we concatenate $F_0$, $F_1$, $F_2$, and $F_3$ to obtain the output feature map $F_{out}$. The model thus receives information from receptive fields of several sizes, which is fused to extract richer semantic information, in particular for small targets.
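One possible PyTorch realization of the MDC module described above is sketched here (our illustration; the channel counts and the default rates (1, 2, 5) are assumptions, and `hdc_ok` implements the gridding constraint of Equation (2) from [31]):

```python
import torch
import torch.nn as nn

def hdc_ok(rates, kernel=3):
    """Check the gridding constraint of Equation (2): M_2 <= kernel size."""
    M = rates[-1]                          # M_n = r_n
    for r in reversed(rates[1:-1]):        # compute M_{n-1}, ..., M_2
        M = max(M - 2 * r, 2 * r - M, r)
    return M <= kernel

class MDC(nn.Module):
    """Mixed dilated convolution: compress with a 1x1 convolution, sample with
    three parallel dilated 3x3 convolutions, then concatenate all four branches."""
    def __init__(self, in_ch, mid_ch, rates=(1, 2, 5)):
        super().__init__()
        assert hdc_ok(rates), "dilated rates would cause a gridding effect"
        self.compress = nn.Conv2d(in_ch, mid_ch, kernel_size=1)
        self.branches = nn.ModuleList(
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        )

    def forward(self, x):
        f0 = self.compress(x)                          # F_0
        outs = [f0] + [b(f0) for b in self.branches]   # F_0, F_1, F_2, F_3
        return torch.cat(outs, dim=1)                  # F_out

mdc = MDC(in_ch=512, mid_ch=128)
print(mdc(torch.randn(1, 512, 52, 52)).shape)          # torch.Size([1, 512, 52, 52])
```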
Figure 5.
The structure of MDC.
We integrate two MDC modules in front of the YOLO heads. When the connections of the spatial feature pyramid are complete, the four context-aware feature maps are fused by MDC and then sent to the YOLO head for detection. Our proposed MDC-based spatial feature pyramid structure is shown in Figure 6. Extracting image features with DR-Net yields feature maps with rich semantic information, but target localization from deep layers alone is not accurate, so deep semantic information is fused with shallow layers by upsampling to obtain both detailed information and rich semantic information. The MDC module then samples and fuses the feature maps with dilated convolutions of different dilated rates to expand the receptive field and exploit more fine-grained feature information.
Figure 6.
The complete network structure diagram. The features are first extracted by DR-Net; the FPN is then completed by a series of upsampling and fusion operations, and the fused features are sent to MDC to obtain a larger receptive field.
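As a rough sketch of the upsample-and-concatenate fusion in Figure 6 (shapes assume a 416 × 416 input; the channel counts are illustrative assumptions, not the authors' exact configuration):

```python
import torch
import torch.nn as nn

# Three feature scales from the backbone for a 416x416 input.
c3 = torch.randn(1, 256, 52, 52)    # shallow, fine detail
c4 = torch.randn(1, 512, 26, 26)    # middle scale
c5 = torch.randn(1, 1024, 13, 13)   # deep, rich semantics

up = nn.Upsample(scale_factor=2, mode="nearest")

# Deep semantics are upsampled and concatenated with shallower maps, so each
# fused map carries both detailed and semantic information before MDC and the YOLO head.
p4 = torch.cat([up(c5), c4], dim=1)   # (1, 1536, 26, 26) -> MDC -> YOLO head
p3 = torch.cat([up(c4), c3], dim=1)   # (1, 768, 52, 52)  -> MDC -> YOLO head
print(p4.shape, p3.shape)
```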
3.3. K-Means for Anchor Boxes
YOLOv3 adopts the anchor box idea of Faster RCNN: a set of initial candidate boxes with fixed widths and heights. The selection of the initial anchor boxes directly affects both accuracy and detection speed. Instead of selecting anchor boxes manually, YOLOv3 runs K-means clustering on the dataset to find prior boxes automatically. K-means is the most commonly used clustering algorithm, and its main idea is as follows: given the value of K and K initial cluster centroids, each point is allocated to the nearest cluster centroid. After all points are allocated, the cluster centroids are recalculated from the points within each cluster. The assignment and update steps are then repeated until the change in the cluster centroids is minimal or the specified number of iterations is reached. The clusters generated by K-means reflect the sample distribution of each dataset, making good predictions easier for the network. Since the target we detect is the human head, which is small relative to the other objects and the background in an image, the original anchor boxes of YOLOv3 do not suit the current scene, so we re-cluster the anchor boxes.
In this paper, we use IOU instead of Euclidean distance as the clustering metric, so that ground-truth boxes close to a centroid have higher IOU values. The distance used to cluster the anchor boxes is calculated using Equation (3):

$$d(\text{box}, \text{centroid}) = 1 - \text{IOU}(\text{box}, \text{centroid}) \tag{3}$$
We calculate the IOU of two boxes, i.e., their similarity: the smaller $d$ is, the more similar the box is to the centroid. Each box is assigned to its nearest centroid, the centroids are then updated, and the process is repeated until the change in the cluster centroids is slight or the specified number of iterations is reached. IOU is the intersection over union, defined in Equation (4):
$$\text{IOU} = \frac{\text{area}(B_{p} \cap B_{gt})}{\text{area}(B_{p} \cup B_{gt})} \tag{4}$$

where $\text{area}(B_{p} \cap B_{gt})$ is the area of overlap between the predicted bounding box and the ground truth and $\text{area}(B_{p} \cup B_{gt})$ is the area of their union. Algorithm 1 shows the K-means pseudocode used in this paper.
Algorithm 1 Pseudocode for the algorithm to generate the new anchor boxes.
1. Randomly create K cluster centers (w_i, h_i), i = 1, ..., K, where (w_i, h_i) refers to the width and height of an anchor box.
2. For each ground-truth box, compute the distance d = 1 − IOU(box, centroid) to every cluster center and assign the box to the nearest cluster.
3. Regenerate the cluster centers as the mean width and height of the boxes assigned to each cluster.
4. Repeat step 2 and step 3 until the clusters converge.
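A runnable NumPy version of Algorithm 1 is sketched below (our illustration; `boxes` is assumed to be an N × 2 array of ground-truth widths and heights, and the IOU is computed as if boxes shared the same top-left corner, as is usual for anchor clustering):

```python
import numpy as np

def wh_iou(boxes, centroids):
    """IOU between (w, h) pairs, assuming co-located top-left corners."""
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + \
            centroids[None, :, 0] * centroids[None, :, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=300, seed=0):
    """K-means with d = 1 - IOU as the distance (Equation (3))."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]   # step 1
    assign = np.zeros(len(boxes), dtype=int)
    for _ in range(iters):
        d = 1.0 - wh_iou(boxes, centroids)          # distance to each centroid
        new_assign = d.argmin(axis=1)               # step 2: assign boxes
        if np.array_equal(new_assign, assign):      # step 4: clusters converged
            break
        assign = new_assign
        for j in range(k):                          # step 3: update centroids
            if np.any(assign == j):
                centroids[j] = boxes[assign == j].mean(axis=0)
    return centroids[np.argsort(centroids.prod(axis=1))]   # sorted by box area

# Example with random box sizes standing in for a real label file.
anchors = kmeans_anchors(np.abs(np.random.randn(1000, 2)) * 40 + 5)
print(np.round(anchors, 1))
```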
We applied K-means clustering to the Brainwash, HollywoodHeads, and SCUT-HEAD datasets. The average IOU obtained for various values of k is shown in Figure 7; as k increases, the change in the objective function becomes more and more stable. Since our method has three detection layers, we chose nine anchor boxes. Table 3 shows the widths and heights of the corresponding clusters for the Brainwash, HollywoodHeads, and SCUT-HEAD datasets.
Figure 7.
K-means clustering analysis results on the Brainwash, Hollywoodheads, and SCUT-HEAD datasets.
Table 3.
The corresponding clusters on the Brainwash, Hollywoodheads, and SCUT-HEAD datasets.
4. Experiment
4.1. Datasets
The HollywoodHeads dataset was proposed by Tuan-Hung Vu et al. [9]. It was collected from scenes of 21 Hollywood movies and contains 224,740 images with a total of 369,846 head annotations; each image contains relatively few heads. The training set consists of 216,719 image frames from 15 movies, the validation set consists of 6719 image frames from three movies, and the test set consists of 1032 image frames from three other movies. We used the same dataset partitioning when evaluating our proposed method.
The Brainwash dataset [8] is a large collection of images of crowded scenes captured by public cameras. It consists of 11,917 images sampled from coffee shop surveillance videos at a fixed interval of 100 s, with a high average number of heads per image (7.89) and 91,146 annotated heads in total. The training set contains 10,917 images with 82,906 annotations, the validation set contains 500 images with 3318 annotations, and the test set contains 500 images with 4992 annotations. Compared with the HollywoodHeads dataset, each Brainwash image contains a larger number of heads.
The SCUT-HEAD dataset [32] is a large-scale head detection dataset containing 4405 images with 111,251 labeled heads. Its scenes are extremely dense, with an average of 25.2 heads per image, which makes the dataset very challenging. Part A includes 2000 images with 67,321 head annotations sampled from classroom surveillance videos. Part B includes 2405 images with 43,930 head annotations obtained by Internet crawling. Both parts of the SCUT-HEAD dataset are divided into training and test sets. Table 4 shows a detailed comparison of the three datasets.
Table 4.
Comparison of the three head datasets.
4.2. Evaluation Metrics
For a fair comparison, our experimental results were evaluated using average precision (AP). We used an IOU threshold of 0.5 as the benchmark for plotting precision–recall (PR) curves, and the area under the curve gives the AP value. With this threshold, a head is considered correctly detected when the IOU between the predicted box and the ground truth is ≥0.5 [33]. Precision, recall, and F1-score (the latter applied to the SCUT-HEAD dataset) are computed under this criterion, where precision is the fraction of detections that are true positives and recall is the fraction of ground-truth heads that are detected. In general, to evaluate our approach, we used three standard metrics: precision (P), recall (R), and AP at the given IOU threshold. All experiments were performed on an NVIDIA Tesla V100 GPU with 32 GB of memory. The evaluation metrics are calculated as follows:

$$P = \frac{TP}{TP + FP} \tag{5}$$

$$R = \frac{TP}{TP + FN} \tag{6}$$

$$AP = \int_{0}^{1} P(R)\, dR \tag{7}$$

where $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, respectively.
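For reference, here is a small NumPy sketch of how AP can be computed as the area under the PR curve using the standard all-point interpolation (our illustration; `scores` and `matches` are hypothetical arrays of detection confidences and 0/1 flags marking whether each detection matched a ground-truth head at IOU ≥ 0.5):

```python
import numpy as np

def average_precision(scores, matches, n_gt):
    """Area under the precision-recall curve (all-point interpolation)."""
    order = np.argsort(-scores)                 # rank detections by confidence
    tp = np.cumsum(matches[order])
    fp = np.cumsum(1 - matches[order])
    precision = tp / (tp + fp)
    recall = tp / n_gt
    # Envelope: make precision non-increasing, then integrate over recall.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)
        prev_r = r
    return ap

# Toy example: 6 detections, 5 ground-truth heads.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
matches = np.array([1, 1, 0, 1, 0, 1])          # 1 = IOU >= 0.5 with a GT head
print(round(average_precision(scores, matches, n_gt=5), 3))
```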
4.3. Ablation Study
As mentioned above, based on the combination of ResNet and DenseNet, we propose a DR-Net feature extraction network to achieve feature reuse, to reduce the number of parameters, and to improve the rate of transmission of information between layers. Furthermore, to extract more information about small targets and to improve the accuracy of head detection, we introduce the MDC spatial feature pyramid feature fusion, which is a structure located in front of the YOLO head. Compared with Darknet-53, our method improved AP by 0.086, 0.107, and 0.118 in the HollywoodHeads, Brainwash, and SCUT-HEAD test sets, respectively.
Moreover, we considered using additional object detection scales. YOLOv3 has three detection layers, corresponding to large, medium, and small objects; with the default input size of 416 × 416, the detection layer sensitive to small objects is 52 × 52. We expanded detection to a larger 104 × 104 scale (a fourth detection layer) to detect heads and found that the AP values improved by 0.003, 0.004, and 0.01, respectively. However, the extra detection layer increases computation, and the FPS dropped from 15 (DR-Net+MDC) to 9. Considering both speed and accuracy, we finally adopted the DR-Net+MDC structure. A detailed comparison of our tests with the Darknet-53 baseline is shown in Figure 8 and Table 5, which demonstrates the method's validity.
Figure 8.
The PR curves of the ablation experiments on three datasets.
Table 5.
Comparison of the ablation experiment results on three datasets.
5. Results
5.1. Results on the Brainwash Dataset
After the ablation studies described above, we conducted experiments with DR-Net+MDC as the backbone. The Brainwash dataset represents real scenes: all images are taken from a coffee shop surveillance camera, and the distribution of heads is dense. We compare the proposed method with several representative detectors, such as FCHD [34], E2PD [8], SSD [12], HeadNet [35], and FRCN [15]. Our proposed method exceeds the other baselines; compared with the current best, HeadNet, it improves the AP value by 1.1% on the Brainwash test set at an IOU threshold of 0.5. The comparison of the different methods is shown in Table 6, and the PR curves are shown in Figure 9.
Table 6.
Results of our method compared with other methods on the Brainwash dataset.
Figure 9.
The PR curves of our method versus other methods from the experiments on the Brainwash dataset.
5.2. Results on the HollywoodHeads Dataset
Several methods were tested on the HollywoodHeads dataset. First, we used the general CNN-based object detection methods FRCN and SSD; then, considering that head detection is similar to face detection, we selected the CNN-based face detector DPM-Face [36]. Second, we compared our model with the novel TKD method [37], which uses LSTM [38]; the experimental results show that our method is superior to TKD. Moreover, two recent head detectors, FCHD and HeadNet, were selected for comparison. Since DPM-Face is designed for face detection, the large number of profile faces and backs of heads in this dataset leads to a low detection rate for it. Our method achieves an AP of 84.8% on the HollywoodHeads dataset, which is better than the other methods (Table 7 and Figure 10).
Table 7.
Results of our method compared with other methods on the HollywoodHeads dataset.
Figure 10.
The PR curves of our method versus other methods from the experiments on the HollywoodHeads dataset.
5.3. Results on the SCUT-HEAD Dataset
We evaluated our model using precision, recall, and F1-score on the SCUT-HEAD dataset. The F1-score is defined in Equation (8):

$$F_{1} = \frac{2 \times P \times R}{P + R} \tag{8}$$

We conducted comparison experiments with FRCN, YOLOv3, SSD, SMD [39], and R-FCN+FRN [40]. Our precision is slightly lower than that of SMD, probably because we keep YOLOv3's three-scale detection, whereas SMD uses a scale map to estimate the proportion of each head in the image and separates heads into candidate regions, focusing on head size and precision. In contrast, our method performs regression and prediction in a single pass without candidate regions, which slightly lowers the precision. The detailed comparison is shown in Table 8.
Table 8.
Results of our method compared with other methods on the SCUT-HEAD dataset.
5.4. Qualitative Results on Three Datasets
We display some visualization results of our method and other methods on the test sets of the three public datasets. As shown in Figure 11, Figure 12 and Figure 13, the yellow bounding boxes are the predictions of YOLOv3, FRCN, and our proposed method; the green bounding boxes are the ground-truth labels of the test images; the red boxes indicate missed or false detections of heads; and the blue boxes indicate heads that are not labeled in the dataset due to annotation errors but were detected by our method. The comparison shows that our method has lower missed-detection and false-detection rates than the other methods, demonstrating that the proposed DR-Net+MDC detects small targets better.
Figure 11.
Qualitative results on the Brainwash dataset. The green boxes are the ground truth, the yellow boxes are the detection results, the red boxes mark missed and false detections, and the blue boxes mark heads detected by our method that are not labeled in the picture. Different rows show the experimental results of different methods on the Brainwash dataset.
Figure 12.
Qualitative results with different methods on the HollywoodHeads dataset.
Figure 13.
Qualitative results with different methods on the SCUT-HEAD dataset.
6. Conclusions
In this paper, we proposed a feature extraction network, DR-Net, based on the combination of DenseNet and ResNet, which reduces the number of parameters and improves the information transmission rate between layers to obtain more semantic information. At present, most neural networks still suffer from missed detections of small objects. We therefore proposed embedding the MDC structure after the FPN, fusing dilated convolution modules with different dilated rates to improve sensitivity to heads. Extensive experiments were conducted on three challenging datasets, and the results show that the proposed method significantly improves the accuracy of head detection.
Author Contributions
This paper proposes an image feature extraction network based on DR-Net and an MDC module that uses fine-grained feature information to improve the detection accuracy of small targets. Conceptualization and methodology, J.L. and Y.Z.; formal analysis and writing—original draft preparation, J.L.; dataset collection and analysis, Z.W. and M.N.; supervision and writing—review and editing, Y.Z., Y.W., and J.X.; All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the Research Foundation for Advanced Talents of Guizhou University under grant (2016) No. 49, Key Disciplines of Guizhou Province-Computer Science and Technology (ZDXK[2018] 007) and Key Supported Disciplines of Guizhou Province-Computer Application Technology (No.Qian Xue WeiHeZi ZDXK [2016]20), and the work was also supported by the National Natural Science Foundation of China (61462013 and 61661010).
Data Availability Statement
The HollywoodHeads dataset: https://www.di.ens.fr/willow/research/headdetection/; the SCUT-HEAD dataset: https://github.com/wanjinchang/SCUT-HEAD-Dataset-Release, and the Brainwash dataset: http://datasets.d2.mpiinf.mpg.de/brainwash/brainwash.tar.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Zheng, L.; Huang, Y.; Lu, H.; Yang, Y. Pose-invariant embedding for deep person re-identification. IEEE Trans. Image Process. 2019, 28, 4500–4509.
- Tian, Y.; Dehghan, A.; Shah, M. On detection, data association and segmentation for multi-target tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2146–2160.
- Yao, H.; Zhang, S.; Hong, R.; Zhang, Y.; Xu, C.; Tian, Q. Deep representation learning with part loss for person re-identification. IEEE Trans. Image Process. 2019, 28, 2860–2871.
- Kondo, Y. Automatic Drive Assist System, Automatic Drive Assist Method, and Computer Program. U.S. Patent App. 15/324,582, 20 July 2017.
- Basalamah, S.; Khan, S.D.; Ullah, H. Scale driven convolutional neural network model for people counting and localization in crowd scenes. IEEE Access 2019, 7, 71576–71584.
- Choi, J.W.; Yim, D.H.; Cho, S.H. People counting based on an IR-UWB radar sensor. IEEE Sens. J. 2017, 17, 5717–5727.
- Ballotta, D.; Borghi, G.; Vezzani, R.; Cucchiara, R. Fully convolutional network for head detection with depth images. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 752–757.
- Stewart, R.; Andriluka, M.; Ng, A.Y. End-to-end people detection in crowded scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2325–2333.
- Vu, T.H.; Osokin, A.; Laptev, I. Context-aware CNNs for person head detection. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2893–2901.
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
- Hariharan, B.; Arbeláez, P.; Girshick, R.; Malik, J. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 447–456.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
- Shalev-Shwartz, S.; Singer, Y.; Srebro, N.; Cotter, A. Pegasos: Primal estimated sub-gradient solver for SVM. Math. Program. 2011, 127, 3–30.
- Margineantu, D.D.; Dietterich, T.G. Pruning adaptive boosting. In ICML; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1997; Volume 97, pp. 211–218.
- Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 32, 1627–1645.
- Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–26 June 2005; Volume 1, pp. 886–893.
- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
- Uijlings, J.R.; Van De Sande, K.E.; Gevers, T.; Smeulders, A.W. Selective search for object recognition. Int. J. Comput. Vis. 2013, 104, 154–171.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122.
- Wang, P.; Chen, P.; Yuan, Y.; Liu, D.; Huang, Z.; Hou, X.; Cottrell, G. Understanding convolution for semantic segmentation. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1451–1460.
- Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object detection via region-based fully convolutional networks. arXiv 2016, arXiv:1605.06409.
- Everingham, M.; Eslami, S.A.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes challenge: A retrospective. Int. J. Comput. Vis. 2015, 111, 98–136.
- Vora, A.; Chilaka, V. FCHD: Fast and accurate head detection in crowded scenes. arXiv 2018, arXiv:1809.08766.
- Li, W.; Li, H.; Wu, Q.; Meng, F.; Xu, L.; Ngan, K.N. HeadNet: An end-to-end adaptive relational network for head detection. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 482–494.
- Ranjan, R.; Patel, V.M.; Chellappa, R. A deep pyramid deformable part model for face detection. In Proceedings of the 2015 IEEE 7th International Conference on Biometrics Theory, Applications and Systems (BTAS), Arlington, VA, USA, 8–11 September 2015; pp. 1–8.
- Bajestani, M.F.; Yang, Y. TKD: Temporal knowledge distillation for active perception. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Village, CO, USA, 1–5 March 2020; pp. 953–962.
- Si, C.; Chen, W.; Wang, W.; Wang, L.; Tan, T. An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1227–1236.
- Sun, Z.; Peng, D.; Cai, Z.; Chen, Z.; Jin, L. Scale mapping and dynamic re-detecting in dense head detection. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 1902–1906.
- Peng, D.; Sun, Z.; Chen, Z.; Cai, Z.; Xie, L.; Jin, L. Detecting heads using feature refine net and cascaded multi-scale architecture. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 2528–2533.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).