Head Detection Based on DR Feature Extraction Network and Mixed Dilated Convolution Module

Abstract: Pedestrian detection in complex scenes suffers from occlusion, such as occlusion between pedestrians. It is well known that, compared with the variability of the human body, the shape of the human head and shoulders changes little and is highly stable. Therefore, head detection is an important research area in the field of pedestrian detection. The translational invariance of deep convolutional neural networks means that a target can still be recognized effectively even if its appearance and location change. However, sensitivity to scale and high missed detection rates for small targets remain problems. In this paper, a feature extraction network, DR-Net, based on Darknet-53 is proposed to improve the information transmission rate between convolutional layers and to extract more semantic information. In addition, an MDC (mixed dilated convolution) module, which combines dilated convolutions with different sampling rates, is embedded to improve the detection rate of small targets. We evaluated our method on three publicly available datasets and achieved excellent results. The AP (average precision) values on the Brainwash, HollywoodHeads, and SCUT-HEAD datasets reached 92.1%, 84.8%, and 90%, respectively.


Introduction
Pedestrian detection is an important research area in computer vision and extends to many related applications, including person recognition [1], pedestrian tracking [2], and autonomous driving [3,4]. Generic pedestrian detection performs poorly in crowded and heavily occluded scenes, which has motivated head detection methods; crowd counting in surveillance video [5,6] is considered one of the most important applications of head detection. Object detection has made great progress with the development of CNNs and deep learning, but head detection remains a daunting task in complex crowd scenes [7], where the objects exhibit high diversity, strong occlusion, motion blur, low resolution, and few distinctive features.
In recent years, many head detectors based on deep learning have emerged [8,9], and most approaches now treat head detection as a special type of object detection. Since a head's scale and appearance vary, extracting features effectively to localize the head and to separate it from the background is still a challenge. One of the better performers is the RCNN [10] head detector [9], which uses two models for head detection. The global model provides a multi-scale heat map of the probability of head presence, and the local model uses selective search (SS) to limit the set of target hypotheses; the RCNN head detection framework combines the cues from both models. Hariharan et al. [11] used hypercolumns, the stacked activations of all network units corresponding to a pixel, as features for fine-grained localization of targets. SSD [12] and YOLO [13] use multi-scale features to detect objects and to predict class probabilities and bounding box coordinates; YOLO uses a residual network (ResNet) [14] for feature extraction and a spatial feature pyramid for contextual feature fusion. Both detectors are much faster than Faster RCNN [15]. However, their accuracy on small objects is still poor.
In general, head detection in complex scenes still faces many challenges because the scale, appearance, and pose of human heads vary widely. Therefore, we propose an improved feature extraction network, DR-Net, based on Darknet-53 [13]. Its principle is to embed DenseNet [16] in Darknet-53: the residual modules add the values of subsequent layers through identity mappings, while the embedded DenseNet connects all layers and merges their channels so that features can be reused. Compared with ResNet alone, the backpropagation of gradients is improved, the feature information is better utilized, and the transmission rate of information between layers increases.

Image Feature Extraction Network
For object detection, the extraction of image features is crucial for model training. Feature extraction has evolved from the earliest hand-crafted descriptors, such as the scale-invariant feature transform (SIFT) [17] and the histogram of oriented gradients (HOG), to convolutional neural networks. In 1998, LeCun [18] proposed LeNet-5, an epoch-making and far-reaching convolutional neural network structure that combined convolution with the neural network and introduced the two concepts of convolution and pooling. Later, with the development of deep learning theory, Alex Krizhevsky proposed AlexNet [19], an eight-layer deep convolutional neural network that substantially improved classification accuracy on the ImageNet dataset. A key breakthrough was that the network was trained on GPUs, which made the training time much shorter. Since AlexNet, researchers have proposed various convolutional neural network models with increasingly better performance and different network structures. The most famous are VGGNet [20], GoogLeNet [21], and ResNet. Darknet-53 is a fully convolutional network that borrows the concept of ResNet, using a large number of residual modules with skip connections and using strided convolutions instead of pooling for downsampling, which reduces the negative effect of pooling on gradients.
DenseNet is a dense network that builds on the earlier idea of shortcut-connected networks. DenseNet connects all layers, i.e., the outputs of all preceding layers are merged to form the current layer's input, which significantly improves the flow of information through the network. In this paper, we propose a method that combines DenseNet and ResNet to create a novel feature extraction network capable of increasing the depth of the network model and of improving information flow.

Head Detection
Head detection is one of the important research directions in computer vision, and most related works treat it as a specific type of object detection. Traditional head detection methods use complex hand-designed feature operators to extract features and then use Support Vector Machine (SVM) [22], Adaptive Boosting (AdaBoost) [23], Deformable Part Model (DPM) [24], and other algorithms to classify the extracted features. The DPM proposed in [24] utilizes the HOG [25] feature and has been extensively used in the field of object detection. However, the traditional methods have poor portability and unsatisfactory performance and often fail to meet detection requirements in realistic scenarios.
In recent years, convolutional neural networks have made great progress in object detection, and deep neural networks have become the preferred method for object detection tasks. Current mainstream CNN object detection methods can be divided into two categories: single-stage and two-stage frameworks. Single-stage frameworks such as YOLO and SSD use regression to generate bounding boxes and classify them by scoring each box against the object categories. Two-stage frameworks include RCNN and its improved versions Fast RCNN [26], Faster RCNN, and SPPNet [27]. RCNN uses a region proposal method to create regions of interest (ROI) for object detection; the proposals generated by the selective search (SS) [28] method are then sent to the feedforward network. Shaoqing Ren et al. continued to innovate on Fast RCNN and proposed the Faster RCNN model, which removes the SS algorithm for extracting candidate regions and introduces a region proposal network (RPN) instead. YOLO outperforms Faster RCNN in detection speed, but at the cost of accuracy.
Although the abovementioned models have achieved significant performance in classifying multiple objects in images, they face challenges in identifying tiny targets since most of them use the features of the last convolutional layer for object detection, and that layer contains insufficient information about tiny targets. As the targets in the head detection problem are usually small (10 to 20 pixels), these methods in their current form are not suitable for detecting them.

DR-Net: ResNet Combined with DenseNet for Feature Extraction Network
The image feature extraction network of YOLOv3 is primarily based on the concept of the residual network. Darknet-53 uses multiple residual modules to form a ResNet containing a large number of parameters, which takes up most of YOLOv3's computing resources. Using ResNet for image feature extraction is common; however, the deep network it builds is computationally intensive, which reduces computational speed. ResNet adds the values of subsequent layers through identity mappings, whereas DenseNet connects all layers and merges their channels to reuse features. Compared to ResNet, DenseNet improves the efficiency of information and gradient transmission through the network. Each layer obtains a gradient directly from the loss function and receives an input signal directly, so that a deeper network can be trained. Moreover, this connectivity also has a regularizing effect. Other networks aim to improve performance through depth and width, while DenseNet improves performance through feature reuse. The structure of DenseNet is shown in Figure 1. An obvious difference between DenseNet and ResNet is that ResNet sums the shortcut with the layer output while DenseNet concatenates them. Each layer in DenseNet receives the output of all preceding layers as its input. In Figure 1, $x_1$, $x_2$, $x_3$, and $x_4$ represent the output feature maps while $H_1$, $H_2$, $H_3$, and $H_4$ are nonlinear transformations. A conventional L-layer convolutional network has L connections, while an L-layer DenseNet contains L(L + 1)/2 connections, because each layer is linked to all other layers. Thus, each layer receives all feature maps of the previous L − 1 layers. The relationship between the feature maps of the DenseNet layers is shown in Equation (1).
$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}]) \quad (1)$$
where $[x_0, x_1, \ldots, x_{l-1}]$ denotes the concatenation of the feature maps produced by layers 0 to l − 1.
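The difference between summing shortcuts and concatenating them can be made concrete with a small bookkeeping sketch. It is only an illustration: the input width (64) and the number of feature maps added per layer (32) are hypothetical values, not taken from DR-Net.

```python
from typing import List


def dense_connection_count(num_layers: int) -> int:
    """Direct connections in a densely connected block of num_layers layers."""
    return num_layers * (num_layers + 1) // 2


def input_widths_concat(in_channels: int, growth: int, num_layers: int) -> List[int]:
    """Channels entering each layer when earlier outputs are concatenated."""
    return [in_channels + i * growth for i in range(num_layers)]


def input_widths_add(in_channels: int, num_layers: int) -> List[int]:
    """Channels entering each layer when shortcuts are summed element-wise."""
    return [in_channels] * num_layers


# Hypothetical sizes chosen only for illustration.
print(dense_connection_count(4))       # 10
print(input_widths_concat(64, 32, 4))  # [64, 96, 128, 160]
print(input_widths_add(64, 4))         # [64, 64, 64, 64]
```

With concatenation, the input of each layer grows with depth, which is why a compression step becomes necessary; with addition, the width stays fixed.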
The DR-Net network proposed in this paper is an image feature extraction network based on the combination of DenseNet and ResNet. The DBL (Conv2d-BN-ReLU) module is composed of convolution, batch normalization, and Leaky ReLU. Two DBL modules constitute the Double-DBL (D-DBL) module, which is used as the transport layer $H_i$. The detailed composition of the transport layer is shown in Table 1.
Table 1. The internal channel information from the transport layer (columns: D-DBL of DenseBlock1, D-DBL of DenseBlock2).
We use 1 × 1 convolution to compress the number of channels and 3 × 3 convolution to expand the number of channels. However, the DenseBlocks without clipping have (6, 12, 24, and 32) convolutional layers. Such a multi-layer DenseNet produces redundant feature maps and slows down detection. Considering the combination with the residual blocks, we set up DenseBlocks with a four-layer structure. Figure 2 shows the structures of DenseBlock1 and DenseBlock2, where each layer adds 128 feature maps in DenseBlock1 and 256 feature maps in DenseBlock2. After all channels have been concatenated, a 1 × 1 convolution is used for dimensionality reduction, decreasing the number of feature maps passed to the subsequent layers and reducing computational cost. The specific structure of DR-Net is shown in Figure 3. The proposed DR-Net reduces the reliance of the network on residual modules and obtains more semantic information. To better show the internal information of DR-Net, Table 2 records how the parameters change as the original image passes through feature extraction. Some randomly selected feature maps are shown in Figure 4. Deep feature maps have a lower resolution and more abstract semantic information, which makes them difficult for humans to compare and analyze. Therefore, shallow feature maps, which retain more of the original features, are used for visualization and analysis. The left side is the original input image, the middle shows the shallow feature maps extracted by our DR-Net, and the right side shows the shallow feature maps extracted by Darknet-53. We used pseudo-color to visualize the feature maps; the brighter the color, the more salient the features in that region.
The visualization shows that the feature maps extracted by DR-Net are brighter in the classroom area where the students are located, while those extracted by Darknet-53 are somewhat darker, which suggests that DR-Net captures more detailed information.
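To illustrate why a 1 × 1 compression before the 3 × 3 convolution keeps a four-layer DenseBlock affordable, the following sketch counts convolution weights. The growth of 128 feature maps per layer follows the description of DenseBlock1; the input width (256) and the compressed width (64) are hypothetical placeholders rather than the values of Table 1, and biases and batch normalization parameters are ignored.

```python
from typing import List


def conv_weights(c_in: int, c_out: int, kernel: int) -> int:
    """Weight count of a kernel x kernel convolution (bias and BN ignored)."""
    return kernel * kernel * c_in * c_out


def dense_block_widths(c_in: int, growth: int, num_layers: int = 4) -> List[int]:
    """Channel width after each concatenation in a dense block."""
    widths = [c_in]
    for _ in range(num_layers):
        widths.append(widths[-1] + growth)
    return widths


def d_dbl_weights(c_in: int, c_mid: int, growth: int) -> int:
    """1x1 compression to c_mid followed by 3x3 expansion to growth channels."""
    return conv_weights(c_in, c_mid, 1) + conv_weights(c_mid, growth, 3)


# Hypothetical DenseBlock1-like setting: 256 input channels, growth of 128.
widths = dense_block_widths(256, 128)
print(widths)  # [256, 384, 512, 640, 768]

# Fourth layer of the block: compressed D-DBL versus a plain 3x3 convolution.
print(d_dbl_weights(640, 64, 128))  # 114688
print(conv_weights(640, 128, 3))    # 737280
```

Under these assumed sizes, the compressed layer needs roughly one sixth of the weights of a plain 3 × 3 convolution, and the concatenated output (768 channels here) is what the final 1 × 1 convolution then reduces.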

Mixed Dilated Convolution
Although the human head and shoulder region has minimal shape variation and high stability, the head is a small target compared with other objects in an image and occupies a small proportion of the pixels. Feature Pyramid Networks (FPN) [29] mainly address a model's difficulty in handling multi-scale variation in object detection, and they do so with a very small increase in computation. In addition, dilated convolution [30] can extract feature maps carrying different semantic information depending on the dilation rate. After feature fusion in the FPN part of YOLOv3, mixed dilated convolution is used to further fuse context information and to expand the receptive field of the feature maps for better detection of small targets. From [31], it is known that stacking dilation rates causes a gridding effect, which leads to some missing pixels and a loss of information continuity. To avoid it, the stacked dilation rates need to satisfy Equation (2).
$$M_i = \max\left[M_{i+1} - 2r_i,\; M_{i+1} - 2(M_{i+1} - r_i),\; r_i\right] \quad (2)$$
where $r_i$ is the dilation rate of layer i and $M_i$ is the maximum dilation rate of layer i.
Assuming that there are n layers in total, with $M_n = r_n$, a simple example that satisfies the condition is r = 1, 2, 4. Therefore, in this section, we propose an MDC structure composed of dilation rates r = 1, 2, 4. The MDC structure shown in Figure 5 is the module after the first FPN fusion.
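The condition on the dilation rates can be checked numerically. The sketch below assumes the recurrence given in [31], $M_i = \max[M_{i+1} - 2r_i, M_{i+1} - 2(M_{i+1} - r_i), r_i]$ with $M_n = r_n$, together with that reference's design goal $M_2 \le K$; the kernel size K = 3 is an assumption, since it is not stated here.

```python
from typing import List, Sequence


def max_gaps(rates: Sequence[int]) -> List[int]:
    """Apply M_i = max(M_{i+1} - 2 r_i, M_{i+1} - 2 (M_{i+1} - r_i), r_i), M_n = r_n."""
    gaps = [0] * len(rates)
    gaps[-1] = rates[-1]
    for i in range(len(rates) - 2, -1, -1):
        nxt, r = gaps[i + 1], rates[i]
        gaps[i] = max(nxt - 2 * r, nxt - 2 * (nxt - r), r)
    return gaps


def avoids_gridding(rates: Sequence[int], kernel_size: int = 3) -> bool:
    """Design goal from [31]: M_2 (index 1 here) must not exceed the kernel size."""
    if len(rates) < 2:
        return True
    return max_gaps(rates)[1] <= kernel_size


print(max_gaps([1, 2, 4]), avoids_gridding([1, 2, 4]))  # [1, 2, 4] True
print(max_gaps([1, 2, 9]), avoids_gridding([1, 2, 9]))  # [3, 5, 9] False
```

The rates r = 1, 2, 4 pass the check, whereas a set such as r = 1, 2, 9 leaves gaps that a 3 × 3 kernel cannot cover.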
In this MDC module, we use dilated convolutions with three different dilation rates in parallel, and the dilation rate determines the size of the corresponding receptive field. First, we compress the number of channels of the fused feature map $C \in \mathbb{R}^{H \times W \times 768}$ with a 1 × 1 convolution and obtain the feature map $C_1 \in \mathbb{R}^{H \times W \times 256}$. Dilated convolutions with different dilation rates (r = 1, 2, 4) are then applied to $C_1$ to obtain $C_2 \in \mathbb{R}^{H \times W \times 256}$, $C_3 \in \mathbb{R}^{H \times W \times 256}$, and $C_4 \in \mathbb{R}^{H \times W \times 256}$. Finally, we concatenate $C_1$, $C_2$, $C_3$, and $C_4$ to obtain the feature map $M \in \mathbb{R}^{H \times W \times 1024}$, $M = [C_1, C_2, C_3, C_4]$. The model thus receives information from several receptive fields, which is fused to extract richer semantic information, in particular feature information for small targets. We integrate two MDC modules in front of the YOLO head. Once the connections of the spatial feature pyramid are complete, the four context-aware features are fused by the MDC module and then sent to the YOLO head for detection. Our proposed MDC-based spatial feature pyramid structure is shown in Figure 6. Extracting the image features through DR-Net yields feature maps with richer semantic information, but the target location in these maps is not accurate. Fusing the deep semantic information with shallow layers through upsampling therefore provides both detailed information and rich semantic information. The MDC module then samples and fuses the feature maps with dilated convolutions of different dilation rates to expand the receptive field and to exploit more fine-grained feature information.
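As a rough check of the shapes described above, the sketch below computes the effective kernel size of each dilated branch and the channel count after concatenation. It assumes 3 × 3 kernels at stride 1 with padding chosen to preserve H × W; the kernel size and padding are assumptions, and only the rates (1, 2, 4) and the 256-channel branches come from the text.

```python
from typing import Sequence


def effective_kernel(kernel_size: int, rate: int) -> int:
    """Spatial extent covered by a dilated kernel."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def same_padding(kernel_size: int, rate: int) -> int:
    """Padding that keeps H x W unchanged at stride 1 (odd kernel sizes)."""
    return rate * (kernel_size - 1) // 2


def mdc_output_channels(branch_channels: int, rates: Sequence[int]) -> int:
    """C1 plus one dilated branch per rate, concatenated along channels."""
    return branch_channels * (1 + len(rates))


for rate in (1, 2, 4):
    # rate, effective kernel size, padding: (1, 3, 1), (2, 5, 2), (4, 9, 4)
    print(rate, effective_kernel(3, rate), same_padding(3, rate))

print(mdc_output_channels(256, (1, 2, 4)))  # 1024
```

Under these assumptions the three branches see 3 × 3, 5 × 5, and 9 × 9 neighbourhoods of the same feature map, and the concatenation of the four 256-channel maps gives the 1024 channels of M.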

K-Means for Anchor Boxes
YOLOv3 adopts the anchor box idea of Faster RCNN; anchor boxes are a set of initial candidate boxes with fixed widths and heights. The selection of the initial anchor boxes directly affects the accuracy and the detection speed. Instead of selecting anchor boxes manually, YOLOv3 runs K-means clustering on the dataset to find prior boxes automatically. K-means is the most commonly used clustering algorithm. The main idea is as follows: given the value of K and K initial cluster centroids, each point is allocated to the nearest cluster centroid. After all points are allocated, the cluster centroids are recalculated from all points within the same cluster. The steps of assigning points and updating cluster centroids are then repeated until the change in cluster centroids is minimal or the specified number of iterations is reached. The clusters generated by K-means reflect the sample distribution of each dataset, making good predictions easier for the network. Since the target we detect is the human head, which is small relative to other objects or the background in the picture, the original anchor boxes of YOLOv3 do not suit the current scene, so we need to re-cluster the anchor boxes.
In this paper, we use an IOU-based measure instead of the Euclidean distance to generate anchor boxes, so that similar ground-truth boxes have higher IOU values. The anchor boxes are calculated using Equation (3).
$$d(\text{box1}, \text{box\_cluster}) = 1 - \mathrm{IOU}(\text{box1}, \text{box\_cluster}) \quad (3)$$
We calculate the IOU of two boxes, i.e., the similarity of the two boxes. The smaller d is, the more similar box1 is to box_cluster. We assign box1 to box_cluster, then assign the next box, and update box_cluster until the change in the cluster centroids is slight or the specified number of iterations is reached. IOU is the intersection over union, defined in Equation (4).
$$\mathrm{IOU} = \frac{\text{area}_{\text{overlap}}}{\text{area}_{\text{union}}} \quad (4)$$
where $\text{area}_{\text{overlap}}$ is the area of overlap between the bounding box and the ground truth, and $\text{area}_{\text{union}}$ is the area of their union. Algorithm 1 shows the pseudocode of the K-means algorithm used in this paper. We applied K-means clustering to the Brainwash, HollywoodHeads, and SCUT-HEAD datasets. The average IOU obtained for various values of k is shown in Figure 7; as k increases, the objective function changes less and less. Since there are three detection layers in our method, we chose nine anchor boxes. Table 3 shows the widths and heights of the corresponding clusters for the Brainwash, HollywoodHeads, and SCUT-HEAD datasets.
Table 3. The corresponding clusters on the Brainwash, HollywoodHeads, and SCUT-HEAD datasets.
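The clustering procedure described in this section can be sketched as follows. This is a minimal illustration rather than Algorithm 1 itself: it assumes that boxes are compared by width and height only (as if they shared a corner), that d = 1 − IOU is the distance, that centroids are updated with the mean of their members, and that initial centroids are sampled with a fixed seed. The function names and the toy boxes are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Box = Tuple[float, float]  # (width, height), both assumed positive


def iou_wh(a: Box, b: Box) -> float:
    """IOU of two boxes aligned at a common corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union


def kmeans_anchors(boxes: Sequence[Box], k: int,
                   iterations: int = 100, seed: int = 0) -> List[Box]:
    """Cluster (w, h) pairs with d = 1 - IOU; returns k anchors sorted by area."""
    rng = random.Random(seed)
    centroids = rng.sample(list(boxes), k)
    for _ in range(iterations):
        clusters: List[List[Box]] = [[] for _ in range(k)]
        for box in boxes:
            distances = [1.0 - iou_wh(box, c) for c in centroids]
            clusters[distances.index(min(distances))].append(box)
        updated = []
        for old, members in zip(centroids, clusters):
            if members:
                w = sum(m[0] for m in members) / len(members)
                h = sum(m[1] for m in members) / len(members)
                updated.append((w, h))
            else:
                updated.append(old)  # keep an empty cluster's centroid
        if updated == centroids:  # no change: converged
            break
        centroids = updated
    return sorted(centroids, key=lambda c: c[0] * c[1])


def average_iou(boxes: Sequence[Box], anchors: Sequence[Box]) -> float:
    """Mean IOU between each box and its best-matching anchor."""
    return sum(max(iou_wh(b, a) for a in anchors) for b in boxes) / len(boxes)


# Toy (width, height) pairs in pixels, invented for illustration.
toy_boxes = [(10.0, 10.0), (11.0, 11.0), (12.0, 10.0),
             (50.0, 52.0), (48.0, 50.0), (52.0, 49.0)]
anchors = kmeans_anchors(toy_boxes, k=2)
print(anchors, average_iou(toy_boxes, anchors))
```

Running `average_iou` for increasing k on a real set of annotated box sizes gives the kind of curve reported in Figure 7.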

Datasets
The HollywoodHeads dataset was proposed by Tuan-Hung et al. [9]. The dataset was collected from 21 Hollywood movies and contains 224,740 images with a total of 369,846 head annotations; each image contains relatively few heads. The HollywoodHeads training set consists of 216,719 image frames from 15 movies, the validation set consists of 6719 image frames from three movies, and the test set consists of 1302 image frames from three other movies. The same dataset partitioning was used when evaluating our proposed method.
The Brainwash dataset [8] is a large collection of images of crowded scenes captured by public cameras. It consists of 11,917 images obtained from coffee shop surveillance videos at a fixed interval of 100 s, with a high average number of heads per image (7.89) and 91,146 annotated heads in total. There are 10,917 images with 82,906 annotations in the training set, 500 images with 3318 annotations in the validation set, and 500 images with 4922 annotations in the test set. The Brainwash dataset contains a larger number of human heads per image than the HollywoodHeads dataset.
The SCUT-HEAD dataset [32] is a large-scale dataset for head detection containing 4405 images with 111,251 labeled heads. The average number of heads per image is very high (25.2), which makes this dataset very challenging. Part A includes 2000 images with 67,321 head annotations, sampled from classroom surveillance videos. Part B includes 2405 images with 43,930 head annotations, obtained by web crawling. Both parts of the SCUT-HEAD dataset are divided into training and test sets. Table 4 shows a detailed comparison of the three datasets.

Evaluation Metrics
For a fair comparison, our experimental results were evaluated using average precision (AP). We plotted precision-recall (PR) curves at an IOU threshold of 0.5, and the area under the curve gives the AP value. A head is considered correctly detected when the IOU between the predicted box and the ground truth is ≥0.5 [33]. The values of precision, recall, and F1-score (applied to the SCUT-HEAD dataset) are obtained under this criterion, where precision is the proportion of predicted heads that are correct and recall is the proportion of ground-truth heads in the test set that are detected. In general, to evaluate our approach, we used three standard metrics: precision (P), recall (R), and AP at this IOU threshold. All experiments were performed on an NVIDIA Tesla V100 GPU with 32 GB of memory. The formulas to calculate each evaluation index are as follows:
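A minimal sketch of how these metrics are commonly computed is given below. It assumes the standard definitions, precision = TP/(TP + FP), recall = TP/(TP + FN), and F1 = 2PR/(P + R), where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, and it approximates AP as the non-interpolated area under the PR curve. The exact AP variant and the function names are assumptions.

```python
from typing import Sequence, Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from detection counts at a fixed IOU threshold."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def average_precision(detections: Sequence[Tuple[float, bool]],
                      num_ground_truth: int) -> float:
    """Area under the PR curve.

    detections holds (confidence, is_true_positive) pairs, where a detection
    counts as a true positive if it matches an unmatched ground-truth head
    with IOU >= 0.5.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    ap = 0.0
    previous_recall = 0.0
    for _, is_hit in ranked:
        if is_hit:
            tp += 1
        else:
            fp += 1
        recall = tp / num_ground_truth
        precision = tp / (tp + fp)
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


# Toy counts and detections, invented for illustration.
print(precision_recall_f1(tp=8, fp=2, fn=2))
print(average_precision([(0.9, True), (0.8, False), (0.7, True)], 2))
```

Benchmark toolkits often use an interpolated precision envelope instead of the plain sum shown here, which can change the reported AP slightly.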

Ablation Study
As mentioned above, we propose the DR-Net feature extraction network, based on the combination of ResNet and DenseNet, to achieve feature reuse, to reduce the number of parameters, and to improve the rate of information transmission between layers. Furthermore, to extract more information about small targets and to improve the accuracy of head detection, we introduce MDC feature fusion into the spatial feature pyramid, in front of the YOLO head. Compared with Darknet-53, our method improved AP by 0.086, 0.107, and 0.118 on the HollywoodHeads, Brainwash, and SCUT-HEAD test sets, respectively.
Moreover, we considered using additional object detection scales. YOLOv3 has three object detection layers, which correspond to large, medium, and small objects. For the default input size of 416 × 416, the detection layer sensitive to small objects is 52 × 52, and we added a fourth, larger detection layer of 104 × 104 to detect heads. In this experiment, the AP values improved by 0.003, 0.004, and 0.01, respectively. However, because the additional detection layer increases computation, the FPS decreased from 15 (DR-Net+MDC) to 9. Weighing speed against accuracy, we finally adopted the DR-Net+MDC structure. A detailed comparison of our tests with the baseline Darknet-53 is shown in Figure 8 and Table 5, which demonstrates the validity of the method.
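The cost of the fourth detection layer can be estimated with simple arithmetic. The sketch below assumes the usual YOLOv3 strides of 32, 16, and 8, a stride of 4 for the 104 × 104 layer, and three anchor boxes per grid cell on every layer; the anchor count of the fourth layer is an assumption.

```python
from typing import List, Sequence


def detection_grids(input_size: int, strides: Sequence[int]) -> List[int]:
    """Side length of each detection grid for a square input."""
    return [input_size // s for s in strides]


def total_predictions(grids: Sequence[int], anchors_per_cell: int = 3) -> int:
    """Number of candidate boxes predicted over all detection layers."""
    return sum(g * g * anchors_per_cell for g in grids)


three_layers = detection_grids(416, (32, 16, 8))
four_layers = detection_grids(416, (32, 16, 8, 4))
print(three_layers, total_predictions(three_layers))  # [13, 26, 52] 10647
print(four_layers, total_predictions(four_layers))    # [13, 26, 52, 104] 43095
```

Under these assumptions the extra layer roughly quadruples the number of candidate boxes, which is consistent with the lower frame rate observed for the four-layer variant.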

Results on the Brainwash Dataset
Following the ablation studies above, we conducted experiments with DR-Net+MDC as the backbone. The Brainwash dataset represents real scenes: all images are taken from a coffee shop surveillance camera, and the heads are densely distributed. We compare the proposed method with several representative detectors, such as FCHD [34], E2PD [8], SSD [12], HeadNet [35], and FRCN [15]. Our method exceeds the other baselines and improves the AP on the Brainwash test set by 1.1 percentage points over the current best method, HeadNet, at an IOU threshold of 0.5. The comparison of different methods is shown in Table 6, and the PR curves are shown in Figure 9.
Table 6. Results of our method compared with other methods on the Brainwash dataset.

Results on the HollywoodHeads Dataset
The HollywoodHeads dataset was tested with five different methods. First, we used the general CNN-based object detection methods FRCN and SSD; then, considering that head detection is similar to face detection, we selected a CNN-based face detector, DPM-Face [36]. Second, we compared our model with TKD [37], a novel method using LSTM [38], and the experimental results show that our model is superior to TKD. Moreover, two of the latest head detectors, FCHD and HeadNet, were selected for testing and comparison. Since DPM-Face uses a multi-task cascaded convolutional network for face detection and alignment, and this dataset contains a large number of side faces and backs of heads, its detection rate is low. Our method achieves an AP of 84.8% on the HollywoodHeads dataset, which is better than the other methods (Table 7 and Figure 10).
Table 7.
Method | Backbone | AP
FRCN [15] | VGG16+ResNet | 0.698
FCHD [34] | VGG16 | 0.74
TKD [37] | LSTM | 0.75
HeadNet [35] | ResNet-101 | 0.83
Proposed | DR-Net+MDC | 0.848
Figure 10. The PR curves of our method versus other methods from the experiments on the HollywoodHeads dataset.

Results on the SCUT-HEAD Dataset
We evaluated our model using precision, recall, and F1-score on the SCUT-HEAD dataset. The equation for F1-score is shown in Equation (8). We conducted comparison experiments with FRCN, YOLOv3, SSD, SMD [39], and R-FCN+FRN [40]. Our precision is slightly lower than that of SMD, probably because we kept YOLOv3's three-scale detection, whereas SMD uses a scale map to estimate the size of each head in the scene and separates the heads into candidate regions, focusing on head size and precision. In contrast, our technique performs one-shot regression and prediction without candidate regions, which slightly lowers precision. The detailed comparison is shown in Table 8.

Qualitative Results on Three Datasets
We display some visualization results of our method and other methods on the test sets of the three public datasets. As shown in Figures 11-13, the yellow bounding boxes indicate the results predicted by YOLOv3, FRCN, and our proposed method; the green bounding boxes are the ground-truth labels of the test images; the red boxes indicate missed or false detections of heads; and the blue boxes indicate heads that were not labeled in the dataset due to annotation errors but were detected by our method. The comparison shows that our method has lower missed detection and false detection rates than the other methods, indicating that the proposed DR-Net+MDC detects small targets better.

Conclusions
In this paper, we proposed a feature extraction network, DR-Net, based on the combination of DenseNet and ResNet, which reduces the number of parameters and improves the information transmission rate between layers to obtain more semantic information. At present, most neural networks still miss small objects. We therefore proposed embedding the MDC structure after the FPN, fusing dilated convolutions with different dilation rates to improve the sensitivity of head detection. Extensive experiments were conducted on three challenging datasets, and the results showed that the proposed method significantly improves the accuracy of head detection.
Author Contributions: This paper proposes an image feature extraction network based on DR-Net and an MDC module that uses fine-grained feature information to improve the detection accuracy of small targets. Conceptualization and methodology, J.L. and Y.Z.; formal analysis and writing (original draft preparation), J.L.; dataset collection and analysis, Z.W. and M.N.; supervision and writing (review and editing), Y.Z., Y.W., and J.X. All authors have read and agreed to the published version of the manuscript.
Data Availability Statement: The HollywoodHeads dataset: https://www.di.ens.fr/willow/research/headdetection/; the SCUT-HEAD dataset: https://github.com/wanjinchang/SCUT-HEAD-Dataset-Release; and the Brainwash dataset: http://datasets.d2.mpi-inf.mpg.de/brainwash/brainwash.tar.

Conflicts of Interest:
The authors declare no conflicts of interest.