Pedestrian Detection Using Integrated Aggregate Channel Features and Multitask Cascaded Convolutional Neural-Network-Based Face Detectors

Pedestrian detection is a challenging task, mainly owing to the numerous appearances of human bodies. Modern detectors extract representative features via the deep neural network; however, they usually require a large training set and high-performance GPUs. For these cases, we propose a novel human detection approach that integrates a pretrained face detector based on multitask cascaded convolutional neural networks and a traditional pedestrian detector based on aggregate channel features via a score combination module. The proposed detector is a promising approach that can be used to handle pedestrian detection with limited datasets and computational resources. The proposed detector is investigated comprehensively in terms of parameter choices to optimize its performance. The robustness of the proposed detector in terms of the training set, test set, and threshold is observed via tests and cross dataset validations on various pedestrian datasets, including the INRIA, part of the ETHZ, and the Caltech and Citypersons datasets. Experiments have proved that this integrated detector yields a significant increase in recall and a decrease in the log average miss rate compared with sole use of the traditional pedestrian detector. At the same time, the proposed method achieves a comparable performance to FRCNN on the INRIA test set compared with sole use of the Aggregated Channel Features detector.


Introduction
Pedestrian detection, a fundamental task in computer vision, can assist in self-driving, the monitoring of crowded environments, and other activities by automatically detecting and localizing human bodies from images. This type of task is challenging, mainly because human bodies are depicted in numerous ways, hindering the overall description of their features. Firstly, human bodies distinguish themselves from each other in appearance, for example, the color of the skin, clothes, hairstyle, body structure, and pose. Secondly, human bodies often suffer from occlusion in crowded environments, such as train stations, shopping malls, and others. Occlusion reduces the area of a body exposed to smaller irregular-shaped areas. Lastly, images are sensitive to the image acquisition setup, such as the varying illumination conditions, view angles, and resolution. For instance, poor illumination or overexposure result in low-contrast images, which blur the human bodies. Furthermore, the human body shape appears different from various viewing angles, for example, from frontal and profile views. Finally, low-resolution human bodies cannot be easily identified, even by the human eye. To handle the diversity of human bodies, it is imperative to describe their appearances effectively using a robust and rich feature set that allows human bodies to be discriminated effectively from the background.
Feature extractors based on deep learning have drawn extensive attention in recent years due to their outstanding performance in extracting high-level semantic information and its large number of features, which has improved the detection performance dramatically. Those methods require training on extremely large datasets to learn representative and robust features; otherwise, the extracted features may be not generalized enough. Because of this, advanced deep learning methods often require massive volumes of annotated data and computational resources such as high-performance computers with GPU facilities. Considering this dilemma, many researchers have focused on the improvement of widely used traditional handcrafted features or the combination of Convolutional Neural Networks (CNNs) with manually extracted features for various computer vision tasks, including pedestrian detection [1][2][3][4][5][6].
One of the earliest popular families of handcrafted features for describing object silhouettes is the histogram of edge orientations, which was initially used for hand gesture recognition [7]. These features are signals that represent occurrences of gradient orientations in localized portions of an image. Another popular contour-based model is the so-called Shape Context [8], designed for measuring shape similarity for the purpose of shape matching. It is defined as a histogram of the relative coordinates of a reference point to a predefined set of neighboring points. This feature is suitable for matching objects whose contours (edges) are corrupted by weak noise or are noiseless and are placed on simple or ideally uniform backgrounds. Considering the complexity of real-life crowded environments, this feature is not recommended for use in human detection. In contrast to those two features, Haar-like features have been used for human detection [9,10] since the early stages. An image is filtered by several Haar wavelets with different predesigned patterns to extract edge features, line features, and center-surround contrast features. The Haar-like feature method is more suitable for objects with simple structures, for example, facial features, which have relatively simple structures. Otherwise, highly textured structures, like grass and trees, which challenge the sufficient representation of predesigned limited Haar wavelets can cause false positives [10]. Another of the most well-known features is the Scale-Invariant Feature Transform (SIFT) model, which consists of the position, scale, and orientations at selected key points, which are "interesting" points of the image signal, for example, contour corners [11]. An advanced version of SIFT is the Principal Components Analysis SIFT (PCA-SIFT) which uses PCA to represent the normalized gradient key point patches, proposed by Ke and Sukthakar in [12]. The authors of this work demonstrated that this method considerably improves both the accuracy and speed compared with the standard SIFT.
In [13], Dalal and Triggs compared the PCA-SIFT, wavelets, and shape context features with the Histogram of Oriented Gradients (HOG) for pedestrian detection and demonstrated that the HOG features greatly outperform the other features on both the MIT and INRIA pedestrian datasets. Owing to [13], HOG has become one of the most widely used features to extract silhouette information for human and other types of object detection. The HOG calculates occurrences (histograms) of gradient orientation in localized portions of an image. Gradient magnitudes serve as weights (votes) to strengthen or weaken the contributions of orientations. This technique differs from the methods mentioned above in that it is computed on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization for improved accuracy. The HOG feature is the concatenation of histograms in overlapping blocks.
To further improve the detection results produced by HOG, some research has focused on enriching the HOG feature sets by combining them with additional cues. The celebrated Aggregated Channel Features (ACF) [14], variants of the Integral Channel Features (ICF) [15], were proposed. The ACF detector begins with the computation of a color channel, such as RGB, LUV or HSV, the magnitude channel, and the HOG channel. These channels are aggregated and vectorized into an enhanced feature vector before being sent to the classifier. As mentioned previously, the ACF detector is characterized by taking both the color feature and the silhouette feature into consideration. Such rich features • A novel pedestrian detector integrating the multitask cascaded CNN and ACF is proposed; • Improved detection performance for significantly occluded pedestrians and beyond is achieved; • Robustness of the proposed detector in terms of datasets and beyond is achieved.
The paper is organized as follows: Section 2 presents the methodology used to construct the integrated detector, including the technical procedures involved and detailed explanations regarding the function and design principles of each module in the proposed detector. In Section 3, we investigate the choices of parameters and test the proposed detector on various datasets. The discussion is presented in Section 4.

The Proposed Detector
Traditionally, the ACF human body detector takes the features of a sliding window b x , b y , b w , b h as the input. The two-dimensional vector (b x , b y ) contains the x and y Cartesian coordinates of the top left corner of the sliding window, which has a width of b w and a height of b h , as denoted in Figure 1. Our integrated detector differs from the ACF human body detector, mainly by introducing a face detector module and score combination, the output of which is then fed into the ACF detector together with the features, as shown in Figure 2.
The paper is organized as follows: Section 2 presents the methodology used to construct the integrated detector, including the technical procedures involved and detailed explanations regarding the function and design principles of each module in the proposed detector. In Section 3, we investigate the choices of parameters and test the proposed detector on various datasets. The discussion is presented in Section 4.

The Proposed Detector
Traditionally, the ACF human body detector takes the features of a sliding window ( , , , ℎ ) as the input. The two-dimensional vector ( , ) contains the x and y Cartesian coordinates of the top left corner of the sliding window, which has a width of and a height of ℎ , as denoted in Figure 1. Our integrated detector differs from the ACF human body detector, mainly by introducing a face detector module and score combination, the output of which is then fed into the ACF detector together with the features, as shown in Figure 2.  The proposed integrated detector starts with feeding an image into a face detector which outputs predicted face bounding boxes ( , , , ℎ ) with centers located at The paper is organized as follows: Section 2 presents the methodology used to construct the integrated detector, including the technical procedures involved and detailed explanations regarding the function and design principles of each module in the proposed detector. In Section 3, we investigate the choices of parameters and test the proposed detector on various datasets. The discussion is presented in Section 4.

The Proposed Detector
Traditionally, the ACF human body detector takes the features of a sliding window ( , , , ℎ ) as the input. The two-dimensional vector ( , ) contains the x and y Cartesian coordinates of the top left corner of the sliding window, which has a width of and a height of ℎ , as denoted in Figure 1. Our integrated detector differs from the ACF human body detector, mainly by introducing a face detector module and score combination, the output of which is then fed into the ACF detector together with the features, as shown in Figure 2.  The proposed integrated detector starts with feeding an image into a face detector which outputs predicted face bounding boxes ( , , , ℎ ) with centers located at The proposed integrated detector starts with feeding an image into a face detector which outputs N predicted face bounding boxes f n x , f n y , f n w , f n h with centers located at ( f n x + f n w /2, f n y + f n h /2) and corresponding face scores of s n , n = 1, 2, . . . N. The vector ( f n x , f n y ) stands for the x and y coordinates of the top left corner of the n-th face bounding box with a width of f n w and a height of f n h , as shown in Figure 1. Considering that the  (1) to keep consistent with the ACF scores as the first step of the score combination. Secondly, face bounding boxes are filtered following two rules. One rule is to eliminate face bounding boxes that are beyond the enlarged sliding window area shown as the dotted box in Figure 1. The height and the width of the enlarged sliding window are increased by 2 × o f f set compared with those of the original sliding window. The other rule is to eliminate relatively large face bounding boxes determined by the sliding window and two predesigned parameters, namely r w and r h . The remaining M face bounding boxes meet the following conditions: In the last step of the score combination, the scores (s m f , m = 1, 2, . . . M) of the remaining face bounding boxes are scaled to s m sc according to where f ss is the face score scale, and d m is the Euclidian distance from the center of the m-th face bounding box to the anchor, which is located at b where d hr is the ratio of the Euclidian distance of the zeroscaled score to b h . Furthermore, f ss , f wr , f hr , and d hr are predesigned parameters, which are illustrated in Figure 1. The maximum scaled score, if it exists, is assigned to the initial overall score s all = max( s m sc ); otherwise, the initial overall score is assigned as 0. The initial overall score is fed into the ACF detector module to produce the final overall score, which is sent to the threshold module. In this module, the final overall score is considered in two cases according to its initial score. The first case is that, if the initial overall score is assigned by the scaled face score, the sliding window with s all > s thr will be the output as a nominal body bounding box. Otherwise, the sliding window with s all > −1 will be the output. Note that −1 is the default threshold of the ACF detector, and s thr is a predesigned parameter that should be larger than −1.

Modules of the Proposed Integrated Detector
In this part, the functionality and design principles of each module shown in Figure 2 are explained in detail.

Sliding Window
This module (Figure 3) outputs the location, size, and ACF features of a cropped area, namely the sliding window, of the image. As explained previously, the location and size are expressed as b x , b y , b w , b h on the original scale. This means that if the sliding window is in a subsampled layer in a feature pyramid, its size and location must be expanded according to the specific subsampling rate. The ACF features consist of a feature vector, which is the concatenation of the LUV color channel, the gradient magnitude channel, and the HOG channel. This module (Figure 3) outputs the location, size, and ACF features of a cropped area, namely the sliding window, of the image. As explained previously, the location and size are expressed as ( , , , ℎ ) on the original scale. This means that if the sliding window is in a subsampled layer in a feature pyramid, its size and location must be expanded according to the specific subsampling rate. The ACF features consist of a feature vector, which is the concatenation of the LUV color channel, the gradient magnitude channel, and the HOG channel.

Face Detector
This module ( Figure 4) takes an image as the input and outputs the detected face bounding boxes jointly with the face scores. Face detection plays a significant role in the overall integrated detection, because the larger the face score, the more likely it is for the corresponding human body to be detected. To correctly identify human bodies, the face detector should have high precision and yield a small number of false positives. We tested the pretrained Viola Jones (VJ) face detector [20], the fast face detector [21], and the MTCNN detector [22] on the INRIA pedestrian test dataset [13]. According to the results shown in the left column of Figure 5, the VJ detector tends to miss faces that are not presented in the frontal view and those that are occluded. False positives appear when the background is relatively complicated or when there are structures that resemble human faces, such as wheels. The fast face detector was designed based on the work of [23]. This new modified version of the detector extends the ACF used in [23] by adding an integral image channel, in which every pixel is the summation of all of the pixels above and to the left of it. As shown in the middle column of Figure 5, some multiview faces are identified at the expense of a dramatically increasing number of false positives. Compared with these two face detectors, the MTCNN detector gives more accurate detection results with lower false positives rates. As shown in the right column of Figure 5, no false positives occur in these four sample images, and only faces that are largely occluded or presented from mainly the back view are missed, which is expected. This level of performance is due to the collaboration of three convolutional neural networks: the Propose-

Face Detector
This module ( Figure 4) takes an image as the input and outputs the detected face bounding boxes jointly with the face scores. Face detection plays a significant role in the overall integrated detection, because the larger the face score, the more likely it is for the corresponding human body to be detected. To correctly identify human bodies, the face detector should have high precision and yield a small number of false positives.
namely the sliding window, of the image. As explained previously, the location and size are expressed as ( , , , ℎ ) on the original scale. This means that if the sliding window is in a subsampled layer in a feature pyramid, its size and location must be expanded according to the specific subsampling rate. The ACF features consist of a feature vector, which is the concatenation of the LUV color channel, the gradient magnitude channel, and the HOG channel.

Face Detector
This module ( Figure 4) takes an image as the input and outputs the detected face bounding boxes jointly with the face scores. Face detection plays a significant role in the overall integrated detection, because the larger the face score, the more likely it is for the corresponding human body to be detected. To correctly identify human bodies, the face detector should have high precision and yield a small number of false positives. We tested the pretrained Viola Jones (VJ) face detector [20], the fast face detector [21], and the MTCNN detector [22] on the INRIA pedestrian test dataset [13]. According to the results shown in the left column of Figure 5, the VJ detector tends to miss faces that are not presented in the frontal view and those that are occluded. False positives appear when the background is relatively complicated or when there are structures that resemble human faces, such as wheels. The fast face detector was designed based on the work of [23]. This new modified version of the detector extends the ACF used in [23] by adding an integral image channel, in which every pixel is the summation of all of the pixels above and to the left of it. As shown in the middle column of Figure 5, some multiview faces are identified at the expense of a dramatically increasing number of false positives. Compared with these two face detectors, the MTCNN detector gives more accurate detection results with lower false positives rates. As shown in the right column of Figure 5, no false positives occur in these four sample images, and only faces that are largely occluded or presented from mainly the back view are missed, which is expected. This level of performance is due to the collaboration of three convolutional neural networks: the Propose- We tested the pretrained Viola Jones (VJ) face detector [20], the fast face detector [21], and the MTCNN detector [22] on the INRIA pedestrian test dataset [13]. According to the results shown in the left column of Figure 5, the VJ detector tends to miss faces that are not presented in the frontal view and those that are occluded. False positives appear when the background is relatively complicated or when there are structures that resemble human faces, such as wheels. The fast face detector was designed based on the work of [23]. This new modified version of the detector extends the ACF used in [23] by adding an integral image channel, in which every pixel is the summation of all of the pixels above and to the left of it. As shown in the middle column of Figure 5, some multiview faces are identified at the expense of a dramatically increasing number of false positives. Compared with these two face detectors, the MTCNN detector gives more accurate detection results with lower false positives rates. As shown in the right column of Figure 5, no false positives occur in these four sample images, and only faces that are largely occluded or presented from mainly the back view are missed, which is expected. This level of performance is due to the collaboration of three convolutional neural networks: the Propose-Network (P-Net), Refine-Network (R-Net), and Output-Network (O-Net). An image is first fed into the fully convolutional neural network P-Net to quickly yield a large number of candidate detections, which are subsequently refined by the R-Net by further correction of the regression vector of the face candidate frame and nonmaximum suppression. The final face regression boxes and facial landmarks (contour key points) are output after correcting and filtering the detections produced earlier with the landmarks, corrections, and probabilities output by the O-Net. The MTCNN detector finally outputs the adjusted face bounding boxes, the facial landmarks, and face scores in the range [0, 1]. Note that more faces are detected by the MTCNN detector without the use of the O-Net, as shown in Figure 6. However, as expected, this structure also results in more false faces, as shown in Figure 7. To achieve the best and most robust face detection, the overall MTCNN detector, therefore, was chosen as our integrated detector.
of the regression vector of the face candidate frame and nonmaximum suppression. The final face regression boxes and facial landmarks (contour key points) are output after correcting and filtering the detections produced earlier with the landmarks, corrections, and probabilities output by the O-Net. The MTCNN detector finally outputs the adjusted face bounding boxes, the facial landmarks, and face scores in the range [0, 1]. Note that more faces are detected by the MTCNN detector without the use of the O-Net, as shown in Figure 6. However, as expected, this structure also results in more false faces, as shown in Figure 7. To achieve the best and most robust face detection, the overall MTCNN detector, therefore, was chosen as our integrated detector.  We observed that these are associated with particular structures, for example, car wheels.

Score Combination
The score combination consists of three modules, namely the face score transformation module, the face bounding box filtering module, and the score scaling and assigning module.

Face Score Transformation
This module takes the face scores generated by the MTCNN detector as inputs and outputs the corresponding weights, which are the transformed face scores according to Equation (1). This is to remain consistent with the weights produced by the cascading decision trees in the ACF detector. After the transformation, the face detector can be regarded as a single decision tree. The transformed face scores are used in the score scaling and assigning module.

Face Bounding Box Filtering
This filtering module ( Figure 8) takes ( , , , ℎ ) and ( , , , ℎ ), i.e., the sizes and locations of the face bounding boxes and the sliding windows respectively as inputs and decides which face bounding boxes should be sent to the score scaling and assigning module by checking whether they are potential faces of this sliding window according to the two rules explained below. We observed that these are associated with particular structures, for example, car wheels.

Score Combination
The score combination consists of three modules, namely the face score transformation module, the face bounding box filtering module, and the score scaling and assigning module.

Face Score Transformation
This module takes the face scores generated by the MTCNN detector as inputs and outputs the corresponding weights, which are the transformed face scores according to Equation (1). This is to remain consistent with the weights produced by the cascading decision trees in the ACF detector. After the transformation, the face detector can be regarded as a single decision tree. The transformed face scores are used in the score scaling and assigning module.

Face Bounding Box Filtering
This filtering module ( Figure 8) takes ( , , , ℎ ) and ( , , , ℎ ), i.e., the sizes and locations of the face bounding boxes and the sliding windows respectively as inputs and decides which face bounding boxes should be sent to the score scaling and assigning module by checking whether they are potential faces of this sliding window according to the two rules explained below. We observe that the MTCNN detector produces fewer false faces (a), although some faces with occluded facial features are missed. In contrast, a large number of false faces appear in the complex backgrounds (b) when the O-Net is removed from the MTCNN detector. We observed that these are associated with particular structures, for example, car wheels.

Score Combination
The score combination consists of three modules, namely the face score transformation module, the face bounding box filtering module, and the score scaling and assigning module.

Face Score Transformation
This module takes the face scores generated by the MTCNN detector as inputs and outputs the corresponding weights, which are the transformed face scores according to Equation (1). This is to remain consistent with the weights produced by the cascading decision trees in the ACF detector. After the transformation, the face detector can be regarded as a single decision tree. The transformed face scores are used in the score scaling and assigning module.

Face Bounding Box Filtering
This filtering module ( Figure 8) takes f n x , f n y , f n w , f n h and b x , b y , b w , b h , i.e., the sizes and locations of the face bounding boxes and the sliding windows respectively as inputs and decides which face bounding boxes should be sent to the score scaling and assigning module by checking whether they are potential faces of this sliding window according to the two rules explained below. The first rule is that the true face must be within the potential body bounding box, so the potential faces should be inside the sliding window. There are also face bounding boxes that are only partly inside the sliding box, for example, face bounding box D in Figure 1. To leave some flexibility for such bounding boxes, the four window edges are enlarged by a predesigned , shown as the dotted box in Figure 1. This rule is mathematically expressed as (2)-(5). Figure 9 shows an example from the INRIA pedestrian dataset. The overall scores of the ACF detector (a) and the integrated detector (b) are both 57.6. This is because the face (d) should have contributed to the overall score but is filtered out, as it is located at the edge of the sliding window (b). By setting the as5 pixels, the true positive (c) has a higher score of 62.55, as the face score is successfully included. The second rule is that the face bounding box should not be too large compared with the sliding window. According to the ground truth of the INRIA pedestrian test dataset, only the bounding boxes that contain most parts of a body where the person's height is larger than 100 pixels are labelled as true positives. Therefore, only the largest bounding boxes shown in Figure 10a-c are true positives, while other smaller bounding boxes are false positives, even though some of them do contain human body parts, such as those shown in Figure 10a. This phenomenon is exacerbated by the incorporation of the face detector in the system, as shown in Figure 10b. This is because the new face scores are large enough to alter the detection results of the ACF detector by introducing false positives when the sliding window is too small to contain sufficient features. To eliminate such false positives, the second rule is adopted by setting the minimum width ratio and height ratio ℎ . The width ratio is the ratio of the width of the sliding window to that of the face bounding box, and the height ratio is the ratio of the height of the sliding window to that of the face bounding box. Face bounding boxes with any ratio larger than or ℎ are filtered out, as expressed in (6)   The first rule is that the true face must be within the potential body bounding box, so the potential faces should be inside the sliding window. There are also face bounding boxes that are only partly inside the sliding box, for example, face bounding box D in Figure 1. To leave some flexibility for such bounding boxes, the four window edges are enlarged by a predesigned o f f set, shown as the dotted box in Figure 1. This rule is mathematically expressed as (2)-(5). Figure 9 shows an example from the INRIA pedestrian dataset. The overall scores of the ACF detector (a) and the integrated detector (b) are both 57.6. This is because the face (d) should have contributed to the overall score but is filtered out, as it is located at the edge of the sliding window (b). By setting the o f f set as 5 pixels, the true positive (c) has a higher score of 62.55, as the face score is successfully included. The first rule is that the true face must be within the potential body bounding box, so the potential faces should be inside the sliding window. There are also face bounding boxes that are only partly inside the sliding box, for example, face bounding box D in Figure 1. To leave some flexibility for such bounding boxes, the four window edges are enlarged by a predesigned , shown as the dotted box in Figure 1. This rule is mathematically expressed as (2)-(5). Figure 9 shows an example from the INRIA pedestrian dataset. The overall scores of the ACF detector (a) and the integrated detector (b) are both 57.6. This is because the face (d) should have contributed to the overall score but is filtered out, as it is located at the edge of the sliding window (b). By setting the as5 pixels, the true positive (c) has a higher score of 62.55, as the face score is successfully included. The second rule is that the face bounding box should not be too large compared with the sliding window. According to the ground truth of the INRIA pedestrian test dataset, only the bounding boxes that contain most parts of a body where the person's height is larger than 100 pixels are labelled as true positives. Therefore, only the largest bounding boxes shown in Figure 10a-c are true positives, while other smaller bounding boxes are false positives, even though some of them do contain human body parts, such as those shown in Figure 10a. This phenomenon is exacerbated by the incorporation of the face detector in the system, as shown in Figure 10b. This is because the new face scores are large enough to alter the detection results of the ACF detector by introducing false positives when the sliding window is too small to contain sufficient features. To eliminate such false positives, the second rule is adopted by setting the minimum width ratio and height ratio ℎ . The width ratio is the ratio of the width of the sliding window to that of the face bounding box, and the height ratio is the ratio of the height of the sliding window to that of the face bounding box. Face bounding boxes with any ratio larger than or ℎ are filtered out, as expressed in (6)   The second rule is that the face bounding box should not be too large compared with the sliding window. According to the ground truth of the INRIA pedestrian test dataset, only the bounding boxes that contain most parts of a body where the person's height is larger than 100 pixels are labelled as true positives. Therefore, only the largest bounding boxes shown in Figure 10a-c are true positives, while other smaller bounding boxes are false positives, even though some of them do contain human body parts, such as those shown in Figure 10a. This phenomenon is exacerbated by the incorporation of the face detector in the system, as shown in Figure 10b. This is because the new face scores are large enough to alter the detection results of the ACF detector by introducing false positives when the sliding window is too small to contain sufficient features. To eliminate such false positives, the second rule is adopted by setting the minimum width ratio r w and height ratio r h . The width ratio is the ratio of the width of the sliding window to that of the face bounding box, and the height ratio is the ratio of the height of the sliding window to that of the face bounding box. Face bounding boxes with any ratio larger than r w or r h are filtered out, as expressed in (6)

Score Scaling and Assigning
This last module (Figure 11) of the score combination takes the filtered face scores and the locations and sizes ( , , , ℎ )asinputs. They are used to compute the scaled score, which is assigned to the initial overall score and fed into the subsequent ACF detection module. Figure 11. Diagram of the score projection and assigning module. Note that the filtered face information is fed into this module. The output initial overall score is assigned to the corresponding input sliding window.
The basic rule of score scaling is that the nearer the face bounding box is to the anchor of a sliding window, the higher the initial overall score is. The anchor is located at the most likely position that a face of an upright human body will appear at. According to this rule, the scaled score is computed by (8), as illustrated in Figure 12 below. The highest scaled score appears at = 0, which means that the face bounding box is located exactly at the anchor, and the sliding window is temporarily considered to be the most likely to contain a human body. The scaled score decreases as the face bounding box moves away from the anchor and reaches 0 when = 0 . The parameter 0 is the zero-scaled score distance, as marked in Figure 1. 0 should be linearly related to the size of the sliding window to adapt to the changes in the size of the area in which the faces are likely to appear, as introduced by the varying sliding window sizes, so we set 0 = ℎ ⋅ ℎ . The scaled score becomes negative when > 0 and reaches its minimum value at the maximum value of before the face bounding box is filtered out. Such negative scores can help to eliminate false human body bounding boxes. According to (8), the scaled scores of the five face bounding boxes shown in Figure 1 are ranked as > > = 0 > > . They are used to compute the scaled score, which is assigned to the initial overall score and fed into the subsequent ACF detection module.

Score Scaling and Assigning
This last module (Figure 11) of the score combination takes the filtered face scores and the locations and sizes ( , , , ℎ )asinputs. They are used to compute the scaled score, which is assigned to the initial overall score and fed into the subsequent ACF detection module. Figure 11. Diagram of the score projection and assigning module. Note that the filtered face information is fed into this module. The output initial overall score is assigned to the corresponding input sliding window.
The basic rule of score scaling is that the nearer the face bounding box is to the anchor of a sliding window, the higher the initial overall score is. The anchor is located at the most likely position that a face of an upright human body will appear at. According to this rule, the scaled score is computed by (8), as illustrated in Figure 12 below. The highest scaled score appears at = 0, which means that the face bounding box is located exactly at the anchor, and the sliding window is temporarily considered to be the most likely to contain a human body. The scaled score decreases as the face bounding box moves away from the anchor and reaches 0 when = 0 . The parameter 0 is the zero-scaled score distance, as marked in Figure 1. 0 should be linearly related to the size of the sliding window to adapt to the changes in the size of the area in which the faces are likely to appear, as introduced by the varying sliding window sizes, so we set 0 = ℎ ⋅ ℎ . The scaled score becomes negative when > 0 and reaches its minimum value at the maximum value of before the face bounding box is filtered out. Such negative scores can help to eliminate false human body bounding boxes. According to (8), the scaled scores of the five face bounding boxes shown in Figure 1 are ranked as > > = 0 > > . Figure 11. Diagram of the score projection and assigning module. Note that the filtered face information is fed into this module. The output initial overall score is assigned to the corresponding input sliding window.
The basic rule of score scaling is that the nearer the face bounding box is to the anchor of a sliding window, the higher the initial overall score is. The anchor is located at the most likely position that a face of an upright human body will appear at. According to this rule, the scaled score is computed by (8), as illustrated in Figure 12 below. The highest scaled score appears at d m = 0, which means that the face bounding box is located exactly at the anchor, and the sliding window is temporarily considered to be the most likely to contain a human body. The scaled score decreases as the face bounding box moves away from the anchor and reaches 0 when d m = d 0 . The parameter d 0 is the zero-scaled score distance, as marked in Figure 1. d 0 should be linearly related to the size of the sliding window to adapt to the changes in the size of the area in which the faces are likely to appear, as introduced by the varying sliding window sizes, so we set d 0 = d hr · b h . The scaled score becomes negative when d m > d 0 and reaches its minimum value at the maximum value of d m before the face bounding box is filtered out. Such negative scores can help to eliminate false human body bounding boxes. According to (8), the scaled scores of the five face bounding boxes shown in Figure 1 are ranked as A > B > C = 0 > E > D.
Sensors 2022, 22, x FOR PEER REVIEW 11 of 23 Figure 12. The scaled score divided by versus the distance from the -th face center to the anchor is drawn according to (8). The higher the scaled score is, the more likely the sliding window is to contain a human body.
To increase the weight of the MTCNN detector, the predesigned face score scale is introduced in (8), so that the scaled score of the MTCNN detector is equivalent to the sum score of decision trees. The larger is, the greater the face detector's influence on the result of the integrated human body detector is. In Figure 13, an example is illustrated, where the human body is missed by the ACF detector (left) but is detected by the integrated detector with = 1 (right). In another example illustrated in the top sequence of Figure 14, multiple human bodies are missed, although their faces are correctly detected. This is because the face score is overly small compared with the summed score of up to 2048 decision trees in the ACF body detector. As a result, even with the face score included, the overall score of a bounding box associated with a missed body is still too small to reach the threshold. However, we observe in Figure 15 that when is increased to 5, 8, or 10, the previously missed human body is now detected. This means that as the weight of the MTCNN detector increases, the accuracy of a body bounding box increases. (a) (b) (c) Figure 12. The scaled score divided by f ss versus the distance from the m -th face center to the anchor is drawn according to (8). The higher the scaled score is, the more likely the sliding window is to contain a human body.
To increase the weight of the MTCNN detector, the predesigned face score scale f ss is introduced in (8), so that the scaled score s m sc of the MTCNN detector is equivalent to the sum score of f ss decision trees. The larger f ss is, the greater the face detector's influence on the result of the integrated human body detector is. In Figure 13, an example is illustrated, where the human body is missed by the ACF detector (left) but is detected by the integrated detector with f ss = 1 (right). In another example illustrated in the top sequence of Figure 14, multiple human bodies are missed, although their faces are correctly detected. This is because the face score is overly small compared with the summed score of up to 2048 decision trees in the ACF body detector. As a result, even with the face score included, the overall score of a bounding box associated with a missed body is still too small to reach the threshold. However, we observe in Figure 15 that when f ss is increased to 5, 8, or 10, the previously missed human body is now detected. This means that as the weight of the MTCNN detector increases, the accuracy of a body bounding box increases.  Figure 12. The scaled score divided by versus the distance from the -th face center to the anchor is drawn according to (8). The higher the scaled score is, the more likely the sliding window is to contain a human body.
To increase the weight of the MTCNN detector, the predesigned face score scale is introduced in (8), so that the scaled score of the MTCNN detector is equivalent to the sum score of decision trees. The larger is, the greater the face detector's influence on the result of the integrated human body detector is. In Figure 13, an example is illustrated, where the human body is missed by the ACF detector (left) but is detected by the integrated detector with = 1 (right). In another example illustrated in the top sequence of Figure 14, multiple human bodies are missed, although their faces are correctly detected. This is because the face score is overly small compared with the summed score of up to 2048 decision trees in the ACF body detector. As a result, even with the face score included, the overall score of a bounding box associated with a missed body is still too small to reach the threshold. However, we observe in Figure 15 that when is increased to 5, 8, or 10, the previously missed human body is now detected. This means that as the weight of the MTCNN detector increases, the accuracy of a body bounding box increases.   After score scaling, the maximum scaled score max( ) is assigned to the overall score of the sliding window for initialization. Sliding windows with positive initial scores are more likely to survive the ACF detector, while those with negative initial scores are more likely to be eliminated.

ACF Detector
The ACF detector ( Figure 16) that is employed to detect the human body in the proposed framework is mainly based on the pretrained cascading decision trees [24]. It takes the aggregated color channel, the gradient magnitude channel, and the HOG channel of the sliding window as the input feature vector and outputs the final overall score and the location of the sliding window as the nominal human body bounding box. Note that the final overall score is obtained through the addition of the scaled face score, considered the initial overall score, and the body score. One modification is that the initial overall score of the ACF detector is set as max( ), as mentioned in the score scaling and assigning module.  After score scaling, the maximum scaled score max( ) is assigned to the overall score of the sliding window for initialization. Sliding windows with positive initial scores are more likely to survive the ACF detector, while those with negative initial scores are more likely to be eliminated.

ACF Detector
The ACF detector ( Figure 16) that is employed to detect the human body in the proposed framework is mainly based on the pretrained cascading decision trees [24]. It takes the aggregated color channel, the gradient magnitude channel, and the HOG channel of the sliding window as the input feature vector and outputs the final overall score and the location of the sliding window as the nominal human body bounding box. Note that the final overall score is obtained through the addition of the scaled face score, considered the initial overall score, and the body score. One modification is that the initial overall score of the ACF detector is set as max( ), as mentioned in the score scaling and assigning module. After score scaling, the maximum scaled score max(s m sc ) is assigned to the overall score of the sliding window for initialization. Sliding windows with positive initial scores are more likely to survive the ACF detector, while those with negative initial scores are more likely to be eliminated.

ACF Detector
The ACF detector ( Figure 16) that is employed to detect the human body in the proposed framework is mainly based on the pretrained cascading decision trees [24]. It takes the aggregated color channel, the gradient magnitude channel, and the HOG channel of the sliding window as the input feature vector and outputs the final overall score and the location of the sliding window as the nominal human body bounding box. Note that the final overall score is obtained through the addition of the scaled face score, considered the initial overall score, and the body score. One modification is that the initial overall score of the ACF detector is set as max(s m sc ), as mentioned in the score scaling and assigning module.
the aggregated color channel, the gradient magnitude channel, and the HOG channel of the sliding window as the input feature vector and outputs the final overall score and the location of the sliding window as the nominal human body bounding box. Note that the final overall score is obtained through the addition of the scaled face score, considered the initial overall score, and the body score. One modification is that the initial overall score of the ACF detector is set as max( ), as mentioned in the score scaling and assigning module.   Figure 17. The human bodies in these samples have already been correctly marked by the bounding boxes with high body scores, as shown in the bottom set of images in Figure 17, and the bounding boxes may still have relatively high body scores if they are misplaced by only a short distance. After including the scaled face scores, these misplaced boxes are easily identified as false positives. This also explains why each false positive shown in Figure 17 contains a human face. Though the final overall scores of such false positives are higher than the default threshold, they are much lower than those of the true positives, as shown in Figure 17. Considering this phenomenon, a face threshold s thr , higher than the default threshold and lower than the final overall scores of the true positives, is set to filter out false positives containing faces. This face threshold must be higher than the default threshold, because the final overall scores are increased by the scaled face scores, whereas the default threshold is designed without considering face scores. The face threshold s thr helps with the elimination of false positives introduced by the incorporation of face scores and the implementation of f ss . Note that the default threshold is used if the sliding window does not contain any faces.  Figure 17. The human bodies in these samples have already been cor marked by the bounding boxes with high body scores, as shown in the bottom set ages in Figure 17, and the bounding boxes may still have relatively high body sco they are misplaced by only a short distance. After including the scaled face scores, misplaced boxes are easily identified as false positives. This also explains why each positive shown in Figure 17 contains a human face. Though the final overall scores o false positives are higher than the default threshold, they are much lower than th the true positives, as shown in Figure 17. Considering this phenomenon, a face thre ℎ , higher than the default threshold and lower than the final overall scores of th positives, is set to filter out false positives containing faces. This face threshold mu higher than the default threshold, because the final overall scores are increased b scaled face scores, whereas the default threshold is designed without considering scores. The face threshold ℎ helps with the elimination of false positives introduc the incorporation of face scores and the implementation of . Note that the d threshold is used if the sliding window does not contain any faces.  The procedure of the proposed integrated detector is summarized in Algorithm 1.

Output
Qualified Step 1 For each n: If Equations (2)-(7) are fulfilled: Send s all to the cascading decision trees.
Step 3 If s all = max{s m sc } is implemented: If s all > s thr :

Experiments Analysis and Results
In this section, we first investigate the influences of eight predesigned parameters on the performance of the proposed detector to obtain the optimal parameters. Afterwards, the tuned detector is compared with state-of-the-art methods, and its robustness in is evaluated on various datasets.
In the following experiments, the INRIA pedestrian dataset, Caltech pedestrian dataset, Citypersons dataset, and the ETHZ dataset were used.

•
INRIA pedestrian dataset [13]: The test dataset contains 288 positive color images with 589 labeled human bodies. These images were shot at around eye-level. Most of these human bodies have an upright orientation with some extent of occlusion. • Caltech pedestrian dataset [25]: This dataset consists of approximately 250,000 frames, 640 × 480 in size, and a total of 350,000 annotated bounding boxes. The standard test set with 4024 images and corresponding new annotations [26] were used in subsequent experiments. Each image contains about 1.4 persons.

•
Citypersons dataset [27]: The validation set contains 500 high-resolution images, 1024 × 2048 in size, and a total of 3938 persons. Each validation image contains about 7.9 persons. • ETHZ dataset [28]: This dataset is a collection of 8 video sequences from busy inner-city locations with annotated human bodies. We assessed two representative sequences from this dataset, namely the BAHNHOF sequence and the Sunny Day sequence. As the pretrained ACF detector cannot classify human bodies with very small sizes, ground truths with widths and heights smaller than 32 and 80 pixels, respectively, were filtered out from the image sequences. After this, the BAHNHOF sequence had 999 images with 3341 ground truths and the Sunny Day sequence had 354 images with 1560 ground truths.
For Caltech and Citypersons, pedestrians were allocated to the reasonable subset, heavily occluded subset, and all subset. The reasonable subset is a collection of pedestrians with heights greater than 50 pixels and visibility levels greater than 0.65. For the heavily occluded subset, the visibility lies in the range [0.2, 0.65]. The all subset consists of pedestrians with heights greater than 20 pixels and a visibility level greater than 0.2.
The MTCNN detector utilized in the experiments is based on the convolutional neural network, which not only detects human faces but also locates facial landmarks. It is available online and is well-trained, and therefore, it was directly applied to our detector. Note that, facial landmark locations were discarded, as only the face bounding boxes and probabilities were used in our detector.
The ACF detector is available in the Piotr's MATLAB toolbox version 3.40. It is an Adaboost classifier based on cascading binary decision trees. The ACF detector was pretrained on both the INRIA and Caltech pedestrian datasets, respectively.

Parameter Design
The integrated detector has eight predesigned parameters, namely o f f set, r w , r h in the face bounding boxes filtering module, f wr , f hr , d hr , f ss in the score scaling and assigning module, and s thr in the ACF detector. To fully exploit the power of the integrated detector, we investigated the influences of these parameters on the detection results via control variates in the following experiments. The MTCNN face detector available at [22] and the pretrained ACF detector provided by [24] were utilized, and the pre-designed parameters were initially set as o f f set = 0, r w = 3, r h = 7, f wr = 1/2, f hr = 1/8, d hr = 1/4, f ss = 8, s thr = 25. The integrated detector was tested on the INRIA pedestrian test dataset, which has 589 ground truths. The results were evaluated quantitively by calculating the recall and the log-average miss rate. The recall was calculated as the number of true positives divided by the number of groundtruths. The log-average miss rate refers to the average miss rate over the False Positives Per Image (FPPI) in the range 10 −2 , 10 0 , which can be calculated automatically using Piotr's MATLAB toolbox. The miss rate is defined as (1 − Recall).
Tables 1-6 list the detection results from the integrated detector for various choices of parameters. Their corresponding Receiver Operating Characteristic (ROC) curves are shown in Figure 18. Considering that s thr and f ss are correlated parameters because the face threshold should adapt to the final overall score, their 3D histograms are drawn in Figure 19, instead of using 2D curves and tables. Note that some abbreviations are used in these figures and tables, namely TP (the number of True Positives), FP (the number of False Positives), R (Recall), and AMR (the Average Miss Rate).    According to Table 1 and Figure 18a, the same recall and 3 more false positives are produced when the increases from 0 to 6. This shows that the contribution of the is not obvious in our setup. A value of = 0 is recommended when one wishes to minimize false positives. Tables 2 and 3 and Figure 18b,c show that = 1-4 and ℎ = 7, 8 can produce the maximum number of true positives. Increasing and ℎ leads to less false positives, because relatively small sliding windows are filtered out. However, some true positives are also eliminated when and ℎ are too large. To maintain as many true positives as possible, = 1 and ℎ = 7 are recommended. As for the score scaling and assigning, the anchor best locates at the middle of the width ( = 0.5) and 1/8th of the height ( ℎ = 1/8), as presented in Tables 4 and 5 and Figure 18d,e. This location is in line with the face positions of the most upright adult human bodies. Table 6 and Figure 18f show that the best ℎ is 1/4. As shown in Figure 19a-c, ℎ is inversely proportional to the number of true positives, false positives, and recall, whereas is directly proportional to them. This is because a higher threshold brings in fewer bounding According to Table 1 and Figure 18a Table 6 and Figure 18f show that the best d hr is 1/4. As shown in Figure 19a-c, s thr is inversely proportional to the number of true positives, false positives, and recall, whereas f ss is directly proportional to them. This is because a higher threshold brings in fewer bounding boxes, but more are obtained when the final overall scores are augmented by f ss . To strike a balance between these two parameters, parameters of s thr = 25, f ss = 8 are suggested, which produces the smallest average miss rate, as shown in Figure 18d.
According to these experimental results, the predesigned parameters can be allocated into four types according to their functions. First, r w , r h , and s thr can be increased to eliminate false positives but with the sacrifice of some true positives. In contrast, f ss can be increased to bring in both additional true and false positives. Thirdly, choosing appropriate f wr , f hr , and d hr values can increase the true positives and decrease the false positives at the same time. Finally, the o f f set has a negligible influence on the detection results.

Evaluation
Following an extensive experimental validation study, we can claim that, for the dataset considered, the integrated detector is finely tuned and produces the best detection results when we choose o f f set = 0, r w = 3, r h = 7, f wr = 0.5, f hr = 0.125, d hr = 0.25, f ss = 8, s thr = 25. Furthermore, for the dataset considered, as shown in Figure 20, the integrated detector produces a better performance than the traditional ACF detector with the average miss rate decreasing from 16.85% to 14.29%. By fusing the MTCNN face detector with the body detector, the number of true positives increases from 543 to 549, while the number of false positives decreases from 328 to 319, as shown in Table 7. Some image samples that depict the increased true positives and the eliminated false positives are shown in Figures 21 and 22, respectively. 8, ℎ = 25. Furthermore, for the dataset considered, as shown in Figure 20, the integrated detector produces a better performance than the traditional ACF detector with the average miss rate decreasing from 16.85% to 14.29%. By fusing the MTCNN face detector with the body detector, the number of true positives increases from 543 to 549, while the number of false positives decreases from 328 to 319, as shown in Table 7. Some image samples that depict the increased true positives and the eliminated false positives are shown in Figures 21 and 22, respectively.   , ℎ , and ℎ values can increase the true positives and decrease the false positives at the same time. Finally, the has a negligible influence on the detection results.

Evaluation
Following an extensive experimental validation study, we can claim that, for the dataset considered, the integrated detector is finely tuned and produces the best detection results when we choose = 0, = 3, ℎ = 7, = 0.5, ℎ = 0.125, ℎ = 0.25, = 8, ℎ = 25. Furthermore, for the dataset considered, as shown in Figure 20, the integrated detector produces a better performance than the traditional ACF detector with the average miss rate decreasing from 16.85% to 14.29%. By fusing the MTCNN face detector with the body detector, the number of true positives increases from 543 to 549, while the number of false positives decreases from 328 to 319, as shown in Table 7. Some image samples that depict the increased true positives and the eliminated false positives are shown in Figures 21 and 22, respectively.

Comparison with the State-of-the-Art Detectors
We also compared the proposed detector with state-of-the-art detectors, including the handcrafted feature based detectors HOG + SVM [13], DPM [16], ACF [14], and ACF + HSC [1]; learning based detectors, such as ConvNet [29]; and the deep models YOLOv3 [18], FRCNN [19], FRCNN + BN [19], SAR R-CNN [30], and RPN-BF [31]. The AMRs shown in Table 8 are cited from [1], except for the proposed detector, ACF, and YOLOv3. As shown in the table, the proposed detector produced the lowest AMR of the listed nondeep model-based detectors. It even outperformed YOLOv3 and achieved a performance comparable to that of FRCNN for the INRIA test dataset. Other deep-modelbased detectors produced much lower AMRs by taking advantage of extracting a large number of high-level features and training many epochs on high-end GPUs.   We also compared the proposed detector with state-of-the-art detectors, including the handcrafted feature based detectors HOG + SVM [13], DPM [16], ACF [14], and ACF + HSC [1]; learning based detectors, such as ConvNet [29]; and the deep models YOLOv3 [18], FRCNN [19], FRCNN + BN [19], SAR R-CNN [30], and RPN-BF [31]. The AMRs shown in Table 8 are cited from [1], except for the proposed detector, ACF, and YOLOv3. As shown in the table, the proposed detector produced the lowest AMR of the listed nondeep model-based detectors. It even outperformed YOLOv3 and achieved a performance comparable to that of FRCNN for the INRIA test dataset. Other deep-model-based detectors produced much lower AMRs by taking advantage of extracting a large number of high-level features and training many epochs on high-end GPUs.

Evaluation of Robustness
To evaluate the robustness of the integrated detector, it was tested on the ETHZ dataset, Caltech test set, and Citypersons validation set in the following experiments. The parameters of the integrated detector were still o f f set = 0, r w = 3, r h = 7, f wr = 0.5, f hr = 0.125, d hr = 0.25, and f ss = 8, s thr = 25. The embedded ACF detector was pretrained on the INRIA dataset or the Caltech dataset, if specified.
The results for the two sequences of the ETHZ dataset (Tables 9 and 10) show that the proposed integrated detector produced more true positives and fewer false positives, leading to an increased recall and decreased AMR positives. For the Caltech test set, a decrease in AMRs (Table 11) was also observed. A cross dataset evaluation on the Citypersons dataset was performed. The detectors were first pretrained on the INRIA and Caltech datasets and were then tested on Citypersons validation set. It was observed that the integrated detector significantly decreased the AMR (Figures 23a-c and 24a-c) and increased the recall (Figures 23d-f and 24d-f) for the reasonable, heavily occluded, and all subsets with both pretrained datasets. This, together with the information shown in Tables 9 and 10, indicates that the proposed integrated detector as well as the parameter design are robust when used with the training set and are applicable for use on an unseen dataset. the proposed integrated detector produced more true positives and fewer false positives, leading to an increased recall and decreased AMR positives. For the Caltech test set, a decrease in AMRs (Table 11) was also observed. A cross dataset evaluation on the Citypersons dataset was performed. The detectors were first pretrained on the INRIA and Caltech datasets and were then tested on Citypersons validation set. It was observed that the integrated detector significantly decreased the AMR (Figures 23a-c and 24a-c) and increased the recall (Figures 23d-f and 24d-f) for the reasonable, heavily occluded, and all subsets with both pretrained datasets. This, together with the information shown in Tables 9 and 10, indicates that the proposed integrated detector as well as the parameter design are robust when used with the training set and are applicable for use on an unseen dataset.   We also investigated the robustness of the proposed detector under different thresholds. The results (Table 11, Figures 23 and 24) show that the integrated detector improved the performance under varying thresholds, and the improvement was more significant under higher thresholds. This means that the proposed method is not only robust to thresholds but performs better when fewer false positives are required.
The influence of the minimum size (denoted as ms) of faces on the integrated detector was studied. The smaller the parameter ms, the more faces MTCNN was able to detect. According to Table 11 and Figures 23 and 24, more noticeable improvements were observed in most cases, except for the reasonable and all subsets pretrained on INRIA when ms was decreased to 13 pixels. This means that the performance of the face detector influences the improvements of the whole integrated detector. We also investigated the robustness of the proposed detector under different thresholds. The results (Table 11, Figures 23 and 24) show that the integrated detector improved the performance under varying thresholds, and the improvement was more significant under higher thresholds. This means that the proposed method is not only robust to thresholds but performs better when fewer false positives are required.
The influence of the minimum size (denoted as ms) of faces on the integrated detector was studied. The smaller the parameter ms, the more faces MTCNN was able to detect. According to Table 11 and Figures 23 and 24, more noticeable improvements were observed in most cases, except for the reasonable and all subsets pretrained on INRIA when ms was decreased to 13 pixels. This means that the performance of the face detector influences the improvements of the whole integrated detector.

Detection Speed
With regard to the detection speed, the time costs of detecting 288 INRIA pedestrian test images for different detectors via Intel(R) Core (TM) i7-8565U CPU @ 1.80GHz are compared in Table Figure 25. To increase the true positive rate while maintaining the least false positive rate, the complete MTCNN was utilized in our integrated detector. shown in Figure 25. To increase the true positive rate while maintaining the least false positive rate, the complete MTCNN was utilized in our integrated detector.

Conclusions
In conclusion, we presented an integrated pedestrian detector, namely the MTCNN+ACF detector, for small pedestrian datasets. It detects human bodies considering not only color and edge information but also facial features. The integrated detector aggregates multiple detection tasks uniformly to produce a final overall score, which is the sum of the scaled face score and the body score. The fusion rules and parameter choices were investigated in depth. The idea is simple and easy to implement, but the proposed detector can effectively and robustly improve the detection performance compared to the sole use of ACF detector on various pedestrian datasets. The proposed detector only utilizes the CPU device and does not require any further training; however, it achieves a performance level (14.29%) comparable to deep models such as FRCNN (14%) and YOLOv3 (14.75%) on the small pedestrian dataset. The recall and average miss rate were observed to have a steady increase and decrease, respectively, on the Citypersons, ETHZ, and Caltech datasets. Therefore, the proposed detector is an effective paradigm of multitask collaboration, and it serves as a cost-effective choice for pedestrian detection in the case of limited data and computational resources.

Conclusions
In conclusion, we presented an integrated pedestrian detector, namely the MTCNN + ACF detector, for small pedestrian datasets. It detects human bodies considering not only color and edge information but also facial features. The integrated detector aggregates multiple detection tasks uniformly to produce a final overall score, which is the sum of the scaled face score and the body score. The fusion rules and parameter choices were investigated in depth. The idea is simple and easy to implement, but the proposed detector can effectively and robustly improve the detection performance compared to the sole use of ACF detector on various pedestrian datasets. The proposed detector only utilizes the CPU device and does not require any further training; however, it achieves a performance level (14.29%) comparable to deep models such as FRCNN (14%) and YOLOv3 (14.75%) on the small pedestrian dataset. The recall and average miss rate were observed to have a steady increase and decrease, respectively, on the Citypersons, ETHZ, and Caltech datasets. Therefore, the proposed detector is an effective paradigm of multitask collaboration, and it serves as a cost-effective choice for pedestrian detection in the case of limited data and computational resources.