A Parallel Convolutional Neural Network for Pedestrian Detection

Abstract: Pedestrian detection is a crucial task in many vision-based applications, such as video surveillance, human activity analysis and autonomous driving. Most existing pedestrian detection frameworks focus only on detection accuracy or on model parameters. However, how to balance detection accuracy against model parameters is still an open problem for the practical application of pedestrian detection. In this paper, we propose a parallel, lightweight framework for pedestrian detection, named ParallelNet. ParallelNet consists of four branches, each of which learns different high-level semantic features. We fuse them into one feature map as the final feature representation. Subsequently, the Fire module, which includes Squeeze and Expand parts, is employed to reduce the model parameters; here, we replace some convolution modules in the backbone with Fire modules. Finally, the focal loss is introduced into ParallelNet for end-to-end training. Experimental results on the Caltech-Zhang and KITTI datasets show that, compared with single-branch networks such as ResNet and SqueezeNet, ParallelNet achieves improved detection accuracy with fewer model parameters and lower Giga Floating Point Operations (GFLOPs).


Introduction
Pedestrian detection is an active research area in object detection [1]. The purpose of this study is to detect all pedestrians in each frame and locate their positions, for applications in video surveillance [2], motion detection [3], intelligent transportation [4] and autonomous driving [5,6]. The development of pedestrian detection can be divided into two stages. The first is based on feature engineering; this type of method follows the "feature extractor + classifier" paradigm, such as Histogram of Oriented Gradient (HOG) + Support Vector Machines (SVM) [7], Integral Channel Features (ICF) + AdaBoost [8] and Deformable Part Model (DPM) + Latent Support Vector Machines (LatSVM) [9]. However, these methods still have limited performance in engineering applications, particularly limited generalization performance [10]. The second is based on neural networks. With the development of neural networks [11][12][13], general object detection algorithms have been applied to pedestrian detection, and can be further divided into two categories: anchor-based and anchor-free. Anchor-based methods [14][15][16] mainly use anchors to generate proposal regions. This involves a large number of calculations but achieves high detection accuracy. Anchor-free methods [16][17][18] take advantage of key point prediction, which is more direct and avoids many Intersection over Union (IOU) calculations. However, their performance remains limited when key points overlap. This paper adopts an anchor-based method, and we concentrate on this class of methods hereafter.

The main contributions of this paper are summarized as follows:

1. We design a backbone with a parallel structure, called ParallelNet. It aims to improve the robustness of feature representation and detection accuracy.

2. A large number of Fire modules are employed, aiming to maintain accuracy while reducing the number of model parameters, and the focal loss is subsequently introduced into ParallelNet for end-to-end training.

3. We validate ParallelNet on the Caltech-Zhang and KITTI datasets, and compare the results with VGG16, ResNet50, ResNet101, SqueezeNet and SqueezeNet+. An ablation study was designed to verify the feasibility of ParallelNet. In addition, to prevent the network from overfitting, we adopted data augmentation techniques, including random cropping and horizontal flipping.
The rest of the paper is organized as follows. We first review related works in Section 2. Then, we introduce the architecture of the ParallelNet and the detection head in Section 3. In Section 4, we report our experiments on the Caltech-Zhang and KITTI datasets, including details of experimental procedures and results. We conclude the paper in Section 5.

Pedestrian Detection with Hand-Crafted Features
Before the advent of Convolutional Neural Networks (CNNs), a common approach to pedestrian detection was to extract hand-crafted features within a sliding window over all possible positions and scales. The most significant methods are the VJ method proposed by Viola and Jones [19], and the HOG feature descriptor proposed by Dalal and Triggs [7]. The VJ method detects two consecutive frames, taking advantage of both pedestrian motion and appearance features, whereas the HOG feature descriptor uses gradient feature information to construct a histogram. Dollar et al. proposed ICF [8], which primarily constructs histograms based on feature information between different channels. Since then, a great number of excellent channel-based methods have been proposed one after another [20][21][22][23]. These early works focused more on the design of feature descriptors, and mostly used SVM or Random Forest for classification. However, these methods have limited generalization performance in engineering applications.
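For reference, the classic HOG + SVM pipeline described above is still available in OpenCV; the following is a minimal sketch using the library's built-in pretrained pedestrian detector (the image path is hypothetical):

```python
import cv2

# HOG descriptor paired with OpenCV's pretrained linear SVM for pedestrians.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

img = cv2.imread("street.jpg")  # hypothetical input image
assert img is not None, "image not found"
# Slide the detection window over positions and scales, as described above.
boxes, weights = hog.detectMultiScale(img, winStride=(8, 8), scale=1.05)
for (x, y, w, h) in boxes:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
```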

Pedestrian Detection with CNNs
In recent years, CNNs have become the main method in general object detection [24][25][26], as well as pedestrian detection. Some pioneers applied classic general object detection networks to pedestrian detection. Zhang et al. [27] analyzed the reasons why Faster-RCNN did not perform well in pedestrian detection, and proposed using a Region Proposal Network (RPN) to generate candidate regions directly, followed by a Boosted Forest for classification. Similarly, Li et al. [28] used two subnetworks to detect pedestrians of different scales on the basis of Faster-RCNN, and used a scale-aware weighting mechanism to reduce the impact of object scale on detection accuracy. Mao et al. [29] drew on the idea of aggregating channel features from the Aggregated Channel Features (ACF) method and improved detection performance by adding channel feature information to the CNN model.
Although these methods have made encouraging progress [30], researchers continue to explore new ideas and methods. As pedestrian detection plays an increasingly important role in autonomous driving systems, an increasing amount of research is dedicated to more complex scenarios. Firstly, focusing on occluded pedestrians, Zhang et al. [31] characterized various occlusion patterns by employing an attention mechanism across channels, and used the attention network as an additional component of the Faster-RCNN detector. Secondly, for crowds, Wang et al. [32] proposed Repulsion loss to improve robustness in crowd scenarios by reducing the weight of non-object bounding boxes near the object bounding box. In addition, Shao et al. [33] proposed the crowd-oriented dataset CrowdHuman, in which each image contains about 23 pedestrians on average. Moreover, pretraining on CrowdHuman can also improve performance on other datasets (such as Caltech and CityPersons). The third case involves small-scale objects, whose detection is crucial in driving scenarios, as it enables more timely warnings for users [34]. Song et al. [35] combined somatic topological line localization (TLL) and temporal feature aggregation, which works well for small-scale objects far away from the camera. Figure 1 shows input examples of these three cases.
Ideally, the model will have high energy efficiency and a small size, so that the algorithm can eventually be used on mobile devices. Szegedy et al. [36] proposed replacing one 5 × 5 convolution kernel with two 3 × 3 convolution kernels, which together have the same receptive field; doing so deepens the network while reducing the calculation parameters. Subsequently, several novel convolution kernel designs were proposed in quick succession. Howard et al. [37] proposed Depthwise Separable Convolution, which factorizes traditional convolution into two steps, namely depthwise convolution and pointwise convolution, greatly improving calculation efficiency [38][39][40]. Zhang et al. [41] introduced group convolution followed by a channel shuffling operation. This method guarantees the exchange of information between different groups, and improves calculation efficiency while maintaining accuracy. Iandola et al. [42] proposed the Fire module, which consists of a Squeeze module and an Expand module, significantly reducing the number of parameters while improving detection accuracy [43][44][45]. In this paper, we adopt this innovative convolution module.
Essentially, all of these networks aim to detect pedestrians more precisely with a small model size and real-time inference speed. In this research, we likewise commit to balancing detection accuracy with model parameters.

Pedestrian Detection Benchmarks
Over the years, several datasets for video surveillance have been widely used, such as the Daimler Pedestrian Detection Benchmark (DaimlerDB) [46], the INRIA Person Dataset (INRIA) [7] and the ETH Pedestrian Dataset (ETH) [47]. In the past several years, however, many datasets for autonomous driving applications have been proposed [48], including Caltech [49], KITTI [50] and CityPersons [51]. These datasets are captured by onboard cameras while navigating through crowded areas and are widely used by various methods, in particular Caltech and KITTI. Zhang et al. [52] corrected the labeling errors in the Caltech dataset and provided a new, sanitized version of the annotations, which we call Caltech-Zhang.

Architecture
In this section, we elaborate on the architecture of ParallelNet and its detection head. In Section 3.1, we introduce the strategies used to improve detection accuracy and reduce model parameters. In Section 3.2, we introduce the detection strategies for localization, category and confidence score, respectively.

Strategy 1: Parallel
Inspired by the human brain, experts hope to build a network similar to neural structures, hence the term "neural network". In this type of network, each neuron is responsible for feature extraction. We obtain the weight of each neuron through suitable optimization methods, so that the output is closest to the ground truth. Similarly, in a circuit structure, the principal objective is to obtain the ideal output by adjusting the resistance of the components. Under this circuit analogy, structures such as AlexNet and VGGNet can be regarded as series structures, where each layer of neurons is connected only to the previous layer, as shown in Figure 2a. Increasing the depth of the network may increase its fitting ability; however, it can also bring complications such as overfitting, excessive calculation parameters and vanishing gradients. These challenges are at the heart of the inspiration to design ParallelNet.

Figure 2. The size of the input image is H × W × 3, where H is the height, W is the width, and 3 is the number of channels. Here, the form of a circuit is used to represent the network structure, where (a) represents a neural network with a series structure, and (b) represents a neural network with a parallel structure.
As shown in Figure 2b, we draw branches from the middle layers and extract features from them multiple times. Obviously, with the same network depth, the parallel structure can extract a more discriminative feature representation. This is similar in principle to increasing the network depth. However, it is not difficult to find that the more branches we introduce, the more calculation parameters the network requires. Therefore, we need to further reduce the model parameters.
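To make the series/parallel analogy concrete, the following is a minimal PyTorch sketch (not the paper's exact ParallelNet; the layer widths are illustrative): a series network passes features through a single chain, while a parallel variant taps intermediate layers and fuses the branch outputs.

```python
import torch
import torch.nn as nn

def block(c_in, c_out):
    """One conv stage; a stand-in for the real layers in Figure 2."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU())

class SeriesNet(nn.Module):
    """Figure 2a: each stage feeds only the next one."""
    def __init__(self):
        super().__init__()
        self.stages = nn.Sequential(block(3, 16), block(16, 32), block(32, 63))

    def forward(self, x):
        return self.stages(x)

class ParallelSketch(nn.Module):
    """Figure 2b: branches tap the middle layers; outputs are fused by pixel."""
    def __init__(self):
        super().__init__()
        self.stem = block(3, 16)
        self.trunk1 = block(16, 32)
        self.trunk2 = block(32, 63)
        self.branch1 = block(16, 63)   # taps the stem output
        self.branch2 = block(32, 63)   # taps the mid-trunk output

    def forward(self, x):
        s = self.stem(x)
        m = self.trunk1(s)
        deep = self.trunk2(m)
        return deep + self.branch1(s) + self.branch2(m)  # pixel-wise fusion
```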

Strategy 2: Fire Module
The structure of the SqueezeNet network mentioned above is shown in Figure 3. It uses the Fire module to greatly reduce the number of model parameters while maintaining detection accuracy. Therefore, we introduce the Fire module when optimizing the network structure. The Fire module is shown in Figure 4. As we can see, it consists of two parts, "Squeeze" and "Expand". The Squeeze part has several 1 × 1 convolutional kernels to reduce the dimensions of the input. The Expand part has several 1 × 1 and 3 × 3 convolutional kernels to extract features and increase dimensions, respectively. In Figure 4, s_1×1, e_1×1 and e_3×3 denote the numbers of kernels in each part.
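As a concrete reference, a SqueezeNet-style Fire module can be sketched in PyTorch as follows; the channel arguments correspond to s_1×1, e_1×1 and e_3×3 in Figure 4 (this is a generic sketch, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Fire module: a 1x1 Squeeze layer followed by parallel 1x1 and 3x3 Expand layers."""
    def __init__(self, in_channels, s1x1, e1x1, e3x3):
        super().__init__()
        self.squeeze = nn.Conv2d(in_channels, s1x1, kernel_size=1)        # reduce dims
        self.expand1x1 = nn.Conv2d(s1x1, e1x1, kernel_size=1)             # cheap features
        self.expand3x3 = nn.Conv2d(s1x1, e3x3, kernel_size=3, padding=1)  # spatial features
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        # Concatenate the two Expand paths along the channel dimension.
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)
```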

Based on the above strategies, we designed ParallelNet, whose structure is shown in Figure 5. Taking the Caltech-Zhang dataset as input, Table 1 gives the specific parameter settings of each layer in ParallelNet. It is not difficult to find that ParallelNet uses a large number of Fire modules in each branch, and the output of each branch is a feature map with a size of 15 × 20 × 63.

Detection Head
As mentioned in the previous section, each branch outputs a feature map. We add the corresponding pixels of the four feature maps to obtain one feature map, and feed it into the detection network. Figure 6 roughly illustrates the detection process.
From Figure 6, we can see that the feature map includes three parts: localization, category and confidence score.
Figure 6. If the number of anchors in the experiment is assumed to be K, the dimension of the prediction tensor will be K(2 + 1 + 4). Among them, 2 represents the number of categories, that is, pedestrian/not pedestrian, 1 represents the confidence score, and 4 represents the coordinates of the bounding box.
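As an illustration of this prediction tensor layout, the sketch below assumes a single convolutional layer as the detection layer (the Conclusions state that a convolutional layer generates the bounding boxes); with K anchors per cell it emits K(2 + 1 + 4) channels at each feature map location:

```python
import torch
import torch.nn as nn

K = 9  # number of anchors per cell, as set in the Training Details section
# One conv layer maps the fused 63-channel feature map to K*(2+1+4) channels.
head = nn.Conv2d(63, K * (2 + 1 + 4), kernel_size=3, padding=1)

fused = torch.randn(1, 63, 15, 20)           # fused output of the four branches
pred = head(fused)                           # shape (1, K*7, 15, 20)
pred = pred.permute(0, 2, 3, 1).reshape(1, 15, 20, K, 7)
cls_logits, conf, bbox = pred.split([2, 1, 4], dim=-1)  # category / score / box
```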

Bounding Box Loss
How do we get the coordinates of the bounding box? Assume that the size of the feature map is H × W, where H and W are the height and width, respectively, and K is the number of anchors. In the detection stage, we carry out the following steps:

1. Slide the window to obtain the coordinates of each anchor, denoted as (x̂_i, ŷ_j, ω̂_k, ĥ_k), where i ∈ [1, W], j ∈ [1, H], k ∈ [1, K]. The ground truth is known, denoted as (x^G, y^G, ω^G, h^G).

2. Calculate the offset, denoted as (δx_ijk, δy_ijk, δω_ijk, δh_ijk), by encoding Equation (1). After a period of training and optimization, the offset converges below a threshold, which means that the predicted bounding box is close to the ground truth.

3. Calculate the coordinates of the predicted bounding box, denoted as (x^P_i, y^P_j, ω^P_k, h^P_k), by decoding Equation (2).

Consequently, the loss function of the bounding box optimizes the offset between the predicted and the ground-truth bounding box, as in Equation (3), where W, H and K are the width, height and number of anchors of the feature map, respectively; N_obj is the number of objects; for the anchor located at (i, j, k), I_ijk = 1 if it contains an object, else I_ijk = 0; λ_bbox is a weighting coefficient.
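The bodies of Equations (1)-(3) do not survive in this text. A plausible reconstruction, assuming the SqueezeDet-style anchor parameterization that the surrounding definitions and hyperparameters suggest (the paper's exact forms may differ), is:

```latex
% Assumed SqueezeDet-style forms; notation follows the definitions above.
\begin{align}
&\delta x_{ijk} = \frac{x^{G} - \hat{x}_i}{\hat{\omega}_k}, \quad
 \delta y_{ijk} = \frac{y^{G} - \hat{y}_j}{\hat{h}_k}, \quad
 \delta \omega_{ijk} = \log\frac{\omega^{G}}{\hat{\omega}_k}, \quad
 \delta h_{ijk} = \log\frac{h^{G}}{\hat{h}_k} && \text{(1, encode)} \\
&x^{P}_i = \hat{x}_i + \hat{\omega}_k\, \delta x_{ijk}, \quad
 y^{P}_j = \hat{y}_j + \hat{h}_k\, \delta y_{ijk}, \quad
 \omega^{P}_k = \hat{\omega}_k\, e^{\delta \omega_{ijk}}, \quad
 h^{P}_k = \hat{h}_k\, e^{\delta h_{ijk}} && \text{(2, decode)} \\
&L_{bbox} = \frac{\lambda_{bbox}}{N_{obj}} \sum_{i=1}^{W} \sum_{j=1}^{H} \sum_{k=1}^{K}
 I_{ijk} \sum_{u \in \{x,\, y,\, \omega,\, h\}}
 \left(\delta u_{ijk} - \delta u^{G}_{ijk}\right)^{2} && \text{(3, loss)}
\end{align}
```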

Classification Loss
In terms of classification, pedestrian detection is a binary classification problem. The most commonly used loss function for binary classification is the cross-entropy loss, in which C is the number of categories; p_c is the output of the detection head, which represents the probability that a bounding box contains an object; l_c ∈ {0, 1} is the ground truth; and λ_class is a weighting coefficient. However, reviewing ParallelNet, we use four branches to extract features separately. Obviously, we can extract a more discriminative feature representation, but this also means that we may be disturbed by more background information, which is not conducive to robust training. In addition, in the driving scene, the background information is complex, and there is a problem of extreme foreground-background class imbalance. Therefore, we use focal loss [53] as the classification loss function, where γ is the focusing parameter, which smoothly down-weights the loss for easy examples, and α is the balance parameter, used to balance the proportions of positive and negative instances.
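For reference, the standard binary cross-entropy and the focal loss of [53] take the following forms, with α_t equal to α₁ for positives and α₀ for negatives, as set in the Training Details section:

```latex
% Standard forms from the focal loss paper [53]; notation follows the text above.
\begin{align}
&CE(p_c, l_c) = -\, l_c \log p_c - (1 - l_c) \log (1 - p_c) \\
&FL(p_t) = -\, \alpha_t \, (1 - p_t)^{\gamma} \log p_t, \qquad
 p_t = \begin{cases} p_c, & l_c = 1 \\ 1 - p_c, & l_c = 0 \end{cases}
\end{align}
```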

Confidence Score Loss
In terms of confidence score, we define the confidence score as p_c × IOU, where IOU is the intersection-over-union between the predicted bounding box and the ground-truth bounding box. Similarly, the loss function of the confidence score also takes negative examples into account: IOU⁺ is the IOU between a positive instance and the ground truth; IOU⁻ is the IOU between a negative instance and the ground truth; λ⁺_conf and λ⁻_conf are weighting coefficients.
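The confidence loss equation is also missing here; a plausible SqueezeDet-style form consistent with the stated coefficients (an assumption, not the paper's verbatim equation) would be:

```latex
% Assumed form: positive anchors regress toward their IOU with the ground
% truth; negative anchors are pushed toward a low confidence score.
\begin{equation}
L_{conf} = \frac{\lambda^{+}_{conf}}{N_{obj}} \sum_{i,j,k} I_{ijk}
 \left(p_{ijk} - IOU^{+}\right)^{2}
 + \frac{\lambda^{-}_{conf}}{WHK - N_{obj}} \sum_{i,j,k} \left(1 - I_{ijk}\right)
 \left(p_{ijk} - IOU^{-}\right)^{2}
\end{equation}
```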

Experimental Results
In this section, we describe our experimental procedures and results to evaluate the proposed method. First, we introduce the basic experimental setup and implementation details. Then, the results of ParallelNet on the Caltech-Zhang and KITTI datasets are compared with VGG16, ResNet50, ResNet101, SqueezeNet and SqueezeNet+. We further verify the feasibility of the parallel network through ablation experiments. Finally, we explain some of the tricks applied during the experiments and show the detection results.

Datasets
The two datasets used in this paper are Caltech-Zhang and KITTI. Based on the original Caltech pedestrian dataset, Zhang et al. corrected several types of errors in the existing annotations, such as misalignments, missing annotations (false negatives), false annotations (false positives) and the inconsistent use of "ignore" regions. Caltech-Zhang has a total of 5039 images, and the input image size is 480 × 640.
KITTI is a benchmark designed for autonomous driving, covering the tasks of stereo, optical flow, visual odometry/SLAM (Simultaneous Localization and Mapping) and 3D object detection. There are 7481 images in total, and the size of the input image is 1242 × 375. The dataset includes several kinds of labels, such as car, truck, pedestrian and tram; however, we only detect the pedestrian category.

Training Details
In the experiments, we train the model on an NVIDIA 2080Ti GPU, and randomly divide each dataset into two equal parts as the training set and the validation set. We set the batch size to 20 and the number of anchors to nine. We used Stochastic Gradient Descent (SGD) with momentum to optimize the loss function during training, where the momentum was set to 0.9. We set the initial learning rate to 0.001, and after every 10,000 steps it is decayed by half; λ_class = 1, λ⁺_conf = 75, λ⁻_conf = 100, λ_bbox = 5, α₁ = 0.25, α₀ = 0.75, γ = 2. The final stage uses Non-Maximum Suppression (NMS) to filter the bounding boxes.
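A minimal PyTorch sketch of these optimization settings follows; the model and loss are placeholders, not the actual ParallelNet or its detection loss:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 63, kernel_size=3, padding=1)  # stand-in for ParallelNet

# SGD with momentum 0.9 and an initial learning rate of 0.001.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# Decay the learning rate by half after every 10,000 steps.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10000, gamma=0.5)

for step in range(3):  # toy loop; the real run trains ~9700 steps with batch size 20
    x = torch.randn(2, 3, 480, 640)   # Caltech-Zhang input size (toy batch of 2)
    loss = model(x).mean()            # placeholder for the total detection loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```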

Comparative Experiments
We experimented on the Caltech-Zhang and KITTI datasets, respectively. In addition to the ParallelNet proposed in this paper, we also conducted experiments with the same settings on the VGG16, ResNet50, ResNet101, SqueezeNet and SqueezeNet+ backbones, then fed the results into the same detection network, as described in Section 3.2. The experimental results are shown in Table 2. It should be pointed out that SqueezeNet+ deepens the network on the basis of SqueezeNet, and uses a 7 × 7 convolution kernel to increase the receptive field.

Table 2. Comparison results of ParallelNet with some of the most commonly used backbones, evaluated on the Caltech-Zhang and KITTI datasets. AP is the average precision, which measures detection accuracy; FPS is the number of images processed per second, which represents detection speed; # Param is the number of model parameters (lower is better); GFLOPs is the floating point operations per image (also lower is better).

Compared to VGG16, ResNet50 and ResNet101, ParallelNet achieved 35.71 FPS on Caltech-Zhang and 32.26 FPS on KITTI. The good performance in detection speed benefits from the employment of the Fire module. Simultaneously, although the architecture of ParallelNet seems complicated, the Fire module greatly reduces the model parameters and GFLOPs, which can also be seen from the experimental results.
Compared to SqueezeNet and SqueezeNet+, which also employ the Fire module, ParallelNet achieved the best detection accuracy: 54.61% AP on Caltech-Zhang and 77.9% AP on KITTI. Unfortunately, the detection speed of ParallelNet is not as fast as SqueezeNet. This is not difficult to understand, because ParallelNet constructs branches in parallel to extract more feature representation, which inevitably increases the GFLOPs. However, from the experimental results, we can also find that the detection speed of ParallelNet is better than that of SqueezeNet+. This shows that the "parallel connection", compared to the "series connection", balances detection accuracy and model parameters well. It also provides a new idea for the construction of backbones.

Ablation Study
In the structure of Figure 5, we introduced parallel branches and finally obtained four equal-sized feature maps as output. In the next experiment, we increase the number of parallel branches in turn, and compare the results on the Caltech-Zhang dataset, as shown in Table 3.

Table 3. In the table, 1-branch means that no parallel branch is introduced, i.e., there is only one branch; 2-branches means that one parallel branch is introduced, for a total of 2 branches; and so on.

Judging from the experimental results, the detection accuracy improves with each additional branch. However, with the improvement of detection accuracy, the detection speed decreases. It is not difficult to find the reason: as the number of branches increases, the model becomes more complex, which means that we need more time and resources to train the network. Beyond a certain point, detection accuracy and detection speed can no longer be balanced, which means we cannot increase the number of parallel branches indefinitely. Therefore, under comprehensive consideration, 4-branches was selected as the final solution.

Training Results
In the training stage, we have given quantitative experimental results. In this part, we give some intuitive, graphical results. With the settings described in Section 4.1.2, ParallelNet was trained for 9700 steps. Figure 7 shows the training loss along with the number of training steps. The loss curves, including the bounding box loss, classification loss and confidence score loss, all converge toward a small value. The total loss converges from 35.2 to about 0.4, which indicates that the model has been fully trained.

Testing Results
In the testing stage, some detection results are shown in Figure 8. We randomly selected some representative instances. Taking Figure 8a as an example, the two pictures in the first column show that the model performs well on single and small-scale pedestrians; the two in the second column show that the model gives correct detection results whether the pedestrian is seen from the side or the back; and the two in the last column show some more complex situations, such as occluded pedestrians and multi-scale pedestrians. Figure 8b likewise illustrates a selection of detection results on the KITTI dataset.


Tricks
Besides the experimental procedures described above, we also introduced several tricks:

1. To prevent the network from overfitting, an L2 loss is added to the total loss function. In addition, data augmentation techniques are used, including random cropping and horizontal flipping. Figure 9 shows some detection examples after data augmentation.

2. As mentioned above, ParallelNet outputs four feature maps of equal size, which need to be merged into one feature map before being fed into the detection head. There are several alternative fusion methods, including concatenation by channel, addition by pixel, averaging by pixel and maximum by pixel (a minimal comparison of these operations is sketched after this list). In our experiments, the detection accuracy of these methods was similar, but pixel-wise addition requires the least computation, so addition by pixel was finally selected.

Figure 9. Taking the results on Caltech-Zhang as an example, each pair of adjacent pictures shows the detection output of the same input after data augmentation.
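As referenced in item 2, the four candidate fusion operations can be compared on dummy branch outputs as follows (the tensor sizes mirror the 15 × 20 × 63 branch outputs from Section 3):

```python
import torch

# Four equal-sized branch outputs: 63-channel, 15 x 20 feature maps (batch of 1).
branches = [torch.randn(1, 63, 15, 20) for _ in range(4)]
stacked = torch.stack(branches)                # (4, 1, 63, 15, 20)

fused_add = stacked.sum(dim=0)                 # add by pixel (the method chosen)
fused_avg = stacked.mean(dim=0)                # average by pixel
fused_max = stacked.max(dim=0).values          # maximum by pixel
fused_cat = torch.cat(branches, dim=1)         # concatenate by channel

print(fused_add.shape, fused_cat.shape)        # (1, 63, 15, 20) vs (1, 252, 15, 20)
```

Note that concatenation quadruples the channel count, so the detection head would need four times as many input channels, whereas the pixel-wise operations preserve the 63-channel layout.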


Conclusions
This paper proposed ParallelNet, a backbone composed of four parallel branches, applied to the pedestrian detection task. To satisfy the requirements of both high accuracy and small model size, we further applied the Fire module in the network. Subsequently, in the detection stage, a convolutional layer is added to generate the detection bounding boxes, which also reduces the model parameters. Moreover, focal loss is applied to ameliorate the problem of extreme foreground-background class imbalance. Experimental results on the Caltech-Zhang and KITTI datasets show that, compared with one-branch networks such as ResNet and SqueezeNet, ParallelNet achieves improved detection accuracy with fewer model parameters and lower GFLOPs. In addition, compared to increasing the depth of the network, constructing the network in parallel is better at both improving detection accuracy and reducing parameters. In future work, we will apply this idea to other excellent lightweight networks, such as MobileNet, ShuffleNet and the recently proposed MixNet.