2.1. Datasets
The main goal of this research is to investigate the potential of these network optimization techniques, which are developed on academic datasets, when applied to operational datasets. To this end, we apply our techniques to two typical use-cases: an academic dataset, Pascal VOC [7], and an operational dataset, the LWIR Railway Surveillance Data [20], and investigate in each case how much the networks can be slimmed down.
Pascal VOC is a dataset for object detection that contains 21,503 images with 20 different classes. As seen in Figure 1a, the classes are completely unrelated (e.g., “aeroplane”, “cow”, “person”, “tvmonitor”), which is typical of academic datasets and makes detection on these data much more difficult. For our experiments, we used the Pascal VOC 2007+2012 dataset and combined the “training 2007”, “training 2012” and “validation 2012” splits for training, whilst using the “validation 2007” split for validation. Because the annotations for the VOC 2012 testing set are not publicly released, we only use the 2007 testing set for testing purposes. The number of images per split is shown in Table 1.
The publicly available LWIR Railway Surveillance Data (https://iiw.kuleuven.be/onderzoek/eavise/viper/dataset, accessed on 31 March 2021) is a dataset for person detection in long-wave infrared videos. Its single-class nature and fixed camera viewpoint are two scene constraints that are common in real-life scenarios, which makes it a prime example of an operational dataset (see Figure 1b). Moreover, comparing the person cutouts in Figure 2 shows that there is much less intra-class variance: the academic Pascal VOC data contain persons annotated from many different viewpoints and distances, whereas all persons in the LWIR data have a similar appearance, size and viewpoint. The dataset consists of 21,852 frames split across 28 different video sequences. Since the original paper for this dataset does not provide train, validation and test splits, we created our own. Care must be taken not to put frames of the same video sequence in different splits, as the models could then overfit to the data without this being noticed. We thus split the 28 sequences into three subsets, trying to match the 65-10-25% split from VOC. The video sequences in each split are listed in Table 2 and the number of images in each split is given in Table 1.
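The sequence-level split described above can be sketched in a few lines of Python. This is only an illustration and not the procedure used to produce Table 2: the function name `split_sequences`, the greedy assignment strategy and the frame counts are all assumptions made for the example.

```python
def split_sequences(frame_counts, ratios=(0.65, 0.10, 0.25)):
    """Assign whole video sequences to train/val/test splits.

    frame_counts: dict mapping a sequence name to its number of frames.
    ratios: target fraction of frames for (train, val, test).
    """
    names = ("train", "val", "test")
    total = sum(frame_counts.values())
    targets = {n: r * total for n, r in zip(names, ratios)}
    splits = {n: [] for n in names}
    sizes = {n: 0 for n in names}

    # Place the longest sequences first; each one goes to the split that is
    # currently furthest below its target number of frames. A sequence is
    # never divided, so no frames of one video end up in two splits.
    for seq, count in sorted(frame_counts.items(), key=lambda kv: -kv[1]):
        best = max(names, key=lambda n: targets[n] - sizes[n])
        splits[best].append(seq)
        sizes[best] += count
    return splits, sizes


# Hypothetical frame counts for 28 sequences (not the real dataset statistics).
frame_counts = {f"seq{i:02d}": 500 + 40 * i for i in range(28)}
splits, sizes = split_sequences(frame_counts)
total_frames = sum(frame_counts.values())
for name in ("train", "val", "test"):
    print(name, len(splits[name]), f"{sizes[name] / total_frames:.1%}")
```

Because sequences have different lengths, the resulting fractions only approximate the 65-10-25% target.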
Our operational dataset contains many scene constraints and is much less challenging than the Pascal VOC data. This is done on purpose, as it allows us to cover two completely different kinds of datasets of differing complexity. By validating our techniques on these distinct datasets, we make a strong case for the generalizability of our optimization pipeline.
2.2. Depth-Wise Separable Convolutions
A first technique to reduce the number of computations is to replace all regular convolutions in the network with depth-wise separable convolutions. Initially introduced by Sifre and Mallat [10] and later popularized by Howard et al. [11] in their MobileNet paper, depth-wise separable convolutions are a form of factorized convolution that splits a regular convolution into a depth-wise and a point-wise convolution (see
Figure 3). A standard convolution both filters and combines information from multiple previous feature maps in a single step, resulting in the following computational cost:
$$D_K \cdot M \cdot N \cdot D_F$$
where $D_K$ and $D_F$ are the dimensions (width × height) of the kernel and feature map, respectively, and $M$ and $N$ are the depths of the input and output feature maps.
Depth-wise separable convolutions split this into two distinct operations: a depth-wise convolution that applies a filter and a point-wise convolution that combines the information of multiple filters. This results in a total computational cost of:
$$D_K \cdot M \cdot D_F + M \cdot N \cdot D_F$$
Notice that both regular and depth-wise separable convolutions combine the same information and generate an output with the same shape. However, the computational cost of depth-wise separable convolutions is clearly lower:
$$\frac{D_K \cdot M \cdot D_F + M \cdot N \cdot D_F}{D_K \cdot M \cdot N \cdot D_F} = \frac{1}{N} + \frac{1}{D_K}$$
Our baseline network, YoloV2, is a fully convolutional network, where each convolution uses a kernel size of 3 × 3 ($D_K = 9$). As the number of output feature maps in that network is always much larger than nine, the $1/N$ term is small and we can expect our network to require roughly nine times fewer computations when swapping regular convolutions for depth-wise separable convolutions. However, as seen in Section 3, replacing regular convolutions with depth-wise separable convolutions results in a significant drop in Average Precision (AP). Indeed, depth-wise separable convolutions have a more restricted modeling capability than regular convolutions, and thus a drop in accuracy might be expected.
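As a quick numerical check of this cost comparison, the following Python sketch counts multiply-accumulate operations for both convolution types. The function names and the example layer are assumptions chosen for illustration; `d_k` and `d_f` denote the kernel and feature map dimensions (width × height), and `m` and `n` the input and output depths.

```python
def regular_conv_macs(d_k, d_f, m, n):
    """MACs of a regular convolution: every output channel filters and
    combines all input channels at every output position."""
    return d_k * m * n * d_f


def separable_conv_macs(d_k, d_f, m, n):
    """MACs of a depth-wise convolution followed by a point-wise one."""
    depthwise = d_k * m * d_f  # one spatial filter per input channel
    pointwise = m * n * d_f    # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 26 x 26 feature map, 256 -> 512 channels.
d_k, d_f, m, n = 3 * 3, 26 * 26, 256, 512
ratio = separable_conv_macs(d_k, d_f, m, n) / regular_conv_macs(d_k, d_f, m, n)
print(f"cost ratio: {ratio:.4f} (1/n + 1/d_k = {1 / n + 1 / d_k:.4f})")
```

For this layer the ratio is dominated by the kernel term, which is why a reduction of close to nine times can be expected for 3 × 3 kernels.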
We determined experimentally that both the first and last convolutions work best with the extra modeling capabilities of regular convolutions. We thus decided to keep regular convolutions for the first convolution of the network, as well as for the second-to-last convolution. The latter combines information from two feature maps from different parts of the network (see Figure 4). We named this optimized architecture MobileYoloV2. Whilst significantly faster, this architecture still shows a notable reduction in accuracy compared to the original network.
One of the easiest ways to improve the performance of single-shot detectors is to increase the resolution of the images going into the network. However, this also significantly increases the number of computations (see Table 3). Instead, we reduce the amount of downsampling that the model performs. This makes the output resolution of the model bigger, and thus the model is able to perform better. To limit the impact on the computational performance, we modify the network as close to the end as possible. We therefore increase the output resolution by removing the “reorg” operation introduced before the concatenation of two feature maps of different dimensions. Instead of chopping up the spatially bigger feature map, we upsample the smaller one (see Figure 5). Our modification does not change the aim of the concatenation operation, which is to combine shallow but spatially fine-grained features with higher-level, more downsampled features (see Figure 4). It results in spatially bigger feature maps with more fine-grained details, potentially allowing for better detection and/or localization. As the concatenation happens at the end of the network, with only two convolutions remaining, the influence on the computational complexity remains rather limited (see Table 3). In fact, as both YoloV2 and MobileYoloV2 have the same last two layers (a regular convolution followed by a final point-wise convolution), the computational overhead of upsampling is exactly 6.11 Giga Multiply-Accumulate operations (GMAC) for both networks. We named these models with upsampling YoloV2 Upsample and MobileYoloV2 Upsample, respectively.
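The effect of swapping the “reorg” operation for upsampling can be illustrated by tracking feature map shapes. The sketch below is a simplified shape calculation and not the Lightnet implementation; the helper names and the example shapes are assumptions, and “reorg” is modeled as a space-to-depth rearrangement with stride 2.

```python
def reorg_shape(c, h, w, stride=2):
    """Reorg (space-to-depth): fold each stride x stride spatial block
    into the channel axis, shrinking the map spatially."""
    return c * stride * stride, h // stride, w // stride


def upsample_shape(c, h, w, scale=2):
    """Upsampling keeps the channels and enlarges the map spatially."""
    return c, h * scale, w * scale


def concat_shape(a, b):
    """Channel-wise concatenation needs matching spatial dimensions."""
    assert a[1:] == b[1:], "spatial dimensions must match"
    return a[0] + b[0], a[1], a[2]


# Hypothetical feature maps as (channels, height, width).
fine = (64, 26, 26)      # shallow, spatially fine-grained features
coarse = (1024, 13, 13)  # deep, more downsampled features

with_reorg = concat_shape(reorg_shape(*fine), coarse)
with_upsample = concat_shape(fine, upsample_shape(*coarse))
print("reorg:   ", with_reorg)     # (1280, 13, 13)
print("upsample:", with_upsample)  # (1088, 26, 26)
```

With upsampling, the remaining convolutions operate on four times as many spatial positions, which is where the extra computations come from.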
2.3. Channel-Wise Pruning
A second method to reduce the computational complexity of a neural network is to reduce the number of channels of the intermediate convolutions and feature maps. This method is called pruning and relies on the fact that networks tend to be highly overparameterized [26,27]. However, contemporary experience seems to indicate that overparameterized networks are easier to train [15,27,28]. Pruning exploits this fact by removing redundant and low-importance filters from a trained network. When training models on small operational datasets, we use transfer learning to adapt a pretrained network to our specific use-case. When retraining, we want to keep the complex modeling capabilities of the original network. Afterwards, pruning allows us to remove the redundancy in the network, which is much more prevalent in the case of constrained operational datasets.
Since our network is fully convolutional and we aim to run the network on a Graphics Processing Unit (GPU), we focus on the channel-wise pruning of convolutional filters. Many pruning implementations emulate pruning by masking or replacing kernel weights with zeroes [29,30,31]. While this is a completely valid approach that can help to reduce the size of the weight files, it does not reduce the number of computations when implemented on GPUs. Our framework actually removes the channels from the convolutions during the pruning step, reducing the computational complexity of the network. As shown in Figure 6, removing a channel from a convolution modifies the dimension of the output feature map. This in turn influences the next operation that the network performs with that feature map. Adapting that operation is trivial for the simple case of a linear sequence, depicted in Figure 6, but requires careful dependency tracking when working with more complex networks. Hence, we implemented a generic convolutional pruning framework in our open source Lightnet library [25] for PyTorch [29], which generates a dependency tree of the operations after each convolution and then takes care of adapting these operations. Most network architectures add a batch normalization layer and a non-linear activation after each convolution. The normalization layer contains parameters specific to the different channels of the feature map and thus needs to be adapted. However, this is not the case for the activation layers, nor for pooling operations. Finally, our framework is also capable of tracking the feature map channels when concatenating or stacking multiple feature maps together, and thus allows us to prune these convolutions as well. One limitation of our framework is that we cannot prune convolutions whose feature maps are used in element-wise operations with other feature maps, such as residual connections. Indeed, pruning these convolutions would require pruning the same channels from both convolutions, which is not currently implemented in our framework.
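For the simple linear case, the propagation of a removed channel can be sketched with plain Python lists. This is a toy illustration of the idea and not the Lightnet API: the function name and the data layout (weights as nested lists indexed by output and input channel, with the spatial kernel dimensions omitted) are assumptions.

```python
def prune_output_channel(conv_w, bn_scale, bn_shift, next_conv_w, idx):
    """Remove output channel `idx` of a convolution in a linear
    conv -> batch norm -> activation -> conv sequence.

    conv_w and next_conv_w are nested lists indexed [out_channel][in_channel].
    """
    # 1. Drop the filter that produces the channel.
    conv_w = [f for i, f in enumerate(conv_w) if i != idx]
    # 2. Batch normalization holds per-channel parameters, so adapt it too.
    bn_scale = [s for i, s in enumerate(bn_scale) if i != idx]
    bn_shift = [s for i, s in enumerate(bn_shift) if i != idx]
    # 3. Activations and pooling have no per-channel parameters: nothing to do.
    # 4. The next convolution loses the matching *input* channel.
    next_conv_w = [[w for j, w in enumerate(f) if j != idx] for f in next_conv_w]
    return conv_w, bn_scale, bn_shift, next_conv_w


# Toy network: 3 -> 4 channels, batch norm, then 4 -> 2 channels.
conv_w = [[0.1 * (o + i) for i in range(3)] for o in range(4)]
bn_scale, bn_shift = [1.0, 1.1, 1.2, 1.3], [0.0, 0.1, 0.2, 0.3]
next_conv_w = [[float(10 * o + i) for i in range(4)] for o in range(2)]

conv_w, bn_scale, bn_shift, next_conv_w = prune_output_channel(
    conv_w, bn_scale, bn_shift, next_conv_w, idx=2
)
print(len(conv_w), len(bn_scale), [len(f) for f in next_conv_w])
```

Unlike masking, the tensors really become smaller here, so the next layer performs fewer multiply-accumulate operations.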
Once the dependency tree has been built, the actual pruning starts. Our iterative pruning pipeline is described in pseudo code in Algorithm 1. We iteratively prune a fixed percentage of our model and then retrain it for a maximum number of epochs E, in order to reach the same accuracy as the original model. Note that the absolute number of pruned channels at each step diminishes over time, because the network itself becomes smaller. In order to limit the total runtime of the algorithm, we set a hard limit on the minimal number of channels to prune: once the algorithm prunes fewer than 5 channels of the network per step, the pruning stops. If the new model reaches an accuracy during retraining that exceeds the original accuracy by a set margin, training is stopped early. This prevents overfitting on the validation set and allows us to set a higher value for the number of epochs E without unnecessarily long retraining times for the first few iterations of the pruning algorithm. Finally, if we cannot reach this target after retraining for E epochs, we still continue our pruning pipeline if the accuracy is only slightly below the original accuracy. The rationale behind this is that our validation set is usually small, and thus a minor drop in accuracy on this dataset might not be representative of the entire data. We therefore set the lower bound slightly below the original validation accuracy. Note that we use a separate validation set with this pruning algorithm, to prevent overfitting on our test set. Only the final models after pruning are tested on the test set and compared with the original models for verification.
Algorithm 1: Pruning pipeline.
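The control flow of such an iterative pipeline can be sketched in Python as follows. This is a minimal sketch based on the textual description above and not a transcription of Algorithm 1: the callables `prune_step`, `train_epoch`, `evaluate` and `count_channels`, the parameter names and the toy model are all hypothetical, and the margin values are placeholders.

```python
def iterative_pruning(model, prune_step, train_epoch, evaluate, count_channels,
                      max_epochs, upper_margin, lower_margin, min_pruned=5):
    """Iteratively prune and retrain until accuracy can no longer be recovered.

    prune_step(model)     -> pruned copy of the model
    train_epoch(model)    -> trains the model in place for one epoch
    evaluate(model)       -> accuracy on the validation set
    count_channels(model) -> number of prunable channels left
    """
    original_acc = evaluate(model)
    best = model
    while True:
        candidate = prune_step(best)
        removed = count_channels(best) - count_channels(candidate)
        if removed < min_pruned:  # pruning steps became too small: stop
            break
        acc = evaluate(candidate)
        for _ in range(max_epochs):
            train_epoch(candidate)
            acc = evaluate(candidate)
            if acc >= original_acc + upper_margin:
                break  # target reached: stop retraining early
        if acc < original_acc - lower_margin:
            break  # accuracy could not be recovered: keep the previous model
        best = candidate
    return best


# Toy stand-in for a detector: accuracy saturates lower for smaller models.
def prune_step(m):
    return {"channels": m["channels"] - m["channels"] // 10, "acc": m["acc"] - 0.05}

def train_epoch(m):
    cap = min(0.80, 0.5 + m["channels"] / 1000)
    m["acc"] = min(cap, m["acc"] + 0.02)

def evaluate(m):
    return m["acc"]

def count_channels(m):
    return m["channels"]

pruned = iterative_pruning({"channels": 1000, "acc": 0.80}, prune_step,
                           train_epoch, evaluate, count_channels,
                           max_epochs=5, upper_margin=0.0, lower_margin=0.01)
print(pruned)
```

Passing the model-specific operations as callables keeps the loop independent of any particular framework; in practice they would wrap the channel selection, retraining and validation code.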
Since pruning is a very active research topic, there is a wide range of techniques to select the appropriate channels to be removed [31]. In this paper, we aim to study how relatively simple approaches translate to operational use-cases. We thus only implemented and compared two different pruning techniques, based on the L2-norm [13] and the Geometric Median (GM) [14]. The L2-norm based pruning technique is straightforward: we compute the L2-norm of the weights of each prunable convolutional channel in the network. The channels with the lowest L2-norm are considered to be the least important and are thus pruned. However, the scale of the L2-norm can vary significantly depending on the depth of the convolution in the network. Inspired by Molchanov et al. [13], we further normalize the L2-norm of each channel in a convolution:
$$I_{L2}(w_i) = \frac{\lVert w_i \rVert_2}{\sqrt{\sum_{w_j \in W} \lVert w_j \rVert_2^2}}$$
where $w_i$ and $w_j$ are the weights of a single channel in a convolution $W$. This allows us to compare the importance of channels in different convolutions and thus allows us to prune a certain percentage of the channels of all convolutions in the network.
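The following Python sketch illustrates why such a per-convolution normalization matters when ranking channels across the whole network. It assumes the normalization divides each channel norm by the root of the sum of squared channel norms in the same convolution, in the spirit of Molchanov et al.; the function names and the toy weights are hypothetical.

```python
import math


def l2_norm(channel):
    """L2-norm of the flattened weights of one channel."""
    return math.sqrt(sum(w * w for w in channel))


def normalized_l2_importance(conv):
    """conv: list of channels, each a flat list of weights."""
    norms = [l2_norm(ch) for ch in conv]
    scale = math.sqrt(sum(n * n for n in norms)) or 1.0  # guard all-zero layers
    return [n / scale for n in norms]


def select_lowest(convs, fraction):
    """Rank the channels of all convolutions together and return the
    (layer, channel) indices of the least important fraction."""
    scored = [(imp, layer, idx)
              for layer, conv in enumerate(convs)
              for idx, imp in enumerate(normalized_l2_importance(conv))]
    scored.sort()
    k = int(fraction * len(scored))
    return [(layer, idx) for _, layer, idx in scored[:k]]


# Two toy convolutions whose weights live on very different scales.
convs = [
    [[3.0, 4.0], [0.3, 0.4]],                   # channel norms 5 and 0.5
    [[30.0, 40.0], [30.0, 40.0], [6.0, 8.0]],   # channel norms 50, 50 and 10
]
selected = select_lowest(convs, fraction=0.4)
print(selected)
```

With raw norms, both channels of the first convolution would rank lowest simply because that layer has smaller weights; after normalization, the weakest channel of each convolution is selected instead.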
In 2019, He et al. [14] discussed two issues related to norm-based pruning:
Small Norm Deviation: The different norms might be concentrated in a small interval, which makes it hard to select an optimal threshold for pruning channels.
Large Minimum Norm: The channels with minimum norm may not be arbitrarily small. In this case, channels which we consider least important might still contain relevant information and pruning them might have negative consequences.
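In order to solve these problems, they propose a method that prunes the channels closest to the geometric median of all channels in a convolution. After optimization, the final formula for the importance of a channel in a convolution is given as:
$$I_{GM}(w_i) = \sum_{w_j \in W} \lVert w_i - w_j \rVert_2$$
where $w_i$ and $w_j$ are the weights of individual channels in a convolution $W$. Channels with a low value lie close to the other channels of that convolution and are therefore considered redundant.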
A potential disadvantage of this technique is that it becomes impossible to compare channels of different convolutions, as they have vastly different importance values. As such, the geometric median can only be used on a per-layer basis and thus we can only prune all the layers in our network uniformly. In order to mitigate this, in this paper we propose to combine both L2-based pruning and GM-based pruning (e.g., pruning 5% with L2 and 5% with GM).
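A per-layer combination of both criteria could look like the sketch below. It is an assumption-laden illustration rather than the method evaluated in this paper: it takes the GM importance of a channel to be the sum of its Euclidean distances to all other channels in the convolution, applies both criteria within a single layer, and uses hypothetical function names and toy weights.

```python
import math


def gm_importance(conv):
    """Sum of Euclidean distances from each channel to all channels of the
    convolution; a small value means the channel lies near the geometric
    median and is likely redundant."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return [sum(dist(ch, other) for other in conv) for ch in conv]


def combined_selection(conv, n_l2, n_gm):
    """Select n_l2 channels by lowest L2-norm, then n_gm of the remaining
    channels by lowest GM importance. Returns sorted channel indices."""
    norms = [math.sqrt(sum(w * w for w in ch)) for ch in conv]
    by_norm = sorted(range(len(conv)), key=lambda i: norms[i])
    l2_pruned = by_norm[:n_l2]

    remaining = [i for i in range(len(conv)) if i not in l2_pruned]
    gm = gm_importance([conv[i] for i in remaining])
    by_gm = sorted(range(len(remaining)), key=lambda j: gm[j])
    gm_pruned = [remaining[j] for j in by_gm[:n_gm]]
    return sorted(l2_pruned + gm_pruned)


# Toy convolution with five channels of two weights each.
conv = [
    [0.1, 0.0],  # tiny norm: removed by the L2 criterion
    [1.0, 1.0],  # sits between channels 2 and 3: redundant according to GM
    [1.0, 1.2],
    [1.0, 0.8],
    [4.0, 4.0],  # large and far from the others: kept by both criteria
]
to_prune = combined_selection(conv, n_l2=1, n_gm=1)
print(to_prune)
```

The two criteria pick different channels here: the norm flags the channel with negligible weights, while the geometric median flags a channel that is well represented by its neighbours despite having a sizeable norm.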