Article

Compression of Deep Convolutional Neural Network Using Additional Importance-Weight-Based Filter Pruning Approach

1 Fraunhofer IIS, Fraunhofer Institute for Integrated Circuits IIS, 91054 Erlangen, Germany
2 CIML Group, Biophysics, University of Regensburg, 93040 Regensburg, Germany
3 Clinic of Rheumatology, University Hospital Erlangen, 91054 Erlangen, Germany
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(21), 11184; https://doi.org/10.3390/app122111184
Submission received: 26 September 2022 / Revised: 21 October 2022 / Accepted: 26 October 2022 / Published: 4 November 2022

Abstract

The success of the convolutional neural network (CNN) comes with a tremendous growth of diverse CNN structures, making them hard to deploy on resource-limited platforms. These over-sized models contain a large number of filters in the convolutional layers, which are responsible for almost 99% of the computation. The key question here arises: Do we really need all those filters? By removing entire filters, the computational cost can be significantly reduced. Hence, in this article, a filter pruning method, a process of discarding a subset of unimportant or weak filters from the original CNN model, is proposed, which alleviates the storage and run-time shortcomings of over-sized CNN architectures. The proposed filter pruning strategy compresses the model by assigning additional importance weights to the convolutional filters; these importance weights help each filter learn its responsibility and contribute more efficiently. We adopted different initialization strategies to learn more about the filters from different aspects and prune accordingly. Furthermore, unlike existing pruning approaches, the proposed method uses a pre-defined error tolerance level instead of a pruning rate. Extensive experiments on two widely used image segmentation datasets, Inria and AIRS, and two well-known segmentation CNN models, TernausNet and the standard U-Net, verify that our pruning approach can efficiently compress CNN models with almost negligible or no loss of accuracy. For instance, our approach reduced 85% of all floating-point operations (FLOPs) of TernausNet on Inria with a negligible drop of 0.32% in validation accuracy. The compressed network is six-times smaller and almost seven-times faster (on a cluster of GPUs) than the original TernausNet, while the drop in accuracy is less than 1%. Moreover, we reduced the FLOPs by 84.34% on the AIRS dataset for TernausNet without significantly deteriorating the output performance. The proposed pruning method effectively reduces the number of FLOPs and parameters of a CNN model while almost retaining the original accuracy, and the resulting compact model can be deployed on any embedded device without specialized hardware. We show that the performance of the pruned CNN model is very similar to that of the original unpruned CNN model. We also report numerous ablation studies to validate our approach.

1. Introduction

Different from traditional hand-crafted feature extraction methods [1,2,3,4], which require considerable engineering skill and domain expertise, the convolutional neural network (CNN), a typical deep feature learning method, automatically extracts features at various scales. Supported by immense technological advancements, CNN models have grown deeper and wider. This growth in complexity has significantly strengthened the learning ability of CNNs, which renders these models suitable for numerous computer vision applications, such as object detection, object recognition, image classification, semantic segmentation, and many others [5,6,7,8,9,10,11]. Despite their remarkable success, these deep models come at the cost of a vast number of parameters and heavy computations, hampering their deployment on mobile or embedded devices with limited computational resources. This problem has therefore attracted significant attention from the research community in recent years. A prominent solution is to compress the network without significantly degrading the model's performance. To compress a deep model, numerous approaches have been suggested, including knowledge distillation [12,13,14], network quantization [15], lightweight architectures [16,17,18,19], low-rank approximations [20,21], and network pruning [22,23,24,25]. In knowledge distillation, the knowledge gathered by a large network (called the Teacher network) trained on a large dataset is transferred to a smaller network (also known as the Student network). The Student model tries to mimic the Teacher model by learning its generalization capability, yielding a compact network with similar or even higher accuracy. However, acquiring a well-trained Teacher model and transferring its rich knowledge to a Student are very challenging in knowledge distillation. Network quantization reduces the number of bits occupied by the weights to obtain a compact network model; the weights can be quantized to 16, 8, 4, or even 1 bit. Quantization can be applied either during or after the training of the neural network, and the relevancy among the network parameters is completely disregarded. Despite achieving remarkable compression, these methods risk accuracy degradation. More recent attempts have focused on factorization-based methods, for instance replacing standard convolution modules with depthwise separable convolution modules, to obtain lightweight neural network architectures. Many classic lightweight architectures have been developed in this way, such as MobileNet, SqueezeNet [26], SqueezeNext [27], and ESPNet. Although these models achieve significant performance, they tend to have a strong domain dependency, which makes it difficult to balance model size and performance. Low-rank approximation methods obtain a compressed model by approximating the original weight matrix with a low-rank matrix; they discard weak network components layer by layer. Singular-value decomposition (SVD) is one of the popular low-rank approximation methods. These methods are well suited to fully connected layers, since the operations in these layers can be expressed easily as matrix operations; however, they are less suitable for convolutional layers, where most of the computation of the network takes place.
Despite the significant compression abilities of the above-mentioned approaches, network pruning has been widely explored in recent years due to its simplicity and effectiveness. Network pruning compresses the model by removing unwanted parameters in either an unstructured or a structured manner. Unstructured pruning removes individual connections with small weight magnitudes from the model, resulting in an irregular sparse CNN structure [15,22,28,29,30]; specialized software and hardware platforms [31,32,33,34,35,36] are then required to store and accelerate the resulting sparse network. In contrast, structured pruning removes weak or unimportant filters from the model without corrupting its structure and thus avoids the need for specialized software and hardware platforms [37,38,39,40,41,42,43,44,45,46,47]. For filter pruning, the critical question is how to identify the "unimportant" filters intended for removal. Many criteria have been proposed, including magnitude-based norms, Taylor expansion, gradient-based norms, and so on [24,25,39,40,48,49,50,51]. Identifying informative filters in a large network is very challenging since the search space to explore is enormous. Hence, some attempts have been made based on evolutionary optimization methods, which offer alternative filter pruning approaches for network compression. These approaches [14,23,29,37,52,53,54,55] search for optimal solutions (for instance, selecting informative filters while rejecting weak ones) and can consequently lead to a compressed network. However, despite their satisfying performance, these approaches are still not computationally efficient at runtime.
In this work, we concentrate on the removal of unimportant filters from a deep CNN. Convolutional filters generally consume fewer parameters than fully connected layers; still, they account for most of the FLOPs. The key open question here is: Do we really need all those filters? If not, it appears reasonable to discard unimportant filters in order to lower the computational burden. This raises a second question: How do we decide the importance of filters? A CNN contains a vast number of filters, many of which are alike or shifted versions of each other. These filters produce different feature maps, highlighting different aspects of the distinctive information in the input data. However, these diverse features may not contribute equally to the final performance. Therefore, it is reasonable to assign an additional weight to each filter, which helps the different features contribute efficiently. Hence, in this paper, a new pruning approach is proposed that applies external weights to each filter in order to discard the weak filters from the network. More importantly, these extra weights (we use the term importance weights throughout the paper) allow each filter to learn its responsibility without interfering with the weight update process in back propagation. The approach uses iterative pruning, where the importance weights decide which filters are pruned at each iteration. In addition, an error tolerance level is introduced, which keeps the difference between the loss incurred by the unpruned network and that of the pruned network below a specific limit. Unlike existing pruning approaches, the proposed approach does not need a pre-defined pruning rate; rather, it uses the error tolerance level to determine which and how many filters to prune. Existing works on magnitude-based filter pruning consider the weight values of either individual filters or groups of filters. These approaches have not shown much of a drop in performance, thus validating their effectiveness for model compression. However, it has been observed that they fail to recover the full performance after fine-tuning with few iterations. Note that, instead of pruning a pre-trained model (trained on a different dataset), we pruned models trained from scratch on the target dataset and obtained quite comparable performance. Closely related to our work, Singh et al. [56] pruned weak filters using an adaptive filter pruning rate, where the importance of the filters is calculated using their l1-norm. In contrast, our approach enforces importance weights and uses them to identify the weak filters. The main contributions of the proposed filter pruning work can be briefly summarized as follows:
  • Appropriate importance weights are assigned to different filters, which influence the contributions of different feature maps to the final decision.
  • The effect of different weight initialization strategies such as random, chaotic, He, LeCun, and Xavier initializations on the decision performance is analyzed.
  • The pruned networks are assessed on TernausNet and U-Net with the Inria and AIRS datasets for the image segmentation application. Extensive experiments conducted on TernausNet and U-Net validate the effectiveness of the proposed approach in comparison with the unpruned networks.
The remaining part of the paper is organized as follows: In Section 2, the proposed framework for compressing neural networks is discussed in detail. The experimental studies are discussed in Section 3. Finally, Section 4 provides concluding remarks.

2. Proposed Methodology

The proposed filter pruning is designed to learn more about the filters from different aspects and to prune accordingly. The overall process is shown in Figure 1. It is an iterative pruning approach in which additional importance weights are applied to identify unimportant filters. The filter pruning is performed so as to minimize the damage to the original network and obtain a compact network.
Different filters capture various kinds of features and thus preserve additional useful information for further processing. Under this assumption, an additional importance weight is applied to each filter. These importance weights help the model learn efficiently and maintain the diversity among different feature maps by constraining the representation error. Certain filters in a convolutional layer contribute little or not at all; these filters are less reliable than others for any computer vision task. In these cases, it is assumed that the importance weights assist the filters in capturing the discriminant characteristics of the input data and in contributing efficiently to the final decision. Furthermore, similar to the traditional convolutional layer, the initialization of the importance weights influences the convergence process and learning ability of the model. Hence, to analyze the impact of initialization, we adopted and compared various weight initialization methods. The stochastic nature of the weights and the dynamic behavior of the back propagation algorithm cause a CNN with random weight initialization to be trained in an uncontrollable manner, leading to highly redundant filters with a non-uniform contribution to the overall performance of the model. Therefore, we used different methods for initializing the additional importance weights: random uniform initialization, chaotic initialization, i.e., the logistic map [57], He initialization [58], LeCun uniform initialization [59], and Xavier initialization [60]. Initialization also depends on the activation function used in the network; for example, with ReLU activation, He initialization is superior to Xavier-Glorot initialization. However, the additional importance weights do not interfere with the process of gradient descent; hence, this consideration is less relevant in our approach.
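To make the role of these strategies concrete, the following sketch shows one plausible way to draw a vector of importance weights (one scalar per filter). The helper name, the fan_in/fan_out choices, the logistic-map seed of 0.7, and the handling of negative draws are our assumptions and are not taken from the paper.

```python
import math
import torch

def init_importance_weights(num_filters, fan_in, strategy="random", seed=0):
    """Draw one importance weight per filter (hypothetical helper; the exact
    formulas and any rescaling used in the paper may differ)."""
    g = torch.Generator().manual_seed(seed)
    if strategy == "random":       # random uniform in [0, 1)
        return torch.rand(num_filters, generator=g)
    if strategy == "chaotic":      # logistic map x_{n+1} = r * x_n * (1 - x_n), r = 4
        w, x = torch.empty(num_filters), 0.7   # 0.7 is an assumed seed in (0, 1)
        for i in range(num_filters):
            x = 4.0 * x * (1.0 - x)
            w[i] = x
        return w
    if strategy == "he":           # He normal: std = sqrt(2 / fan_in)
        return torch.randn(num_filters, generator=g) * math.sqrt(2.0 / fan_in)
    if strategy == "lecun":        # LeCun uniform: limit = sqrt(3 / fan_in)
        limit = math.sqrt(3.0 / fan_in)
        return (2 * torch.rand(num_filters, generator=g) - 1) * limit
    if strategy == "xavier":       # Xavier uniform; fan_out taken here as num_filters
        limit = math.sqrt(6.0 / (fan_in + num_filters))
        return (2 * torch.rand(num_filters, generator=g) - 1) * limit
    raise ValueError(f"unknown strategy: {strategy}")
```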
Let $M$ be the original network model and $M'$ the compact network. The original network $M$ with $L$ layers contains $N$ convolutional filters in each layer. The filter parameters of the $l$-th layer are denoted as $F_l \in \mathbb{R}^{c \times N \times h \times \omega}$, where $c$ is the number of kernels per filter and $h \times \omega$ denotes the kernel size. $F_l$ is composed of $N$ 3D filters $W_l^i \in \mathbb{R}^{c \times h \times \omega}$, $i \in \{1, 2, \ldots, N\}$. In the proposed filter pruning model, we introduced an additional importance weight $w_i$ for each convolutional filter $F_i$. The aim of the proposed filter pruning was to iteratively prune unimportant filters. We assume that the larger the importance weight $w_i$, the more the convolutional filter $F_i$ contributes to the decision-making process, and vice versa. The initial importance weights are set using one of the above-mentioned initialization strategies and are updated in every subsequent iteration $t + 1$ as
$$ w_i^{t+1} = w_i^t \, \frac{e^{m_i^t} - 1}{e^{m_i^t} + 1} \qquad (1) $$

where, for a filter with $c$ kernels, $|k_j|$ denotes the sum of the absolute weights of its $j$-th 2D kernel, so that $m_i^t$ is the sum of the absolute values of all weights of filter $F_i$:

$$ m_i^t = \sum_{j=1}^{c} |k_j| $$
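The magnitude term and the update of Equation (1), as reconstructed above, can be written compactly in PyTorch. The sketch below assumes the usual weight-tensor layout (N, c, h, w) and one importance weight per filter.

```python
import torch

def update_importance_weights(w, conv_weight):
    """Sketch of Equation (1): w_i <- w_i * (e^{m_i} - 1) / (e^{m_i} + 1),
    with m_i the sum of absolute weights of filter i (conv_weight: (N, c, h, w))."""
    m = conv_weight.detach().abs().sum(dim=(1, 2, 3))  # m_i = sum_j |k_j|
    return w * torch.tanh(m / 2.0)                     # tanh(m/2) == (e^m - 1) / (e^m + 1)
```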
Let $E(F)$ denote the cost function, or loss, of the original unpruned model $M$ and $E(F')$ the loss of the pruned model $M'$, where $F'$ is the set of retained filters.
The block diagram for the proposed filter pruning approach is presented in Figure 2. The overall procedure for the proposed filter pruning is as follows:
1. Train the original unpruned model $M$ from scratch and calculate its loss $E(F)$.
2. Add an importance weight to each convolutional filter using one of the weight initialization strategies: random uniform, chaotic (logistic map), He, LeCun uniform, or Xavier initialization.
3. Sort the filters based on their additional importance weights.
4. Prune the filters whose additional importance weights fall below a pre-defined threshold, eliminate the filters in the next convolutional layer that correspond to the pruned feature maps, and obtain a compact model $M'$ from the remaining filters.
5. Compute the loss $E(F')$ of the pruned model $M'$.
6. Check whether $|E(F) - E(F')| < \varepsilon$, where $\varepsilon$ denotes the error tolerance level:
   i. If the loss difference is less than $\varepsilon$, update the importance weights using Equation (1) and go to Step 3.
   ii. Otherwise, stop the process.
7. Obtain the final pruned model (compact model).
8. Fine-tune the pruned model to recover the loss.
In this way, after pruning and fine-tuning, a compact and lightweight model is generated, which shows approximately the same performance as the unpruned model.
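As a summary, the iterative procedure above can be sketched as follows. Every helper used here (train_model, evaluate_loss, conv_layers, prune_filters_below_threshold, fine_tune) is a hypothetical placeholder for the usual PyTorch training, evaluation, and network-surgery code and is not part of the paper.

```python
def iterative_filter_pruning(model, data, eps=0.5, strategy="random"):
    """Sketch of Steps 1-8 under the assumptions stated above."""
    train_model(model, data)                                   # Step 1: train M from scratch
    base_loss = evaluate_loss(model, data)                     # E(F)
    weights = [init_importance_weights(layer.out_channels,
                                       layer.weight[0].numel(), strategy)
               for layer in conv_layers(model)]                # Step 2: one vector per conv layer
    pruned = model
    while True:
        candidate, kept = prune_filters_below_threshold(pruned, weights)  # Steps 3-4
        loss = evaluate_loss(candidate, data)                             # Step 5: E(F')
        if abs(base_loss - loss) >= eps:                                  # Step 6: tolerance check
            break                                                         # keep last accepted model
        pruned = candidate
        weights = [update_importance_weights(w[k], layer.weight)          # Equation (1), applied to
                   for w, k, layer in zip(weights, kept, conv_layers(pruned))]  # surviving filters
    fine_tune(pruned, data)                                    # Step 8: recover the loss
    return pruned                                              # Step 7: compact model
```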

3. Experiments

To verify the performance of the proposed approach, we conducted extensive experiments for TernausNet [61] and U-Net [62] on two widely used aerial image segmentation datasets, Inria [63] and AIRS [64].

3.1. Experimental Settings

For all experiments, we set ε = 0.5. We also report an ablation study for various values of ε. The constant r of the logistic chaotic map was set to r = 4, since the map is fully chaotic at this value [65,66]. Unless stated otherwise, both TernausNet and U-Net were trained on Inria and AIRS for 30 epochs using the Adam optimizer with a learning rate of 0.0001, an exponential decay rate for the first moment estimates of 0.9, an exponential decay rate for the second moment estimates of 0.999, and a batch size of 16. The models were implemented in the PyTorch framework.
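For reference, this training configuration corresponds to a setup along the following lines; the model, data loader, and the binary cross-entropy placeholder loss are assumptions, since the exact loss function is not stated in this excerpt.

```python
import torch

def train(model, train_loader, epochs=30):
    """Training setup as described above (sketch; `model` is assumed to be a
    TernausNet/U-Net instance, `train_loader` a DataLoader with batch_size = 16,
    and the loss below is only a placeholder)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
    model.train()
    for _ in range(epochs):
        for images, masks in train_loader:
            optimizer.zero_grad()
            loss = torch.nn.functional.binary_cross_entropy_with_logits(model(images), masks)
            loss.backward()
            optimizer.step()
    return model
```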

3.2. Evaluation Metrics

To quantitatively evaluate the segmentation performance of the proposed approach, three commonly used evaluation metrics were adopted: validation loss (Val. loss), validation accuracy (Val. Acc.), and intersection over union (IOU) [61]. To quantify the computational complexity of the proposed approach, two popular resource efficiency metrics were used, namely the number of parameters, to measure the model size, and the FLOPs, to measure the computational cost. We used the same equation to compute the FLOPs as in [67].
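The two kinds of metrics can be sketched as follows; the conv-layer FLOP formula is one common convention (counting a multiply-add as two operations) and may differ in detail from the equation of [67].

```python
import torch

def iou_score(pred_mask, true_mask, eps=1e-7):
    """Intersection over union for binary segmentation masks (0/1 tensors)."""
    pred, true = pred_mask.bool(), true_mask.bool()
    intersection = (pred & true).sum().float()
    union = (pred | true).sum().float()
    return (intersection + eps) / (union + eps)

def conv_flops(c_in, c_out, kernel_size, h_out, w_out):
    """Approximate FLOPs of one convolutional layer (multiply-add counted as two ops)."""
    return 2 * c_in * kernel_size * kernel_size * c_out * h_out * w_out
```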

3.3. Experiments on Inria Dataset

The Inria dataset consists of 180 aerial images in total, captured over ten different cities, covering an area of 810 km² and including different settlements and landscapes. Each image is 5000 × 5000 pixels with a spatial resolution of 0.3 m. In Inria, the ground truth is only given for the training set. For ease of comparison, we split the dataset as described by Maggiori et al. [63] and Bischke et al. [68]: we selected 25 images (the first five images of each of the five cities in the training set) for validation. For the Inria dataset, the training set contains 55,955 images and the validation set encompasses 9025 images. For ease of experimentation, the original images were randomly cropped to a size of 256 × 256. For data augmentation, we followed the same scheme as used in TernausNet [61], applying random vertical and horizontal flips.
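The cropping and flipping described above could be implemented jointly on the image and its mask, for instance as follows. This is a sketch under the assumption that the image is a (3, H, W) float tensor and the mask a (1, H, W) tensor; the paper's exact pipeline may differ.

```python
import torch
import torchvision.transforms as T

def augment(image, mask, crop_size=256):
    """Random crop plus random horizontal/vertical flips, applied jointly so that
    the image and its segmentation mask stay aligned."""
    stacked = torch.cat([image, mask], dim=0)          # (C + 1, H, W)
    stacked = T.RandomCrop(crop_size)(stacked)
    stacked = T.RandomHorizontalFlip(p=0.5)(stacked)
    stacked = T.RandomVerticalFlip(p=0.5)(stacked)
    return stacked[:-1], stacked[-1:]                  # image, mask
```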
We pruned TernausNet and U-Net using the proposed approach. In this study, we reproduced both models and trained them from scratch on the Inria and AIRS datasets for 30 epochs; these served as the baseline models. Consequently, the original accuracy of each network may differ slightly from the values reported in the literature. The importance weights were then applied to each filter using the different weight initialization strategies. We iteratively pruned the least important filters in each pruning iteration. The pruning process was stopped when the difference between the loss incurred by the unpruned model and the loss incurred by the pruned model exceeded the desired error tolerance. To restore the original performance, an additional fine-tuning of 15 epochs was applied.
We attempted to illustrate the significance of the different weight initialization strategies, namely random, logistic chaotic map, He, LeCun, and Xavier, via a series of experiments on the Inria dataset. We ran the experiments three times and report the accuracies as "mean ± std". The pre-defined threshold was set to 1.2-times the mean of the importance weights. Accordingly, the percentage of filters pruned from TernausNet and U-Net varied with the initialization strategy used. Table 1 shows the pruning performance of TernausNet and U-Net for the different weight initialization strategies.

3.3.1. TernausNet on Inria

TernausNet is an improved version of the U-Net architecture (for more details about U-Net, refer to Section 3.3.2). It is composed of encoder and decoder networks, where VGG11 without its fully connected layers is adopted as the encoder; the fully connected layer of the VGG11 network is replaced with a single convolutional layer of 512 channels. In the decoder, transposed convolutional layers are concatenated with information from the encoder. The pre-trained TernausNet (baseline model) has 32.3 billion (B) FLOPs and 22.9 million (M) parameters with a 96% Val. Acc. As reported in Table 1, a reduction in the number of FLOPs and the number of parameters was observed with the proposed approach across all initialization strategies. For instance, when compressing TernausNet with the proposed approach using random initialization, the FLOPs were reduced by 84.98% and the number of parameters decreased by 83.54%, while the drop in accuracy was negligible. Moreover, using He and Xavier initializations, our approach reduced the FLOPs by 64.18% and the number of parameters by 73.20% with even a 0.04% improvement in accuracy; however, the reduction in FLOPs was 20.8% less than with the random initialization strategy. Overall, the FLOPs were reduced by more than 64% while maintaining the baseline performance, which indicates that the proposed approach is able to reduce the FLOPs by a large amount with all initialization strategies while keeping the Val. Acc. almost intact.

3.3.2. U-Net on Inria

U-Net is a U-shaped architecture consisting of encoder-decoder paths [62]. The encoder captures contextual, low-resolution features from the input data while reducing the spatial dimensions (downsampling) and increasing the number of channels. The pre-trained U-Net (baseline model) has 54.6 B FLOPs and 31 M parameters with 96.13% Val. Acc. We can observe that the proposed approach strongly reduced the number of FLOPs while keeping a similar accuracy and achieving a comparable reduction in the number of parameters. As shown in Table 1, as expected, the random initialization strategy reduced the FLOPs more strongly than any other initialization strategy, whereas the drop in the Val. Acc. was negligible. When compressing U-Net with the chaotic initialization strategy, the FLOPs were reduced by 81.28%, while the drop in accuracy was only 0.21%, which is slightly better than with the random initialization strategy. The smallest drop in the Val. Acc., just 0.04%, was obtained with the LeCun initialization strategy; however, the FLOPs were only reduced by 74.46%, which is 9.34% less than with the random initialization strategy, indicating that many important or discriminative filters were retained. Moreover, the error performance of the pruned models was also comparable to that of the baseline models.
Based on the performance on TernausNet and U-Net, we can say that our pruning approach can achieve promising results under all initialization strategies and obtain a compact network model.

3.3.3. Comparison with Other Filter Pruning Approaches

We compared our approach with several state-of-the-art pruning criteria, briefly described below, to show the effectiveness of the importance weights as a pruning criterion in deep CNNs:
i. l1-norm: A score is computed for each $i$-th filter as $s_i = \|F_i\|_1 = \sum_{j=1}^{c} |k_j|$. Filters with a low score are considered weak and, hence, pruned [50] (see the sketch after this list).
ii. Random pruning: Filters are removed randomly [41].
iii. Entropy-based pruning: The filter importance is calculated based on Shannon's entropy measure. If a filter has low entropy, this filter is considered unimportant and, consequently, removed [40].
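As a reference for criterion (i), the l1-norm score and a fixed-rate selection can be sketched as follows; the 50% rate matches the comparison setting used below, and the helper names are ours.

```python
import torch

def l1_scores(conv_weight):
    """l1-norm filter scores s_i = ||F_i||_1 for a weight tensor of shape (N, c, h, w)."""
    return conv_weight.detach().abs().sum(dim=(1, 2, 3))

def keep_indices(scores, pruning_rate=0.5):
    """Indices of the filters kept when the lowest-scoring fraction is pruned."""
    n_keep = scores.numel() - int(pruning_rate * scores.numel())
    return torch.topk(scores, n_keep).indices
```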
To allow a fair comparison with the state-of-the-art methods, we report the following measures under the same pruning rate of 50%: Val. loss, drop in Val. Acc., reduction in FLOPs, and reduction in the number of parameters. Table 2 presents the results obtained on the Inria dataset. Evidently, the proposed approach hardly worsened the original accuracy (a drop of −0.04% to 0.32%) while removing 72.25–83.68% of the network parameters and 64.18–84.98% of the FLOPs. When compressing TernausNet, the proposed approach with random initialization showed the largest FLOP reduction rate, almost 85%, in comparison with the other filter pruning approaches, with a drop of only 0.32% in the Val. Acc. Compared with random filter pruning, our approach reduced 10.68% more FLOPs with only a slightly larger drop in accuracy. When pruning TernausNet by entropy pruning, the accuracy drop was smaller than that of the proposed approach with random or chaotic initialization; unfortunately, at the same time, the reduction in FLOPs was comparatively small, indicating that many filters were retained despite their minor influence on the performance of the model. The l1-norm induced a sparse representation and strongly diminished the FLOPs, yet the resulting performance was poor. This result exposes a weakness of the norm-based criterion: it may prune too many important filters, which is harmful for model generalization. Similar observations hold for U-Net, where the proposed approach outperformed the competing methods with a larger reduction in the FLOPs and parameters without sacrificing much model accuracy. Based on the overall performance, we can conclude that the proposed approach can obtain promising results with all initialization strategies.

3.4. Experiments on AIRS Dataset

To further validate the effectiveness of the proposed approach, we also performed experiments on a larger segmentation dataset, AIRS. The AIRS dataset consists of 951 aerial images captured over the area of Christchurch, the largest city on the South Island of New Zealand. Each image is 10,000 × 10,000 pixels with a spatial resolution of 0.075 m. In AIRS, the ground truth is only given for the training set. For ease of comparison, we split the dataset as described by Chen et al. [64]. For the AIRS dataset, the training set contains 309,377 images and the validation set encompasses 33,934 images. For ease of experimentation, the original images were randomly cropped to a size of 526 × 526 and then resized to 256 × 256. For data augmentation, we followed the same scheme as used in TernausNet [61], applying random vertical and horizontal flips. Table 3 presents our pruning results for TernausNet and U-Net on the larger AIRS dataset under the different weight initialization strategies.

3.4.1. TernausNet on AIRS

As Table 3 shows, the reduction in the FLOPs and in the number of network parameters achieved by the proposed approach is evident for all initialization strategies. For instance, when compressing TernausNet with the proposed approach using random initialization, the FLOPs were reduced by 84.34% and the parameters by 83.50%, while the drop in accuracy stayed moderate at 0.05%. In contrast, He initialization showed the same drop in accuracy with a 75.84% reduction in the FLOPs, which is 8.5% less than with the random initialization strategy, indicating that more filters were retained. LeCun initialization showed an almost similar performance to that of He initialization with a slightly smaller reduction in the FLOPs. When compressing TernausNet with the chaotic initialization strategy, the FLOPs were reduced by 77.37%, while the drop in accuracy amounted to 0.07%, which is close to that of the random initialization strategy. Additionally, using Xavier initialization, our approach reduced the FLOPs by 75.84% and the number of parameters by 75.42% with just a 0.04% drop in accuracy; however, the reduction in the FLOPs was 8.5% less than with the random initialization strategy. Overall, the FLOPs were significantly reduced while maintaining the baseline performance, which indicates that the proposed approach is able to reduce the FLOPs by a large amount with all initialization strategies while keeping the Val. Acc. almost intact.

3.4.2. U-Net on AIRS

As shown in Table 3, as expected, the random initialization strategy reduced the FLOPs more strongly than any other initialization strategy, whereas the drop in the Val. Acc. remained negligible. Chaotic initialization showed an almost similar performance, with a drop of 0.06% in accuracy and an 81.46% reduction in the FLOPs, which is 2.17% less than with the random initialization strategy. LeCun initialization showed a similar performance to that of He initialization with a slightly smaller reduction in the FLOPs. Furthermore, using Xavier initialization, our approach reduced the FLOPs by 74.68% and the number of parameters by 75.47% with only a 0.02% drop in accuracy; however, the reduction in the FLOPs was 8.95% less than with the random initialization strategy.
Based on the performance of TernausNet and U-Net on a large segmentation dataset, we can say that our pruning approach can achieve promising results under all initialization strategies and can obtain a compact network model.

3.4.3. Comparison with Other Filter Pruning Approaches

We compared our approach with several state-of-the-art methods to show the effectiveness of the importance weights as a pruning criterion in deep CNNs. Table 4 presents the results obtained on the AIRS dataset. Our filter pruning approach led to a stronger reduction in the FLOPs and network parameters while only moderately changing the accuracy and loss. We observed that, when pruning TernausNet, random filter pruning showed an almost similar accuracy drop to that of the proposed approach with random initialization; however, our approach is superior because random filter pruning reduced the FLOPs by 74.73%, which is 8.9% less than what was achieved by our approach. One interesting observation is that the l1-norm failed again, delivering much inferior results compared with the baseline and the competing methods: despite strongly reducing the FLOPs, the large drop in accuracy reflected the degradation in network performance caused by pruning important filters. Similarly, our approach also strongly reduced the FLOPs when pruning U-Net with the random initialization strategy. Moreover, our approach, after removing more than 74% of the network parameters, achieved better accuracy than the state-of-the-art methods.

3.5. Ablation Study

We conducted an extensive ablation study to further analyze different settings of our approach. For simplicity and reliability, all the following experiments were conducted on the Inria dataset for TernausNet.

3.5.1. Influence of Different Initialization Strategies on Performance

To analyze how the reduction in the FLOPs differs among the initialization strategies, we compared the FLOPs per layer, as listed in Table 5. Although no fixed pruning rate was used, the proposed approach showed an almost identical reduction of the FLOPs in all layers except the last one. It can easily be observed that, with the random initialization strategy, many filters were pruned, resulting in an 84.98% reduction of the FLOPs; more specifically, more than 98% of the FLOPs were removed from all layers except the last one, indicating that many weak or non-discriminative filters were pruned. The chaotic initialization showed a similar behavior. Although He, LeCun, and Xavier initialization resulted in a larger number of retained filters, the reduction in the FLOPs and network parameters was still acceptable. We believe that the random initialization strategy works best among all initialization strategies used in the proposed approach because of its ability to precisely prune less discriminative or weak filters: in effect, more filters are pruned with random initialization than with the other strategies without losing much accuracy.
Table 6 reports the compression and acceleration analysis of the proposed approach under the different initialization strategies on TernausNet. The compression ratio indicates how much the network is compressed and is computed by dividing the number of parameters of the baseline network by that of the pruned network. The acceleration ratio reflects the theoretical computational savings and is computed by dividing the FLOPs of the baseline network by those of the pruned network. As reported in Table 6, the random initialization strategy achieved both the highest acceleration and the highest compression.
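The two ratios are straightforward; for instance, with the TernausNet numbers of Table 1 under random initialization, they reproduce the roughly six-fold compression and almost seven-fold acceleration quoted in the abstract.

```python
def compression_ratio(baseline_params, pruned_params):
    """How many times smaller the pruned network is."""
    return baseline_params / pruned_params

def acceleration_ratio(baseline_flops, pruned_flops):
    """Theoretical speed-up in terms of FLOPs."""
    return baseline_flops / pruned_flops

# TernausNet on Inria, random initialization (Table 1):
print(compression_ratio(22.9e6, 3.77e6))   # ~6.1x
print(acceleration_ratio(32.3e9, 4.85e9))  # ~6.7x
```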
Additionally, we provide the convergence behavior of the proposed approach under the different importance weight initialization strategies (here, the process is said to have converged if the condition in Step 6 of the procedure in Section 2 is no longer satisfied and the pruning process stops). The learning curves of the proposed approach under the different weight initialization strategies are shown in Figure 3. An ascending trend is quite evident in the learning curve for all importance weight initialization strategies explored. This strongly indicates the ability of the importance weights to prune weak filters and obtain a compressed network over the course of the iterations. We observed that chaotic initialization converged faster than all other strategies, whereas LeCun initialization took more iterations to converge. Moreover, the He initialization strategy followed the convergence behavior of the Xavier strategy (we obtained the same results with He initialization as with the Xavier strategy; hence, it is not displayed in Figure 3).

3.5.2. Sensitivity of Hyperparameter ε

We also performed an ablation study on the hyperparameter ε, i.e., the acceptable difference between the loss incurred by the unpruned model and the loss incurred by the pruned model. We experimented with ε = 0.3, 0.35, 0.4, 0.45, 0.5, and the results are reported in Table 7. We observed that setting ε below 0.3 did not degrade the performance but failed to yield a compact model, whereas setting ε above 0.5 caused the process to fail to converge. Since the results obtained by the proposed approach for the values above are rather similar across the initialization strategies, we consider ε = 0.5 a proper choice, which returned steady performance for the different initialization strategies. Therefore, we set ε = 0.5 for all experiments.

3.5.3. Effects of Threshold Selection

In this subsection, we explore the influence of different threshold values on the performance of the pruned model. We tested TernausNet on the Inria dataset with two different threshold values, and the corresponding experimental results are reported in Table 8. Here, the importance weights applied to each filter were used to define the threshold. To study the influence of different thresholds, we set ε = 0.4, as a larger value led to an unstable training process when the mean of the importance weights was used as the pruning threshold. If weak filters were removed by applying a single global threshold, entire layers could be eliminated and the pruned model would lose its ability to classify accurately; therefore, the pruning thresholds were adapted locally for each layer. The thresholds considered are either the mean importance weight of the filters per layer or 1.2-times that mean. As shown in Table 8, using the mean of the importance weights as the threshold provided a good performance under the different initialization strategies, but failed to significantly reduce the FLOPs and the number of parameters. In contrast, defining the pruning threshold as 1.2-times the mean importance weight allowed us to discard a large number of filters while keeping the accuracy intact. For instance, the mean of the importance weights used as the pruning threshold reduced the FLOPs by 74.78% under random initialization, whereas 1.2-times the mean importance weight reduced them by 84.34% under the same initialization strategy, an almost 10% larger reduction. Hence, based on the experimental results, we adopted 1.2-times the mean of the importance weights as the pruning threshold, which provided acceptable performance while significantly reducing the FLOPs.
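The per-layer threshold discussed above can be expressed as a small helper; this is a sketch in which scale = 1.0 gives the plain-mean variant and scale = 1.2 the adopted one.

```python
import torch

def filters_below_threshold(importance_weights, scale=1.2):
    """Boolean mask of the filters in one layer whose importance weight falls
    below `scale` times the layer's mean importance weight."""
    threshold = scale * importance_weights.mean()
    return importance_weights < threshold
```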

4. Conclusions

In this paper, a novel filter pruning method was proposed to obtain a compact and lightweight CNN model. In the proposed approach, importance weights are applied to each convolutional filter, which help to select the filters that learn more efficiently and contribute more effectively. We analyzed various weight initialization strategies for discarding weak filters. Our approach ensures that the loss of the pruned model does not exceed the pre-defined error tolerance level. Extensive experiments on segmentation models such as U-Net and TernausNet demonstrated the excellent performance of the proposed filter pruning approach. Furthermore, we confirmed that our approach achieved better performance than other state-of-the-art approaches.
Based on the overall performance, we can conclude that our proposed approach can obtain promising results under all initialization strategies with both small and large datasets. The experimental results on two segmentation models and two datasets indicated that our approach can obtain very compact pruned models with better performance than state-of-the-art approaches. However, iteratively checking the loss difference between the unpruned and pruned model takes some time. Furthermore, our study involves two hyperparameters, and selecting appropriate values for them is crucial. Therefore, in future work, we will investigate a faster pruning procedure and a method to automatically determine these hyperparameters based on the network model and dataset. Additionally, we would like to explore the benefits of the proposed pruning method for classification tasks on larger datasets, such as ImageNet.

Author Contributions

S.S.S. and T.G.: formulated the research goal, designed the methodology, and wrote the main manuscript text; M.W.: implemented and executed all the experiments; S.G.: prepared Figure 1 and Figure 3 and data curation; N.H.: planned and coordinated the research activity; E.W.L.: analyzed the results. All authors reviewed the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received financial support for research and publication from the European Research Consortium for Informatics and Mathematics (ERCIM) fellowship program.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank the Fraunhofer Institute for Integrated Circuits (IIS) for providing the infrastructure for carrying out this research work and the European Research Consortium for Informatics and Mathematics (ERCIM) for the award of a Research Fellowship.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chen, H.; Fang, M.; Yijia, C.; Yijun, X.; Tao, C. A Hyperspectral Image Classification Method Using Multifeature Vectors and Optimized KELM. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 2781. [Google Scholar] [CrossRef]
  2. Hasan, A.M.; Shin, J. Online Kanji Characters Based Writer Identification Using Sequential Forward Floating Selection and Support Vector Machine. Appl. Sci. 2022, 12, 10249. [Google Scholar] [CrossRef]
  3. Sawant, S.S.; Manoharan, P. Unsupervised band selection based on weighted information entropy and 3D discrete cosine transform for hyperspectral image classification. Int. J. Remote Sens. 2020, 41, 3948–3969. [Google Scholar] [CrossRef]
  4. Song, Y.; Cai, X.; Zhou, X.; Zhang, B.; Chen, H.; Li, Y.; Deng, W.; Deng, W. Dynamic hybrid mechanism-based differential evolution algorithm and its application. Expert Syst. Appl. 2023, 213, 118834, ISSN 0957-4174. [Google Scholar] [CrossRef]
  5. Roy, A.M. Adaptive transfer learning-based multiscale feature fused deep convolutional neural network for EEG MI multiclassification in brain–computer interface. Eng. Appl. Artif. Intell. 2022, 116, 105347, ISSN 0952-1976. [Google Scholar] [CrossRef]
  6. Pius, K.; Li, Y.; Agyekum, E.A.; Zhang, T.; Liu, Z.; Yamak, P.T.; Essaf, F. SD-UNET: Stripping down U-Net for Segmentation of Biomedical Images on Platforms with Low Computational Budgets. Diagnostics 2021, 10, 110. [Google Scholar] [CrossRef] [Green Version]
  7. Yaohui, L.; Gross, L.; Li, Z.; Li, X.; Fan, X.; Qi, W. Automatic Building Extraction on High-Resolution Remote Sensing Imagery Using Deep Convolutional Encoder-Decoder with Spatial Pyramid Pooling. IEEE Access 2019, 7, 128774–128786. [Google Scholar] [CrossRef]
  8. Lawal, M.O. Tomato detection based on modified YOLOv3 framework. Sci. Rep. 2021, 11. [Google Scholar] [CrossRef]
  9. Wu, Y.; Wan, G.; Liu, L.; Wei, Z.; Wang, S. Intelligent Crater Detection on Planetary Surface Using Convolutional Neural Network. In Proceedings of the 2021 IEEE 5th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Chongqing, China, 12–14 March 2021; pp. 1229–1234. [Google Scholar]
  10. Zhang, M.; Li, W.; Du, Q. Diverse Region-Based CNN for Hyperspectral Image Classification. IEEE Trans. Image Process. 2018, 27, 2623–2634. [Google Scholar] [CrossRef]
  11. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: Redesigning Skip Connections to Exploit Multiscale Features in Image Segmentation. arXiv 2019, 39, 1856–1867. [Google Scholar] [CrossRef]
  12. Yu-Wei, H.; Leu, J.; Faisal, M.; Prakosa, S.W. Analysis of Model Compression Using Knowledge Distillation. IEEE Access 2022, 10, 85095–85105. [Google Scholar] [CrossRef]
  13. Wang, Z.; Lin, S.; Xie, J.; Lin, Y. Pruning Blocks for CNN Compression and Acceleration via Online Ensemble Distillation. IEEE Access 2019, 7, 175703–175716. [Google Scholar] [CrossRef]
  14. Zhou, Y.; Yen, G.G.; Yi, Z. Evolutionary Shallowing Deep Neural Networks at Block Levels. IEEE Trans. Neural Netw. Learn. Syst. 2021, 1–13. [Google Scholar] [CrossRef] [PubMed]
  15. Song, H.; Mao, H.; Dally, W.J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In Proceedings of the 4th International Conference on Learning Representations, ICLR 2016—Conference Track Proceedings, San Juan, Puerto Rico, 2–4 May 2016; pp. 1–14. [Google Scholar]
  16. Dong, Z.; Zhang, R.; Shao, X.; Kuang, Z. Learning Sparse Features with Lightweight ScatterNet for Small Sample Training. Knowl. Based Syst. 2020, 205, 106315. [Google Scholar] [CrossRef]
  17. Andrew, H.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint 2017, arXiv:1704.04861. [Google Scholar]
  18. Luo, J.H.; Zhang, H.; Zhou, H.Y.; Xie, C.W.; Wu, J.; Lin, W. ThiNet: Pruning CNN Filters for a Thinner Net. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 2525–2538. [Google Scholar] [CrossRef]
  19. Mehta, S.; Rastegari, M.; Caspi, A.; Shapiro, L.; Hajishirzi, H. ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Computer Vision—ECCV ECCV 2018. Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2018; Volume 1. [Google Scholar] [CrossRef] [Green Version]
  20. Chen, Z.; Chen, Z.; Lin, J.; Liu, S.; Li, W. Deep Neural Network Acceleration Based on Low-Rank Approximated Channel Pruning. IEEE Trans. Circuits Syst. I Regul. Pap. 2020, 67, 1232–1244. [Google Scholar] [CrossRef]
  21. Swaminathan, S.; Garg, D.; Kannan, R.; Andres, F. Sparse Low Rank Factorization for Deep Neural Network Compression. Neurocomputing 2020, 398, 185–196. [Google Scholar] [CrossRef]
  22. Babak, H.; Stork, D.G.; Wolff, G.J. Optimal Brain Surgeon and General Network Pruning. In Proceedings of the IEEE International Conference on Neural Networks, Nagoya, Japan, 25–29 October 1993; pp. 293–299. [Google Scholar]
  23. Wu, T.; Li, X.; Zhou, D.; Li, N.; Shi, J. Differential Evolution Based Layer-Wise Weight Pruning for Compressing Deep Neural Networks. Sensors 2021, 21, 880. [Google Scholar] [CrossRef]
  24. Xu, Y.; Fang, Y.; Peng, W.; Wu, Y. An Efficient Gaussian Sum Filter Based on Prune-Cluster-Merge Scheme. IEEE Access 2019, 7, 150992–150995. [Google Scholar] [CrossRef]
  25. Yeom, S.-K.; Seegerer, P.; Lapuschkin, S.; Binder, A.; Wiedemann, S.; Müller, K.-R.; Samek, W. Pruning by Explaining: A Novel Criterion for Deep Neural Network Pruning. Pattern Recognit. 2021, 115. [Google Scholar] [CrossRef]
  26. Forrest, I.; Song, H.; Matthew, M.; Khalid, A.; William, D.; Kurt, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. In International Conference on Learning Representations; IEEE: Piscataway, NJ, USA, 2017; pp. 1–13. [Google Scholar]
  27. Amir, G.; Kiseok, K.; Bichen, W.; Zizheng, T. SqueezeNext: Hardware-Aware Neural Network Design. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW) 2018, 2018, 1719–171909. [Google Scholar] [CrossRef] [Green Version]
  28. Song, H.; Pool, J.; Tran, J.; Dally, W.J. Learning Both Weights and Connections for Efficient Neural Networks. Adv. Neural Inf. Process. Syst. 2015, 1135–1143. [Google Scholar] [CrossRef]
  29. Wang, H.; Zhang, Q.; Wang, Y.; Yu, L.; Hu, H. Structured Pruning for Efficient ConvNets via Incremental Regularization. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–8. [Google Scholar] [CrossRef] [Green Version]
  30. Wen, L.; Zhang, X.; Bai, H.; Xu, Z. Structured pruning of recurrent neural networks through neuron selection. Neural Netw. 2020, 123, 134–141, ISSN 0893-6080. [Google Scholar] [CrossRef] [Green Version]
  31. Kang, H.-J. Accelerator-Aware Pruning for Convolutional Neural Networks. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 2093–2103. [Google Scholar] [CrossRef] [Green Version]
  32. Liu, J.Y.; Hui, J.F.; Sun, M.Y.; Liu, X.S.; Lu, W.H.; Ma, C.H.; Zhang, Q.B. A Multiplier-Less Convolutional Neural Network Inference Accelerator for Intelligent Edge Devices. IEEE J. Emerg. Sel. Top. Circuits Syst. 2021, 11, 739–750. [Google Scholar] [CrossRef]
  33. Russo, E.; Palesi, M.; Monteleone, S.; Patti, D.; Mineo, A.; Ascia, G.; Catania, V. DNN Model Compression for IoT Domain-Specific Hardware Accelerators. IEEE Internet Things J. 2021, 9, 6650–6662. [Google Scholar] [CrossRef]
  34. Liu, J.Y.; Hui, J.F.; Sun, M.Y.; Liu, X.S.; Lu, W.H.; Ma, C.H.; Zhang, Q.B. Libraries of Approximate Circuits: Automated Design and Application in CNN Accelerators. IEEE J. Emerg. Sel. Top. Circuits Syst. 2020, 10, 406–418. [Google Scholar] [CrossRef]
  35. Li, G.; Ma, F.; Guo, J.; Zhao, H. A Flexible and Efficient FPGA Accelerator for Various Large-Scale and Lightweight CNNs. IEEE Trans. Circuits Syst. I: Regul. Pap. 2021, 69, 1185–1198. [Google Scholar]
  36. Liu, J.Y.; Hui, J.F.; Sun, M.Y.; Liu, X.S.; Lu, W.H.; Ma, C.H.; Zhang, Q.B. An Efficient and Flexible Accelerator Design for Sparse Convolutional Neural Networks. IEEE Trans. Circuits Syst. I: Regul. Pap. 2021, 68, 2936–2949. [Google Scholar] [CrossRef]
  37. Francisco, E.; Gary, G.Y. Pruning Deep Convolutional Neural Networks Architectures with Evolution Strategy. Inf. Sci. 2021, 552, 29–47. [Google Scholar] [CrossRef]
  38. Götz, T.I.; Göb, S.; Sawant, S.; Erick, X.F.; Wittenberg, T.; Schmidkonz, C.; Tomé, A.M.; Lang, E.W.; Ramming, A. Number of Necessary Training Examples for Neural Networks with Different Number of Trainable Parameters. J. Pathol. Inform. 2022, 13, 100114. [Google Scholar] [CrossRef] [PubMed]
  39. Yang, H.; Kang, G.; Dong, X.; Fu, Y.; Yang, Y. Soft Filter Pruning for Accelerating Deep Convolutional Neural Networks. IJCAI Int. Jt. Conf. Artif. Intell. 2018, 2234–2240. [Google Scholar] [CrossRef] [Green Version]
  40. Luo, J.H.; Wu, J. An Entropy-Based Pruning Method for CNN Compression. arXiv 2017, arXiv:1706.05791v1. [Google Scholar] [CrossRef]
  41. Deepak, M.; Bhardwaj, S.; Khapra, M.M.; Ravindran, B. Studying the Plasticity in Deep Convolutional Neural Networks Using Random Pruning. Mach. Vis. Appl. 2019, 30, 203–216. [Google Scholar] [CrossRef] [Green Version]
  42. Sawant, S.S.; Bauer, J.; Erick, F.X.; Ingaleshwar, S.; Holzer, N.; Ramming, A.; Lang, E.W.; Götz, T. An optimal-score-based filter pruning for deep convolutional neural networks. Appl. Intell. 2022. [Google Scholar] [CrossRef]
  43. Shi, J.; Xu, J.; Tasaka, K.; Chen, Z. SASL: Saliency-Adaptive Sparsity Learning for Neural Network Acceleration. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 2008–2019. [Google Scholar] [CrossRef]
  44. Lin, S.; Ji, R.; Li, Y.; Deng, C.; Li, X. Toward Compact ConvNets via Structure-Sparsity Regularized Filter Pruning. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 574–588. [Google Scholar] [CrossRef]
  45. Tian, G.; Chen, J.; Zeng, X.; Liu, Y. Pruning by Training: A Novel Deep Neural Network Compression Framework for Image Processing. IEEE Signal Process. Lett. 2021, 28, 344–348. [Google Scholar] [CrossRef]
  46. Zheng, Y.-J.; Chen, S.-B.; Ding, C.H.Q.; Luo, B. Model Compression Based on Differentiable Network Channel Pruning. IEEE Trans. Neural Netw. Learn. Syst. 2022. [Google Scholar] [CrossRef]
  47. Zuo, Y.; Chen, B.; Shi, T.; Sun, M. Filter Pruning without Damaging Networks Capacity. IEEE Access 2020, 8, 90924–90930. [Google Scholar] [CrossRef]
  48. Li, G.; Wang, J.; Shen, H.; Chen, K.; Shan, G.; Lu, Z. CNNPruner: Pruning Convolutional Neural Networks with Visual Analytics. IEEE Trans. Vis. Comput. Graph. 2021, 27, 1364–1373. [Google Scholar] [CrossRef] [PubMed]
  49. Yang, H.; Liu, P.; Wang, Z.; Hu, Z.; Yang, Y. Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; p. 4335. [Google Scholar] [CrossRef] [Green Version]
  50. Hao, L.; Samet, H.; Kadav, A.; Durdanovic, I.; Graf, H.P. Pruning Filters for Efficient Convnets. In Proceedings of the 5th International Conference on Learning Representations 2017, ICLR 2017—Conference Track Proceedings, Toulon, France, 24–26 April 2017. [Google Scholar]
  51. Jun, M.; Sun, K.; Liao, X.; Leng, L.; Chu, J. Human Segmentation Based on Compressed Deep Convolutional Neural Network. IEEE Access 2020, 8, 167585–167595. [Google Scholar] [CrossRef]
  52. Chang, J.; Lu, Y.; Xue, P.; Xu, Y.; Wei, Z. ACP: Automatic Channel Pruning via Clustering and Swarm Intelligence Optimization for CNN. arXiv 2021, arXiv:2101.06407. [Google Scholar]
  53. Sijie, N.; Gao, K.; Ma, P.; Gao, X.; Zhao, H.; Dong, J.; Chen, Y.; Chen, D. Exploiting Sparse Self-Representation and Particle Swarm Optimization for CNN Compression. IEEE Trans. Neural Netw. Learn. Syst. 2022, 1447. [Google Scholar] [CrossRef]
  54. Wang, Z.; Li, F.; Shi, G.; Xie, X.; Wang, F. Network Pruning Using Sparse Learning and Genetic Algorithm. Neurocomputing 2020, 404, 247–256. [Google Scholar] [CrossRef]
  55. Zhou, Y.; Yen, G.G.; Yi, Z. Evolutionary Compression of Deep Neural Networks for Biomedical Image Segmentation. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 2916–2929. [Google Scholar] [CrossRef]
  56. Singh, P.; Verma, V.K.; Rai, P.; Namboodiri, V.P. Acceleration of Deep Convolutional Neural Networks Using Adaptive Filter Pruning. IEEE J. Sel. Top. Signal Process. 2020, 14, 838–847. [Google Scholar] [CrossRef]
  57. Sarfaraz, M.; Doja, M.N.; Chandra, P. Chaos Based Network Initialization Approach for Feed Forward Artificial Neural Networks. J. Comput. Theor. Nanosci. 2020, 17, 418–424. [Google Scholar] [CrossRef]
  58. Kaiming, H.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. Int. J. Robot. Res. 2015. [Google Scholar] [CrossRef]
  59. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  60. Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, USA, 11–13 April 2011; pp. 315–323. [Google Scholar]
  61. Vladimir, I.; Shvets, A. TernausNet: U-Net with VGG11 Encoder Pre-Trained on Imagenet for Image Segmentation. arXiv 2018, arXiv:1801.05746. [Google Scholar]
  62. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer International Publishing: Cham, Switzerland, 2015. [Google Scholar] [CrossRef] [Green Version]
  63. Emmanuel, M.; Tarabalka, Y.; Charpiat, G.; Alliez, P. Can Semantic Labeling Methods Generalize to Any City? The Inria Aerial Image Labeling Benchmark. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28 July 2017; pp. 3226–3229. [Google Scholar] [CrossRef] [Green Version]
  64. Chen, Q.; Wang, L.; Wu, Y.; Wu, G.; Guo, Z.; Waslander, S.L. Aerial Imagery for Roof Segmentation: A Large-Scale Dataset towards Automatic Mapping of Buildings. ISPRS J. Photogramm. Remote Sens. 2021, 147, 42–55. [Google Scholar] [CrossRef] [Green Version]
  65. Liu, L.; Liu, X.; Wang, N.; Zou, P. Modified Cuckoo Search Algorithm with Variational Parameters and Logistic Map. Algorithms 2018, 11, 30. [Google Scholar] [CrossRef] [Green Version]
  66. Yang, D.; Li, G.; Cheng, G. On the Efficiency of Chaos Optimization Algorithms for Global Optimization. Chaos Solitons Fractals 2007, 34, 1366–1375. [Google Scholar] [CrossRef]
  67. Liu, X.; Wu, L.; Dai, C.; Chao, H.C. Compressing CNNs Using Multi-Level Filter Pruning for the Edge Nodes of Multimedia Internet of Things. IEEE Internet Things J. 2021, 4662, 1–11. [Google Scholar] [CrossRef]
  68. Bischke, B.; Helber, P.; Folz, J.; Borth, D.; Dengel, A. Multi-Task Learning for Segmentation of Building Footprints with Deep Neural Networks. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 1480–1484. [Google Scholar] [CrossRef]
Figure 1. Overall procedure for the proposed filter pruning.
Figure 2. The proposed filter pruning approach. Here, $w^t$ represents the weight of the filter at the $t$-th iteration.
Figure 3. Convergence curve of the proposed approach with different weight initialization strategies. We report the loss of the network just after pruning (and before fine-tuning).
Table 1. Results of pruning TernausNet and U-Net on Inria dataset. "Baseline" represents the normal training results, i.e., without pruning. "↓" represents the reduction rate, which is the drop between the performance of the pruned model and the baseline model. A negative value in "Val. Acc. ↓" specifies an improvement in the model accuracy. B = billion, M = million.

| Network | Model | Val. Loss | Val. Acc. (%) | Val. Acc. ↓ (%) | FLOPs | FLOPs ↓ (%) | # Param. | # Param. ↓ (%) |
|---|---|---|---|---|---|---|---|---|
| TernausNet | Baseline | 0.1037 | 96.00 | - | 32.3 B | - | 22.9 M | - |
| | Proposed (Random) | 0.1121 | 95.68 (±0.001) | 0.32 | 4.85 B | 84.98 | 3.77 M | 83.54 |
| | Proposed (Chaotic) | 0.1061 | 95.91 (±0.001) | 0.09 | 6.27 B | 80.59 | 4.33 M | 81.12 |
| | Proposed (He) | 0.1046 | 96.04 (±0.001) | −0.04 | 11.6 B | 64.18 | 6.14 M | 73.20 |
| | Proposed (Lecun) | 0.1046 | 96.00 (±0.001) | 0.00 | 11.0 B | 66.02 | 6.36 M | 72.25 |
| | Proposed (Xavier) | 0.1039 | 96.04 (±0.001) | −0.04 | 11.6 B | 64.18 | 6.14 M | 73.20 |
| U-Net | Baseline | 0.1025 | 96.08 | - | 54.6 B | - | 31.0 M | - |
| | Proposed (Random) | 0.1079 | 95.80 (±0.001) | 0.27 | 8.85 B | 83.80 | 3.10 M | 83.68 |
| | Proposed (Chaotic) | 0.1066 | 95.88 (±0.001) | 0.21 | 10.2 B | 81.28 | 5.06 M | 81.15 |
| | Proposed (He) | 0.1070 | 95.89 (±0.001) | 0.20 | 13.6 B | 75.17 | 5.85 M | 75.32 |
| | Proposed (Lecun) | 0.1042 | 96.04 (±0.001) | 0.04 | 14.0 B | 74.46 | 7.66 M | 74.70 |
| | Proposed (Xavier) | 0.1056 | 95.94 (±0.001) | 0.15 | 13.9 B | 74.59 | 7.85 M | 74.39 |
Table 2. Comparison with state-of-the-art methods for pruning TernausNet and U-Net on the Inria dataset. "# Param." indicates the number of parameters, and "↓" represents the reduction rate.

| Network | Method | Val. Loss | Val. Acc. ↓ (%) | FLOPs ↓ (%) | # Param. ↓ (%) |
|---|---|---|---|---|---|
| TernausNet | Random Pruning | 0.1059 | 0.09 | 74.30 | 75.08 |
| TernausNet | l1-Norm | 0.4030 | 10.19 | 81.74 | 66.47 |
| TernausNet | Entropy | 0.1066 | 0.07 | 71.87 | 69.87 |
| TernausNet | Proposed (Random) | 0.1121 | 0.32 | 84.98 | 83.54 |
| TernausNet | Proposed (Chaotic) | 0.1061 | 0.09 | 80.59 | 81.12 |
| TernausNet | Proposed (He) | 0.1046 | −0.04 | 64.18 | 73.20 |
| TernausNet | Proposed (LeCun) | 0.1046 | 0.00 | 66.02 | 72.25 |
| TernausNet | Proposed (Xavier) | 0.1039 | −0.04 | 64.18 | 73.20 |
| U-Net | Random Pruning | 0.1055 | 0.15 | 74.21 | 74.19 |
| U-Net | l1-Norm | 0.3113 | 9.50 | 87.23 | 57.32 |
| U-Net | Entropy | 0.1062 | 0.12 | 68.79 | 65.94 |
| U-Net | Proposed (Random) | 0.1079 | 0.28 | 83.80 | 83.68 |
| U-Net | Proposed (Chaotic) | 0.1066 | 0.21 | 81.28 | 81.15 |
| U-Net | Proposed (He) | 0.1042 | 0.04 | 74.46 | 74.70 |
| U-Net | Proposed (LeCun) | 0.1070 | 0.20 | 75.17 | 75.32 |
| U-Net | Proposed (Xavier) | 0.1056 | 0.16 | 74.59 | 74.39 |
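For reference, the l1-norm baseline compared against in Tables 2 and 4 ranks filters by the sum of the absolute values of their kernel weights and discards those with the smallest norms. The minimal PyTorch-style sketch below illustrates this ranking only; the layer dimensions and the 50% pruning fraction are hypothetical and are not taken from the compared method or from this article.

```python
import torch
import torch.nn as nn

def l1_rank_filters(conv: nn.Conv2d) -> torch.Tensor:
    """Return filter indices sorted by ascending l1-norm of their kernels."""
    # conv.weight has shape (out_channels, in_channels, kH, kW); the l1-norm of
    # each output filter is the sum of the absolute values of its kernel weights.
    norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    return torch.argsort(norms)  # weakest (smallest-norm) filters come first

# Hypothetical usage: mark the weakest half of a toy layer's filters for removal.
conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3)
order = l1_rank_filters(conv)
to_prune = order[: conv.out_channels // 2]
print(f"{len(to_prune)} of {conv.out_channels} filters selected for pruning")
```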
Table 3. Results on the AIRS dataset. "Baseline" represents the normal training results, i.e., without pruning. "↓" represents the reduction rate.

| Network | Model | Val. Loss | Val. Acc. (%) | Val. Acc. ↓ (%) | FLOPs | FLOPs ↓ (%) | # Param. | # Param. ↓ (%) |
|---|---|---|---|---|---|---|---|---|
| TernausNet | Baseline | 0.0174 | 99.32 | - | 32.3 B | - | 22.9 M | - |
| TernausNet | Proposed (Random) | 0.0190 | 99.27 | 0.05 | 5.06 B | 84.34 | 3.78 M | 83.50 |
| TernausNet | Proposed (Chaotic) | 0.0190 | 99.24 | 0.07 | 7.31 B | 77.37 | 4.27 M | 81.36 |
| TernausNet | Proposed (He) | 0.0181 | 99.27 | 0.05 | 7.80 B | 75.84 | 5.64 M | 75.42 |
| TernausNet | Proposed (LeCun) | 0.0178 | 99.29 | 0.03 | 8.08 B | 74.96 | 5.90 M | 74.27 |
| TernausNet | Proposed (Xavier) | 0.0184 | 99.27 | 0.04 | 7.80 B | 75.84 | 5.64 M | 75.42 |
| U-Net | Baseline | 0.0169 | 99.34 | - | 54.6 B | - | 31.0 M | - |
| U-Net | Proposed (Random) | 0.0178 | 99.28 | 0.06 | 8.95 B | 83.63 | 5.02 M | 83.84 |
| U-Net | Proposed (Chaotic) | 0.0179 | 99.28 | 0.06 | 10.1 B | 81.46 | 5.97 M | 80.75 |
| U-Net | Proposed (He) | 0.0170 | 99.31 | 0.02 | 13.0 B | 76.13 | 7.50 M | 75.82 |
| U-Net | Proposed (LeCun) | 0.0173 | 99.31 | 0.03 | 13.4 B | 75.41 | 7.78 M | 74.92 |
| U-Net | Proposed (Xavier) | 0.0170 | 99.32 | 0.02 | 13.8 B | 74.68 | 7.61 M | 75.47 |
Table 4. Comparison with state-of-the-art methods for pruning TernausNet and U-Net on the AIRS dataset. "↓" represents the reduction rate.

| Network | Method | Val. Loss | Val. Acc. ↓ (%) | FLOPs ↓ (%) | # Param. ↓ (%) |
|---|---|---|---|---|---|
| TernausNet | Random Pruning | 0.0180 | 0.05 | 74.73 | 74.13 |
| TernausNet | l1-Norm | 0.1491 | 4.66 | 80.91 | 65.45 |
| TernausNet | Entropy | 0.0174 | 0.02 | 69.86 | 72.47 |
| TernausNet | Proposed (Random) | 0.0190 | 0.05 | 84.34 | 83.50 |
| TernausNet | Proposed (Chaotic) | 0.0190 | 0.07 | 77.37 | 81.36 |
| TernausNet | Proposed (He) | 0.0181 | 0.05 | 75.84 | 75.42 |
| TernausNet | Proposed (LeCun) | 0.0178 | 0.03 | 74.96 | 74.27 |
| TernausNet | Proposed (Xavier) | 0.0184 | 0.04 | 75.84 | 75.42 |
| U-Net | Random Pruning | 0.0170 | 0.03 | 73.28 | 73.98 |
| U-Net | l1-Norm | 0.2578 | 6.29 | 87.12 | 57.02 |
| U-Net | Entropy | 0.0167 | 0.00 | 67.93 | 72.14 |
| U-Net | Proposed (Random) | 0.0178 | 0.06 | 83.63 | 83.84 |
| U-Net | Proposed (Chaotic) | 0.0179 | 0.06 | 81.46 | 80.75 |
| U-Net | Proposed (He) | 0.0170 | 0.02 | 76.13 | 75.82 |
| U-Net | Proposed (LeCun) | 0.0173 | 0.03 | 75.41 | 74.92 |
| U-Net | Proposed (Xavier) | 0.0170 | 0.02 | 74.68 | 75.47 |
Table 5. Layer-by-layer pruning statistics for TernausNet on the Inria dataset.

| Layer | Baseline #Filters | Random #Filters | Random FLOPs ↓ (%) | Chaotic #Filters | Chaotic FLOPs ↓ (%) | He #Filters | He FLOPs ↓ (%) | LeCun #Filters | LeCun FLOPs ↓ (%) | Xavier #Filters | Xavier FLOPs ↓ (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| conv1 | 64 | 11 | 99.94 | 30 | 99.83 | 64 | 99.64 | 64 | 99.64 | 64 | 99.64 |
| conv2 | 128 | 49 | 99.75 | 58 | 99.20 | 128 | 96.25 | 128 | 96.25 | 128 | 96.25 |
| conv3 | 256 | 103 | 99.42 | 107 | 99.29 | 256 | 96.26 | 256 | 96.26 | 256 | 96.26 |
| conv4 | 256 | 106 | 98.75 | 112 | 98.63 | 132 | 96.14 | 133 | 96.11 | 132 | 96.14 |
| conv5 | 512 | 209 | 99.37 | 230 | 99.26 | 262 | 99.01 | 265 | 98.99 | 262 | 99.01 |
| conv6 | 512 | 211 | 98.74 | 223 | 98.54 | 245 | 98.17 | 265 | 98.00 | 245 | 98.17 |
| conv7 | 512 | 203 | 99.69 | 211 | 99.66 | 261 | 99.54 | 258 | 99.51 | 261 | 99.54 |
| conv8 | 512 | 214 | 99.69 | 220 | 99.67 | 249 | 99.54 | 255 | 99.53 | 249 | 99.54 |
| conv9 | 512 | 206 | 99.92 | 230 | 99.91 | 275 | 99.88 | 248 | 99.89 | 275 | 99.88 |
| convT_1 | 256 | 105 | 99.85 | 104 | 99.83 | 125 | 99.75 | 134 | 99.76 | 125 | 99.75 |
| conv10 | 512 | 203 | 99.54 | 227 | 99.48 | 254 | 99.32 | 260 | 99.28 | 254 | 99.32 |
| convT_2 | 256 | 104 | 99.40 | 109 | 99.29 | 127 | 99.08 | 134 | 99.01 | 127 | 99.08 |
| conv11 | 512 | 209 | 98.12 | 230 | 97.82 | 250 | 97.35 | 259 | 97.05 | 250 | 97.35 |
| convT_3 | 128 | 51 | 98.78 | 54 | 98.58 | 61 | 98.26 | 62 | 98.17 | 61 | 98.26 |
| conv12 | 256 | 96 | 98.28 | 115 | 97.82 | 128 | 97.18 | 129 | 97.13 | 128 | 97.18 |
| convT_4 | 64 | 25 | 98.90 | 28 | 98.53 | 31 | 98.19 | 30 | 98.23 | 31 | 98.19 |
| conv13 | 128 | 51 | 98.27 | 57 | 97.76 | 63 | 95.42 | 61 | 95.60 | 63 | 95.42 |
| convT_5 | 32 | 11 | 98.97 | 14 | 98.54 | 31 | 96.43 | 1 | 99.89 | 31 | 96.43 |
| conv14 | 32 | 10 | 99.60 | 13 | 98.95 | 7 | 98.78 | 19 | 97.74 | 7 | 98.78 |
| conv15 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |

| Summary | Baseline | Random | Chaotic | He | LeCun | Xavier |
|---|---|---|---|---|---|---|
| # Remaining Filters | 5441 | 2178 | 2373 | 2950 | 2962 | 2950 |
| Val. Acc. (%) | 96.00 | 95.68 | 95.91 | 96.04 | 96.00 | 96.04 |
| # Param. | 22.9 M | 3.77 M | 4.33 M | 6.14 M | 6.36 M | 6.14 M |
| FLOPs | 32.3 B | 4.85 B | 6.27 B | 11.6 B | 11.0 B | 11.6 B |
| Model Size (MB) | 89.58 | 14.76 | 16.93 | 24.02 | 24.87 | 24.02 |
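The FLOP counts summarized above are driven by the number of filters retained per layer, since removing a filter shrinks both the output channels of its own layer and the input channels of the layer that follows. The sketch below shows one common way of estimating the cost of a single convolution from its channel counts, kernel size, and output resolution; the stride-1 assumption, the multiply-add-counts-as-two-FLOPs convention, and the example channel numbers are assumptions for illustration, not the exact accounting used to produce the tables.

```python
def conv2d_flops(c_in: int, c_out: int, k: int, h_out: int, w_out: int) -> int:
    """Rough FLOP estimate for a k x k convolution producing a c_out x h_out x w_out map.

    Assumes stride 1, no groups, ignores the bias term, and counts one
    multiply-add as two FLOPs; this is a common convention, not necessarily
    the one used for the numbers reported in this article.
    """
    return 2 * k * k * c_in * c_out * h_out * w_out

# Pruning shrinks c_out of the current layer and c_in of the next layer, so the
# per-layer cost falls roughly with the product of the two retained channel counts.
full = conv2d_flops(c_in=64, c_out=128, k=3, h_out=256, w_out=256)   # hypothetical unpruned layer
pruned = conv2d_flops(c_in=11, c_out=49, k=3, h_out=256, w_out=256)  # after pruning both sides
print(f"example per-layer FLOP reduction: {100 * (1 - pruned / full):.2f}%")
```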
Table 6. Compression and acceleration analysis of the proposed approach on TernausNet.

| Initialization Strategy | Compression Ratio | Acceleration Ratio |
|---|---|---|
| Random | 6.1× | 6.7× |
| Chaotic | 5.3× | 5.2× |
| He | 3.7× | 2.8× |
| LeCun | 3.6× | 2.9× |
| Xavier | 3.7× | 2.8× |
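The compression ratios in Table 6 are consistent with dividing the baseline parameter count of TernausNet by the parameter count of each pruned model from Table 5; the acceleration ratios, by contrast, were measured at run time on GPUs and are not derived here. A small arithmetic check, assuming the compression ratio is this simple quotient, is sketched below.

```python
# Parameter counts for the TernausNet baseline and pruned models (from Table 5).
baseline_params = 22.9e6
pruned_params = {
    "Random": 3.77e6,
    "Chaotic": 4.33e6,
    "He": 6.14e6,
    "LeCun": 6.36e6,
    "Xavier": 6.14e6,
}

for strategy, params in pruned_params.items():
    # e.g., Random: 22.9 M / 3.77 M ~= 6.1x, matching the compression ratio in Table 6.
    print(f"{strategy}: {baseline_params / params:.1f}x compression")
```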
Table 7. Ablation study over the ε values for TernausNet on the Inria dataset.

| Initialization Strategy | ε | Val. Acc. (%) | FLOPs ↓ (%) | # Param. ↓ (%) |
|---|---|---|---|---|
| Random | 0.3 | 95.84 | 84.34 | 83.50 |
| Random | 0.35 | 95.85 | 84.34 | 83.50 |
| Random | 0.4 | 95.77 | 84.34 | 83.50 |
| Random | 0.45 | 95.78 | 84.34 | 83.50 |
| Random | 0.5 | 95.68 | 84.98 | 83.54 |
| Chaotic (logistic map) | 0.3 | 95.92 | 80.59 | 81.12 |
| Chaotic (logistic map) | 0.35 | 95.82 | 80.59 | 81.12 |
| Chaotic (logistic map) | 0.4 | 95.83 | 80.59 | 81.12 |
| Chaotic (logistic map) | 0.45 | 95.81 | 80.59 | 81.12 |
| Chaotic (logistic map) | 0.5 | 95.91 | 80.59 | 81.12 |
| He | 0.3 | 95.99 | 75.84 | 75.42 |
| He | 0.35 | 95.91 | 75.84 | 75.42 |
| He | 0.4 | 95.80 | 77.30 | 75.50 |
| He | 0.45 | 96.03 | 64.18 | 73.20 |
| He | 0.5 | 96.04 | 64.18 | 73.20 |
| LeCun | 0.3 | 95.84 | 74.96 | 74.27 |
| LeCun | 0.35 | 95.97 | 74.96 | 74.27 |
| LeCun | 0.4 | 95.93 | 74.96 | 74.27 |
| LeCun | 0.45 | 95.75 | 76.48 | 74.66 |
| LeCun | 0.5 | 96.00 | 66.02 | 72.25 |
| Xavier | 0.3 | 95.88 | 75.84 | 75.42 |
| Xavier | 0.35 | 96.03 | 75.84 | 75.42 |
| Xavier | 0.4 | 95.82 | 77.30 | 75.50 |
| Xavier | 0.45 | 96.07 | 64.18 | 73.20 |
| Xavier | 0.5 | 96.04 | 64.18 | 73.20 |
Table 8. Pruning results for TernausNet on the Inria dataset under various threshold values.

Threshold: mean of the weights.

| Initialization Strategy | Val. Acc. (%) | FLOPs ↓ (%) | # Param. ↓ (%) |
|---|---|---|---|
| Random | 95.92 | 74.78 | 74.29 |
| Chaotic | 95.93 | 74.43 | 74.72 |
| He | 95.82 | 77.19 | 75.50 |
| LeCun | 95.91 | 74.78 | 74.29 |
| Xavier | 95.79 | 77.19 | 75.50 |

Threshold: 1.2-times the mean of the weights.

| Initialization Strategy | Val. Acc. (%) | FLOPs ↓ (%) | # Param. ↓ (%) |
|---|---|---|---|
| Random | 95.76 | 84.34 | 83.50 |
| Chaotic | 95.83 | 80.59 | 81.12 |
| He | 95.79 | 77.30 | 75.50 |
| LeCun | 95.93 | 74.96 | 74.31 |
| Xavier | 95.82 | 77.30 | 75.50 |
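Table 8 compares two pruning thresholds, the mean of the importance weights and 1.2-times that mean, with the larger threshold removing more filters. The following minimal sketch illustrates one plausible form of such a rule; the per-filter importance scores and the keep/discard decision shown here are assumptions made for illustration and may differ from the exact criterion defined in the method section.

```python
import numpy as np

def filters_to_keep(importance: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """Keep filters whose importance weight reaches `scale` times the layer mean.

    scale = 1.0 corresponds to the "mean of the weights" setting of Table 8 and
    scale = 1.2 to the "1.2-times the mean" setting; the rule used in the article
    may differ, this is only an illustration.
    """
    threshold = scale * importance.mean()
    return importance >= threshold  # boolean mask: True = keep, False = prune

# Hypothetical importance scores for a 64-filter layer.
rng = np.random.default_rng(0)
scores = rng.random(64)
for scale in (1.0, 1.2):
    kept = filters_to_keep(scores, scale)
    print(f"scale {scale}: keep {int(kept.sum())} of {kept.size} filters")
```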
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
