Pruning convolutional neural networks with an attention mechanism for remote sensing image classification

Despite the great success of Convolutional Neural Networks (CNNs) in various visual recognition tasks, the high computational and storage costs of such deep networks impede their deployment in real-time remote sensing tasks. To this end, considerable attention has been given to filter pruning techniques, which slim deep networks with acceptable performance drops and thus allow them to be implemented on remote sensing devices. In this paper, we propose a new scheme, termed Pruning Filter with Attention Mechanism (PFAM), to compress and accelerate traditional CNNs. In particular, a novel correlation-based filter pruning criterion, which explores the long-range dependencies among filters via an attention module, is employed to select the to-be-pruned filters. Distinct from previous methods, the less correlated filters are first pruned in the pruning stage of the current training epoch, and they are reconstructed and updated during the next training epoch. Doing so allows the input data to be processed with the maximum information preserved under the original training strategy, such that the compressed network model can be obtained without the need for a pretrained model. The proposed method is evaluated on three public remote sensing image datasets, and the experimental results demonstrate its superiority compared to state-of-the-art baselines. Specifically, PFAM achieves a 0.67% accuracy improvement with a 40% model-size reduction on the Aerial Image Dataset (AID).


Introduction
With mass applications of remote sensing equipment, how to perform efficient remote sensing image scene classification is becoming a significant, yet challenging research problem. In recent years, Convolutional Neural Networks (CNNs) have shown appealing performance on various computer vision tasks, which have been applied broadly in remote sensing image scene classification [1][2][3][4][5][6]. However, large amounts of computational resources from the high-performance GPUs are required to run the complicated CNNs. Moreover, traditional CNNs usually contain many network parameters, even millions, which indicates that remote sensing equipment suffers from high demands for large memory storage space. Most of the remote sensing devices prefer to conduct real-time data collection and analysis on an aircraft, rather than making a decision in a workshop. In this context, the computing power in existing remote sensing devices is quite limited due to the embedded low-level CPUs and GPUs, thus making it infeasible to deploy the deep learning techniques on those machines directly.
In this paper, we propose a novel filter pruning method, termed Pruning Filter with Attention Mechanism (PFAM), which combines an attention module with soft pruning of the less correlated filters to obtain a compact deep network model for remote sensing image classification. To be specific, by deploying the attention module, the target filters that have smaller correlation values than the others are pruned in the current pruning stage. In the next training epoch, the filters pruned in the previous pruning stage are recovered and further updated to avoid the accuracy loss caused by the pruning process. By conducting such iterative training steps, a compact deep network model with satisfactory performance is finally obtained. The contributions of our work are threefold: (1) A novel deep network compression method termed Pruning Filter with Attention Mechanism (PFAM) is proposed for efficient remote sensing image classification. The compact network model is obtained by integrating the attention-based filter pruning strategy into a unified end-to-end training process. (2) A novel correlation-based filter selection criterion is proposed for filter pruning, where the correlation value of each filter is calculated through the attention module, and the less correlated filters are then pruned to reduce the network complexity. The proposed attention module models the correlation among filters efficiently by exploring their long-range dependencies, which makes it more likely to select the to-be-pruned filters wisely without compromising the network performance. (3) Extensive experiments on three remote sensing image datasets demonstrate that our proposed PFAM outperforms the state-of-the-art algorithms in most settings.
The rest of this paper is organized as follows. Section 2 discusses previous representative approaches in the research field of network compression. Then, our proposed filter pruning method is detailed in Section 3. Section 4 shows the experimental results on multiple remote sensing datasets. Finally, the discussion and conclusions of this research are drawn in Sections 5 and 6, respectively.

Related Works
In network pruning, the unimportant activations and connections could be removed to obtain more compact models. For instance, Han et al. [23] discarded the unnecessary connections that were less than a default threshold value, and then, they retrained such a sparse model to improve its accuracy. Deep Compression [33] achieved great energy and memory savings by extending their previous work [23]. In particular, they combined the connection pruning with the quantization techniques, where several remaining important connections could share the same weights. Then, Huffman encoding technology was utilized to make further compression. Although Deep Compression obtained a high compression ratio on the CNN models by removing unimportant parameters, the parameter importance varied dramatically if changing the network structure. This implies that this sort of hard pruning method will suffer if the important connections are removed incorrectly during long-time training.
To enhance it, dynamic network surgery was proposed in [34], where the pruning and splicing methods are jointly combined to recover some important connections that were removed incorrectly before. They are recovered as significant connections again during the next training period, thus reducing the possibility of misclassification to the maximum extent. More importantly, the pruning and training processes are synchronized seamlessly such that the problems of long retraining time and incorrect pruning can be solved effectively. Alternatively, unlike conventional weight or connection pruning, Li et al. [29] proposed a filter pruning method named Pruning Filters for Efficient ConvNets (PFEC) to delete some filters and their related feature maps in the next convolutional process to reduce the computation costs. In particular, the absolute values of the filters are calculated and ordered from smallest to largest, then the filters with the smallest values are removed to obtain a highly compressed network model. Liu et al. [30] proposed a pruning method termed network slimming by enforcing channel-level sparsity in the network. To be specific, they evaluated the scores of the input channels by calculating the corresponding weights of the batch normalization layers. These scores are compared with a pre-set threshold value, and those filters with scores lower than the threshold are removed.
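The magnitude-based criterion described above for PFEC can be illustrated with a short sketch. This is a minimal NumPy illustration under stated assumptions, not the implementation from [29]: the function name `smallest_norm_filters` and the weight layout (output channels, input channels, k, k) are hypothetical choices made for the example.

```python
import numpy as np

def smallest_norm_filters(weights, num_prune):
    """Return the sorted indices of the num_prune filters with the smallest L1 norm.

    weights: array of shape (out_channels, in_channels, k, k) (assumed layout).
    """
    # Sum of absolute weight values per output filter.
    norms = np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)
    # Order filters from smallest to largest norm and keep the first num_prune.
    order = np.argsort(norms, kind="stable")
    return sorted(order[:num_prune].tolist())
```

In a hard pruning scheme, the returned filters and the feature maps they produce would be removed permanently before fine-tuning.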
Unlike previous research works that utilized hard filter pruning technology, the soft pruning method was proposed in Soft Filter Pruning (SFP) [32] and Filter Pruning via Geometric Median (FPGM) [35], where the pruned convolution filters are recovered and involved in the next training iteration, rather than being deleted once and gone permanently. By doing so, there is no need to go through the fine-tuning process on the pretrained model to compensate for the accuracy drop after the pruning. One more advantage of such a soft pruning scheme is that it saves much training time by conducting the pruning process immediately after the end of each training process. Our method generally follows the soft pruning method as [32,35]; however, it utilizes an advanced pruning filter selection strategy based on the correlation between filters from the involved attention module, thus obtaining a more robust and compact network model.

Motivation
As discussed in the previous sections, earlier filter pruning works [29][30][31] generally compressed the deep CNNs in a hard manner. To be specific, these algorithms firstly prune the unimportant filters in every single convolutional layer from a pretrained model directly, where the significance evaluation of the single filter is often inexact because of ambiguous calculations in one certain layer. Then, the pruned model needs to go through the fine-tuning stage to compensate for the performance degradation caused by the pruning. However, once the filters are selected, these to-be-pruned filters are abandoned permanently during the pruning process and never recovered again in the following fine-tuning stage. Although the model is dramatically reduced in size due to the removal of filters, such a hard pruning method is more likely to yield unsatisfactory performance due to the shrinkage of model capacity. Besides, it is worth mentioning that these methods may have expensive computational resources costs to get the final pruned model fine-tuned on the training data.
In contrast to the hard pruning methods, Reference [32] proposed to dynamically remove the filters in a soft manner by calculating the norm value of each filter independently. Despite its advances, the underlying mechanism that focuses on the individual effect of the single filter in one convolutional layer without considering the global relationship among them is deemed suboptimal. In other words, the least important filter that is determined individually is not the least important one if a global view considering all the filters is adopted. From another perspective, it might be useful if we investigate the long-range dependency of filters and involve these dedicated dependency/relationships among filters in the to-be-pruned filter selection. In traditional CNNs, however, the long-range dependency among layers can only be obtained by repeatedly backpropagating through the stacking convolutional layers. That is to say, the current deep networks are usually inefficient at capturing such long-range dependencies, and it is difficult to operate locally when the information needs to be passed back and forth between relatively remote locations [36]. To tackle the above problems, the attention mechanism was applied in many applications to obtain better network performance. For example, Mnih et al. [37] utilized an attention module in the network training that pays more attention to the local areas with high-correlated weights from the whole target areas, which simulates the human's visual attention behavior when observing images. Following previous works [38][39][40], they adopted a self-attention module that calculates the response at a specific position as a weighted sum of the features at all positions [41][42][43][44]. In this case, the weights or attention vectors were calculated with low computational costs, while a good balance between the long-range dependency modeling ability and computational efficiency was achieved.
Inspired by previous works, we propose a new filter pruning method termed Pruning Filter with Attention Mechanism (PFAM), which integrates an attention mechanism into the filter pruning.
In particular, the attention module is used to compute the correlation value of each filter and then select the less correlated filters based on those values; hence, the overall correlations among all filters in one convolutional layer are considered so that the accuracy loss is minimized globally. In the pruning stage, the values of the filters selected as less correlated are set to zero, which means these filters are removed. In the next training epoch, the values of the pruned filters are recovered from zero to non-zero and updated by the subsequent forward-backward operations. By doing so, the training data can be processed by the original training strategy without compromising the performance, while the compressed network model can be obtained in the end with no need for the pretrained model. The general process of the proposed filter pruning approach is shown in Figure 1.

Filter Selection with Attention-Based Correlation
Figure 2 illustrates the workflow of the proposed attention module in calculating the correlation values between filters in one convolutional layer; its structure is similar to that of the attention modules used in many computer vision tasks. In particular, the filters are treated like feature maps, in the sense that flattening one filter into a one-dimensional vector is similar to flattening one feature map into a one-dimensional vector, only with a different vector length. In our paper, instead of finding the most attractive feature maps as in previous works, we aim at selecting the least attractive filters from a certain number of candidates and pruning those less correlated filters to create a compact model. Without loss of generality, we first define some mathematical symbols following [32] to ease the explanation. To be specific, the filter tensor with a $k \times k$ filter size in the $i$-th convolutional layer is defined as $W^{(i)} \in \mathbb{R}^{C_{i+1} \times C_i \times k \times k}$, where $1 \le i \le L$. $C_{i+1}$ and $C_i$ denote the numbers of output and input channels of the $i$-th convolutional layer, respectively, and $L$ is the number of layers.
Then, all the filters in the $i$-th layer are flattened as $W^{(i)} \in \mathbb{R}^{C_{i+1} \times V}$, where $V$ denotes the length of each flattened filter in the $i$-th convolutional layer and equals $C_i \times k \times k$. The filters in the $i$-th layer are first transformed into two weight spaces, $W_f^{(i)}$ and $W_g^{(i)}$, to calculate the attention map, as formulated in Equation (1). Here, $F(w_j)$ and $G(w_k)$ represent the values of the $j$-th filter in the $W_f^{(i)}$ weight space and the $k$-th filter in the $W_g^{(i)}$ weight space, respectively. $\theta_{j,k}$ represents the extent of correlation between the $j$-th and $k$-th filters, and $M$ is the number of filters in the $i$-th convolutional layer. The output of the attention module then gives the attention value of each filter, as in Equation (2). As discussed above, the attention module is used to evaluate the correlation of each filter based on Equation (2) in the pruning stage. The filters with smaller correlation values can be pruned, because they have less impact on the network performance than the other, highly correlated filters.
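To make the selection step more concrete, the following sketch computes an attention map over the flattened filters of one layer and picks the least correlated ones. It is an illustration under explicit assumptions rather than the exact formulation of Equations (1) and (2): the maps F and G are taken to be linear projections (the hypothetical matrices `proj_f` and `proj_g`), the attention map is assumed to be a row-wise softmax of dot products, and the per-filter correlation score is assumed to be the total attention a filter receives from all filters.

```python
import numpy as np

def attention_correlation(filters, proj_f, proj_g):
    """Attention map and per-filter correlation score (illustrative assumptions).

    filters: (M, V) flattened filters of one convolutional layer.
    proj_f, proj_g: (V, d) hypothetical linear maps standing in for F and G.
    """
    f = filters @ proj_f                          # (M, d), filters in the f space
    g = filters @ proj_g                          # (M, d), filters in the g space
    scores = f @ g.T                              # (M, M) pairwise similarities
    scores -= scores.max(axis=1, keepdims=True)   # shift for numerical stability
    theta = np.exp(scores)
    theta /= theta.sum(axis=1, keepdims=True)     # softmax: each row sums to 1
    correlation = theta.sum(axis=0)               # assumed score: attention received
    return theta, correlation

def select_least_correlated(correlation, pruning_rate):
    """Indices of the filters with the smallest correlation scores."""
    num_prune = int(len(correlation) * pruning_rate)
    return np.argsort(correlation, kind="stable")[:num_prune]
```

With M filters and a pruning rate P, `select_least_correlated` returns the indices of the filters that would be zeroed in the current pruning stage; the number of selected filters is rounded down here, which is also an assumption.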

Filter Pruning and Reconstruction
In the pruning stage, all the candidate convolution layers are pruned at the same time with the same pruning rate $P_i = P$, which saves a large amount of computation compared to the hard pruning methods. In particular, a pruning rate $P_i$ is set to select a total of $C_{i+1} P_i$ less correlated filters in the $i$-th convolution layer [32,35]. After pruning the filters in each convolution layer, existing methods always require extra training for the network to converge [21,22]. During the training process, these selected filters are first zeroed out, which means they make no contribution to the network output in the current pruning stage.
In most filter pruning methods, however, the pruned filters and their associated feature maps are removed permanently during the pruning process, which could affect the performance significantly. To deal with this problem, these pruning methods are usually applied to a pretrained model, and they also need extra fine-tuning time for accuracy compensation. To remove the heavy dependence on the pretrained model and the time-consuming fine-tuning process, we follow the same reconstruction strategy as [27] at this stage, where the filters pruned in the previous pruning process are reconstructed during one epoch of retraining. To be specific, the values of these pruned filters are updated from zero to non-zero after the backpropagation [32,35]. By doing so, the pruned model still has the same capacity as the original model during the training process. More importantly, each of those filters can still contribute to the final prediction. As a result, we can train our network either from scratch or from the pretrained model and obtain competitive results even without the fine-tuning stage. Figure 3 shows the flowchart of the proposed PFAM, where the iterative training is performed repeatedly until the loss converges after several training epochs. When the model has converged, we obtain a sparse model with multiple zeroed filters. These selected filters then remain unchanged because the iteration has completed. Since each filter corresponds to one feature map, the feature maps corresponding to the zeroed filters will always be zero during inference, so removing these zeroed filters and their related feature maps has no effect on the output. After the iteration of the previous steps, the compact model is finally created. The whole process is briefly summarized in Algorithm 1.

[Figure 3: flowchart of the proposed PFAM. Box labels: initialize network; select filters with small correlations; prune selected filters simultaneously; reconstruct the pruned filters.]

Algorithm 1
…
5: Find $C_{i+1} P_i$ filters that satisfy Equation (2);
6: Zeroize selected filters;
7: end for
8: end for
9: return the compact model $W^*$ from $W$.
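The prune, reconstruct, and compact steps can be sketched in a few lines. The example below is a toy illustration under stated assumptions, not the training code used in the experiments: random arrays stand in for the gradients that backpropagation would produce, the selected indices are fixed by hand rather than chosen by the attention module, and the helper names `zeroize_filters`, `sgd_step`, and `compact` are hypothetical.

```python
import numpy as np

def zeroize_filters(weights, indices):
    """Soft pruning: set the selected filters to zero but keep their slots."""
    pruned = weights.copy()
    pruned[indices] = 0.0
    return pruned

def sgd_step(weights, grads, lr=0.01):
    """Plain gradient step; zeroed filters receive updates like any other filter."""
    return weights - lr * grads

def compact(weights):
    """After the final pruning stage, drop the filters that are entirely zero."""
    keep = np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1) > 0
    return weights[keep]

rng = np.random.default_rng(0)
w = rng.normal(size=(6, 3, 3, 3))          # 6 filters of shape 3 x 3 x 3
selected = [1, 4]                          # hand-picked for illustration

pruned = zeroize_filters(w, selected)      # pruning stage of the current epoch
updated = sgd_step(pruned, rng.normal(size=w.shape))  # next epoch: filters recover
final = compact(zeroize_filters(updated, selected))   # last stage, then removal
```

Because the zeroed filters keep their positions in the weight tensor, the network retains its full capacity during training; only the last call to `compact` changes the tensor shape.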

Experiments and Analysis
In this section, we provide extensive experimental results and analysis to illustrate the system performance of our algorithm on three popular remote sensing benchmarks: Remote Sensing Image Scene Classification, created by Northwestern Polytechnical University (NWPU-RESISC45) [45], Aerial Image Dataset (AID) [46], and RSSCN7 [47].

NWPU-RESISC45 Dataset
NWPU-RESISC45 (http://www.escience.cn/people/JunweiHan/NWPU-RESISC45.html) [45] is a popular public dataset for remote sensing image scene classification, which was extracted from Google Earth by experts at Northwestern Polytechnical University (NWPU). This dataset is made up of a total of 31,500 images, which are categorized into 45 scene classes as shown in Figure 4. Each class includes 700 images with a size of 256 × 256 pixels in the RGB color space.

RSSCN7 Dataset
The RSSCN7 (https://github.com/palewitout/RSSCN7) [47] dataset was released in 2015 by Wuhan University, China, which contains 2800 remote sensing images in total from seven typical scene categories: grasslands, forests, farmland, car parks, residential areas, industrial areas, and rivers and lakes, as shown in Figure 5. For each category, there are 400 images with the size 400 × 400 collected from Google Earth, and these pictures are sampled at four different scales. It is also a challenging dataset because the remote images were taken in different seasons and weather conditions with various sampling scales.

Figure 5. Example images of the RSSCN7 dataset (panel labels: Grass, Field, Industry, River Lake, Forest, Resident, Parking).

AID Dataset
AID (https://captain-whu.github.io/AID/) [46] is a large-scale aerial image dataset collected from Google Earth imagery. This dataset contains 10,000 images in total with a fixed size of 600 × 600 pixels within 30 classes, as shown in Figure 6. Compared to other classic datasets, the number of images is not equal across categories: each scene type contains between 220 and 420 images, which makes the image classification more challenging. Although the images in this dataset were acquired at different times and under different imaging conditions, some classes are quite similar, which reduces the differences between classes.

Deep Architecture
In the experiment, we chose a popular deep model, ResNet [15], as the backbone network because of the more complex and less redundant structure of ResNet compared to VGG models [48]. We also conducted experiments on the VGG models to further consolidate the superiority of our method.

Implementation Details
In this work, the PyTorch framework was used to implement the proposed method. In the dataset preprocessing, we followed the PyTorch guidance [49] to perform data augmentation. To be specific, all images in the AID dataset were resized from the original 600 × 600 pixels to 256 × 256 pixels, which matches the image size in the NWPU-RESISC45 dataset. Regarding the RSSCN7 dataset, we resized all images from 400 × 400 pixels to 256 × 256 pixels for the same reason.
We applied the same training ratio (80%) to make fair comparisons across experiments, and all three datasets were randomly divided into training and testing sets based on this pre-set ratio to calculate the overall classification accuracy. All the experiments were conducted three times to get fair and reliable results. The hardware configuration was the Linux Ubuntu 14.04 operating system with an i7-5960X CPU, 64 GB of RAM, and one NVIDIA GTX1080Ti GPU.
In the pruning stage, only one hyper-parameter P i = P was set to prune all the convolutional layers after finishing every training epoch. The pruning rate makes a trade-off between the compression and the accuracy throughout the pruning process [32]. Notably, the projection shortcuts do not need to be pruned for the compression because of the limited contribution to the overall costs when using ResNet [15] in the evaluation.
In the network training, we used Stochastic Gradient Descent (SGD) as the optimizer with the weight decay and momentum as 0.0005 and 0.9, respectively. The learning rates were set separately for two training phases: 0.01 in the first 50 epochs and 0.002 for the last 50 epochs. The training processes for all network models were terminated when the training losses converged. Moreover, the batch-size was set to 64 for the NWPU-RESISC45 dataset and 32 for the AID dataset and the RSSCN7 dataset to balance the requirements of the computer memory and the image number contained in the training and test sets.
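The training settings listed above can be collected in a small configuration sketch. The names `TRAINING_CONFIG` and `learning_rate` are hypothetical, and the total of 100 epochs is an assumption inferred from the two 50-epoch phases.

```python
# Hyper-parameters as stated in the text; the dictionary layout is illustrative.
TRAINING_CONFIG = {
    "optimizer": "SGD",
    "momentum": 0.9,
    "weight_decay": 0.0005,
    "batch_size": {"NWPU-RESISC45": 64, "AID": 32, "RSSCN7": 32},
    "total_epochs": 100,  # assumption: 50 + 50 epochs
}

def learning_rate(epoch):
    """Learning rate for a 0-indexed epoch: 0.01 for epochs 0-49, 0.002 afterwards."""
    if not 0 <= epoch < TRAINING_CONFIG["total_epochs"]:
        raise ValueError("epoch outside the assumed 100-epoch schedule")
    return 0.01 if epoch < 50 else 0.002
```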
We tested our method on VGG-16 [48] and ResNet-18, -34, -50, and -101 [15], respectively. In order to make fair comparisons, we used the pruning rate of 40% for the same model, while the pruning methods for both the scratch and the pretrained model during the same training epochs were also tested and analyzed. There is no need to conduct the fine-tuning after the scratch-wise model training in our method when compared to many previous approaches with the hard pruning manner.

Evaluation Metrics
Four different evaluation metrics, Accuracy (Acc.), Accuracy Drop (Acc. Drop), FLOPs, and pruning ratio (Pruning), were used in the experiments, where the abbreviations in parentheses denote the symbols used in the tables. Almost all model compression methods compare performance in terms of pruning ratio vs. model accuracy. We tuned the global pruning rate (normally a parameter) of each algorithm involved in the comparison, such that all the models pruned by different methods were approximately equal in size, e.g., 40% pruned. Then, we tested the pruned models on the remote sensing datasets and calculated the top-1 accuracy for the image classification task. In particular, Acc. measures the classification accuracy obtained by the specific method, which is expressed as the average and the standard deviation of the accuracy over three runs of the experiment. Acc. Drop is computed as the accuracy of the baseline model minus that of the pruned model, so a negative number implies that the pruned model achieves even higher accuracy than the baseline model. The smaller the Acc. Drop, the better the performance. Pruning denotes the actual compression ratio of the network model. The larger the pruning ratio, the more compact the model. FLOPs denotes the total number of floating-point operations, which is used as a reference metric in evaluating the pruning method.
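The bookkeeping behind these metrics can be sketched as follows. This is a minimal illustration, not the evaluation script used in the experiments: the function names are hypothetical, the sample standard deviation is an assumption (the text does not say which estimator was used), and the pruning ratio is assumed to be the percentage of parameters removed. Acc. Drop is taken as baseline minus pruned, so that a negative value means the pruned model is more accurate.

```python
from statistics import mean, stdev

def summarize_runs(accuracies):
    """Mean and sample standard deviation of the accuracy over repeated runs."""
    return mean(accuracies), stdev(accuracies)

def accuracy_drop(baseline_acc, pruned_acc):
    """Baseline minus pruned accuracy; negative means the pruned model is better."""
    return baseline_acc - pruned_acc

def pruning_ratio(original_params, pruned_params):
    """Percentage of parameters removed (assumed definition of 'Pruning')."""
    return 100.0 * (1 - pruned_params / original_params)
```

For instance, with the ResNet-101 accuracies reported for AID (84.10% for the original model and 84.17% for PFAM), `accuracy_drop(84.10, 84.17)` is negative, which reflects the small gain of the pruned model.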
Following [50] (Equation (3)), the top-$k$ prediction is any set of labels associated with the $k$ largest scores, with $k \in \{1, \dots, n-1\}$. Concretely, we assume that $Y^{(k)}$ is the set of $k$-tuples with $k$ distinct elements of the output space $\{1, \dots, n\}$, and $y$ denotes the set of $k$-tuples of the ground truth label. $s \in \mathbb{R}^n$ and $s_{[k]}$ are defined as the vector of scores per label and the $k$-th largest element of $s$, respectively. Next, as in Equation (4), we sum the values of $P_k(s)$ given by Equation (3) over a batch and divide by the batch size to obtain the top-$k$ accuracy for that batch. Finally, we obtain the average top-$k$ accuracy for all images by averaging over all batches.

$$\text{Top-}k\ \text{accuracy} = \operatorname{Avg}\left(\frac{\sum P_k(s)}{\text{batch size}}\right) \qquad (4)$$

Table 1 shows the results on the NWPU-RESISC45 dataset when applying ResNet-18, -34, -50, and -101, respectively, for remote sensing image classification, where our proposed method achieved superior performance compared to the other state-of-the-art filter pruning methods. In particular, PFEC [29] used the hard pruning method in pruning ResNet-18 and -34 with accuracies of 89.78% and 90.75%, respectively, whereas the figures from SFP [32] were 0.29% and 0.42% lower, and FPGM [35] obtained better results than these two methods. Although ThiNet [51] obtained the second highest accuracy in pruning ResNet-18 and -50 with 92.47% and 92.60%, which outperformed the other two soft pruning methods, PFAM still achieved the highest accuracies (92.56% and 92.87%) with compression ratios of 42.3% and 40.6%. Moreover, PFAM achieved the lowest accuracy drop of 0.27% on ResNet-34 with the second largest compression ratio. FPGM [35] achieved the best accuracy (92.64%) among the five methods when pruning ResNet-101, which was slightly better than our PFAM (92.52%). Note that, in this case, the pruning ratio of our PFAM (39.6%) was larger than that of FPGM [35] (38.1%). To sum up, our PFAM obtained a highly compressed network model with competitive performance in most cases on the NWPU-RESISC45 dataset.
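The batch-averaged top-k accuracy of Equation (4) can be sketched in plain Python. This is an illustrative sketch under an assumption: $P_k(s)$ is taken to be 1 when the ground truth label is among the k highest-scoring labels and 0 otherwise, with ties broken by label index. The function names are hypothetical.

```python
def topk_hits(scores, labels, k):
    """Number of samples whose true label is among the k highest-scoring labels."""
    hits = 0
    for s, y in zip(scores, labels):
        # Label indices sorted by descending score; ties keep index order.
        topk = sorted(range(len(s)), key=lambda j: s[j], reverse=True)[:k]
        hits += y in topk
    return hits

def average_topk_accuracy(batches, k=1):
    """Average over batches of (hits in the batch) / (batch size)."""
    per_batch = [topk_hits(s, y, k) / len(y) for s, y in batches]
    return sum(per_batch) / len(per_batch)
```

Note that averaging per-batch accuracies weights every batch equally, so a smaller final batch has the same influence as a full one.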

ResNet on the AID Dataset
Similarly, our proposed method consistently outperformed the other state-of-the-art methods on the AID dataset, as shown in Table 2. Although SFP [32] achieved the largest pruning ratios on ResNet-18, -34, -50, and -101, much higher accuracies were achieved by our proposed PFAM, where the gaps were 4.01%, 4.26%, 3%, and 3.17%, respectively. FPGM [35] obtained the second-best results in these experiments, and its accuracy was at least 0.19% lower than that of PFAM (on ResNet-18: 85.08% vs. 85.27%). Although hard pruning methods such as PFEC [29] and ThiNet [51] obtained higher accuracy than SFP [32], their performance was still worse than ours. It is worth mentioning that our filter pruning method even outperformed the original model in terms of accuracy when pruning ResNet-101 (84.17% vs. 84.10%). That indicates the ability of PFAM to produce a highly compressed model while maintaining competitive performance. To provide a more comprehensive verification of our method, we also tested it on pruning VGG-16 with training from scratch (Table 3) and ResNet based on the pretrained model (Table 4) on the AID dataset. The results in Table 3 show the effectiveness of the soft pruning methods compared to the hard ones on the VGG-16 model, where PFAM still achieved the best accuracy (86.03%) under the same compression ratio among the four pruning methods. The performance degradation was inevitable for the hard filter pruning methods like PFEC [29], while the soft filter pruning methods maintained the strong expressive ability of the network after the reconstruction stage on the pruned filters and thus obtained better performance. In Table 4, the proposed PFAM achieves the best accuracies in pruning ResNet-18 (89.86%), -34 (89.77%), and -101 (90.57%), which demonstrates the effectiveness of our method in pruning ResNet with the pretrained model.

Table 4. Accuracies (%) of pruning ResNet based on the pretrained model on the AID dataset.

Results on RSSCN7
For the RSSCN7 dataset, we tested our PFAM on ResNet-18, -34, -50, and -101 and VGG-16 with a 40% pruning ratio to provide comprehensive insights into the performance. In Table 5, PFAM obtains the best performance compared to the other three methods on VGG-16, which is consistent with Table 3. Although SFP [32], FPGM [35], and PFAM achieved the same compression ratio (39.8%), the accuracy of our method was much higher than theirs. Our proposed method was thus also effective at selecting and pruning redundant filters on VGGNet.
Moreover, in Table 6, unlike the previous results on the NWPU-RESISC45 and AID datasets, SFP [32] obtained the worst performance on every model except ResNet-101, even though it obtained the highest compression ratios. In contrast, FPGM [35] and our proposed method generally achieved better experimental results than the norm-criterion methods like PFEC [29] and SFP [32]. The reason for the worse performance is that the norm-criterion methods only focus on pruning each individual filter without considering the global correlation among all filters, which leads to suboptimal performance. Compared to the methods that select filters based on the norm-based criterion or the geometric median, PFAM uses the proposed attention module to find the correlation between filters globally, and this more advanced pruning strategy yields superior performance.

Discussion
In view of the comparison with PFEC [29], ThiNet [51], SFP [32], and FPGM [35] on three test datasets, our method generally yields competitive performance throughout extensive evaluations. We achieve the best performance on the vast majority of neural network architectures, which suggests that the way we select the to-be-pruned filters is better than that of the other methods. The main advantage of the proposed method is that the overall correlation between filters in one convolutional layer is considered in the pruning, rather than only the importance of each individual filter. To further investigate the performance difference between our method and the current leading methods under different pruning rates, we conducted extensive experiments on the performance variations when applying different pruning ratios to ResNet-18 and -34. The corresponding results are shown in Figure 7a,b. Since our method belongs to the soft pruning methods, we compared it with two other soft pruning methods for fairness. Hence, three soft filter pruning methods, SFP [32], FPGM [35], and PFAM (ours), were tested on the NWPU-RESISC45 and RSSCN7 datasets, respectively. It is worth mentioning that the performance of SFP [32] became far worse than that of the other two methods as the pruning ratio increased on both the RSSCN7 dataset and the NWPU-RESISC45 dataset.
For the RSSCN7 dataset, Table 7 shows that our method achieves the best performance at most pruning ratios (20%, 40%, and 60%) in pruning ResNet-18, while the accuracy of FPGM [35] is the highest at the extreme compression ratio of 80%; a similar pattern appears in the results for pruning ResNet-34. Given a small or moderate pruning rate, such as 20% or 40%, our proposed method can accurately find the most redundant filters by calculating the correlation between the filters, which explains why it outperforms the other algorithms at these rates across different datasets. With the increase of the pruning rate, however, our pruning method no longer performs the best, because too many filters are removed from each convolutional layer. When the pruning rate is as high as 80%, nearly 80% of the filters in each convolution layer need to be pruned, which greatly damages the final prediction.
On the NWPU-RESISC45 dataset, SFP [32] continues its poor performance with the increasing pruning ratio, as shown in Figure 7a,b. In Figure 7a, PFAM (green line) outperforms the other two methods for all the pruning ratios, while in Figure 7b, PFAM achieves the best performance under all settings on ResNet-34, except at the pruning rate of 60%, where FPGM [35] is better. It seems that our method occasionally performs worse than FPGM on one specific dataset. The reason might be that many filters have similar correlation values, which confuses our selection mechanism. In other words, our selection mechanism is not as good as that of FPGM in this particular case. Although our approach does not perform very well at a pruning rate of 60% because of the filter selection, we still achieve the best performance at the three other settings, which suggests that the way we choose the filters is better than that of the other comparison methods in most cases.
Although the experimental results on the three remote sensing datasets illustrate the effectiveness of the proposed filter pruning method, there are still some limitations. For example, the filter selection mechanism does not seem optimal at large compression ratios, though it works well at relatively small compression ratios. Besides, our method treats each convolution layer as having equal significance during the pruning stage, which may have a large negative impact on overall performance if many filters need to be pruned from an important layer. Likewise, the neural network architecture still has high redundancy if only a few filters are pruned from layers that are not significant. Therefore, how to prune filters in each convolution layer dynamically remains a topic for future research.

Conclusions and Future Work
In this paper, we present a novel method termed Pruning Filter with Attention Mechanism (PFAM) for lightweight remote sensing image classification. In particular, a correlation-based filter pruning criterion is applied, where the correlation between filters is determined by the attention mechanism. Different from previous pruning methods, we prune filters with the least correlation, which has a small negative impact on the overall correlation among filters in one layer. These less correlated filters are firstly pruned after the pruning stage in the current training epoch, then they are recovered and updated during the next training epoch. In this way, the training data are processed by the original model during the training process, while the compressed network model can be obtained in the end without the need for the extra fine-tuning stage. The proposed method is extensively evaluated on three public remote sensing image datasets, and the experimental results show that our method achieves superior performance compared to the state-of-the-art methods. Notably, PFAM achieves the best performance under all types of ResNet architectures and VGG-16 on the AID dataset, especially for ResNet-34 and -50, which obtains 0.67% higher accuracy than the state-of-the-art at similar compression ratios. However, we still treat each layer equally during the pruning stage, which may have a huge negative impact on overall performance if it happens to prune many filters from an important layer. Hence, we will design a sort of dynamic pruning strategy, which could determine the number of to-be-pruned filters at certain layers based on their importance levels. Doing so will help reduce the model complexity dramatically while affecting the accuracy little. Additionally, in future work, we will use our technique to compress deep CNN architectures for various applications, such as visual tracking [52][53][54], video analysis [55,56], and large-scale visual search [57,58].