A Deformable Convolutional Neural Network with Spatial-Channel Attention for Remote Sensing Scene Classification

Abstract: Remote sensing scene classification converts remote sensing images into classification information to support high-level applications, so it is a fundamental problem in the field of remote sensing. In recent years, many convolutional neural network (CNN)-based methods have achieved impressive results in remote sensing scene classification, but they have two problems in extracting remote sensing scene features: (1) fixed-shape convolutional kernels cannot effectively extract features from remote sensing scenes with complex shapes and diverse distributions; (2) the features extracted by CNNs contain a large amount of redundant and invalid information. To solve these problems, this paper constructs a deformable convolutional neural network that adapts the convolutional sampling positions to the shape of objects in the remote sensing scene. Meanwhile, spatial and channel attention mechanisms are used to focus on the effective features while suppressing the invalid ones. The experimental results indicate that the proposed method is competitive with state-of-the-art methods on three remote sensing scene classification datasets (UCM, NWPU, and AID). Our method achieves good classification performance under various scene types and sizes and training ratios.


Introduction
With the development of remote sensing, it has become increasingly convenient to obtain very-high-resolution land-cover images, which provide a reliable data source for remote sensing scene classification. As a basic problem in the field of remote sensing, remote sensing scene classification is widely used in land resources planning [1][2][3][4][5], urban planning [6][7][8], and disaster monitoring [9][10][11].
Remote sensing scene classification has always been a challenging problem because of the following characteristics.
(1) Remote sensing scenes have a complex outline and structure, whether the scene is a natural scene (island) or an artificial scene (church), as shown in Figure 1a.
(2) The spatial distribution of remote sensing scenes is complex. Remote sensing images are a bird's-eye view, so the direction, size, and position of the scenes are arbitrary. As shown in Figure 1b, the size of circular farmland is not fixed, and the position of sparse residential is arbitrary.
(3) There is intra-class diversity in remote sensing scenes. Affected by season, weather, light, and other factors, the same scene may have different forms of expression. As shown in Figure 1c, the forest has an obvious color difference due to different seasons; the church has a distinct shape difference due to different cultures.
(4) There is inter-class similarity in remote sensing scenes. As shown in Figure 1d, the parking lot and container are highly similar in color, shape, direction, and spatial distribution in remote sensing images. The same situation also exists in the highway and bridge.
The early remote sensing scene classification methods mainly utilized low-level handcrafted features, such as Gabor [12], local binary patterns (LBPs) [13], scale-invariant feature transform (SIFT) [14], and histogram of oriented gradients (HOG) [15]. Later, some methods aggregated low-level features to generate mid-level features, such as bag-of-visual-words (BoVW) [16], spatial pyramid matching (SPM) [17], improved Fisher kernel (IFK) [18], and vectors of locally aggregated descriptors (VLAD) [19]. These methods can deal with remote sensing scenes with simple shape and texture, but they fail to handle remote sensing scenes with complex structure and spatial distribution because they cannot extract high-level features.
The deep learning method automatically learns distinguishing and expressive high-level features from images. This kind of method first made a breakthrough in the field of image classification [20][21][22][23] and was then successfully applied to remote sensing scene classification. Li et al. [24] proposed a fusion strategy for remote sensing scene classification, which fuses the multi-layer features of a pre-trained CNN to achieve a discriminative feature representation. Lu et al. [25] investigated a bidirectional adaptive feature fusion strategy, which fuses deep learning features and SIFT features to obtain a discriminative image representation. He et al. [26] used covariance pooling to fuse the feature maps of different CNN layers to realize the rapid processing of large-scale remote sensing images. Flores et al. [27] proposed a method that combines CNN with the Gaussian mixture model to generate robust and compact features. Fang et al. [28] added a frequency-domain branch to CNN to enhance its robustness to rotated remote sensing images. Sun et al. [29] proposed a gated bidirectional network to fuse semantic-assist features and appearance-assist features, which addresses the information redundancy of multi-layer CNN features. Zheng et al. [6] proposed multiscale pooling (MSP), which improves the remote sensing scene classification performance by enhancing the ability of CNN to extract multiscale spatial information. Bi et al. [30] used an attention mechanism to enhance the ability to extract local semantic information from remote sensing scenes. Wu et al. [31] proposed a revolutionary neural network framework based on a group revolution scheme, which improves the efficiency of CNN. Xie et al. [32] proposed label augmentation to expand the remote sensing scene dataset, which realizes the classification of few-shot remote sensing scenes. Chen et al. [33] proposed a contextual information-preserved architecture learning (CIPAL) framework for remote sensing scene classification to utilize the contextual information.
Although the existing deep learning methods have made some achievements in remote sensing scene classification, they mostly enhance the expression of CNN features from the perspective of feature fusion (such as fusing handcrafted features; fusing multilayer CNN features; fusing contextual information). These methods usually add model parameters and computation. Different from these studies, our study designs a remote sensing scene classification method from basic theory, which considers the data types and task requirements. The main contributions of this study are summarized as follows.
(1) A Deformable CNN (D-CNN) is proposed. D-CNN breaks through the limitation of fixed convolution kernel size and enhances the feature extraction ability of remote sensing scenes with complex structure and spatial distribution.
The rest of this paper is organized as follows. Section 2 introduces the proposed method in detail, including feature extraction, feature enhancement, and classification. The experiments of our method on three datasets (UCM, NWPU, and AID) are shown in Section 3. Section 4 gives the discussion. Section 5 concludes this study.

Overall Architecture
The overall architecture of our proposed method is shown in Figure 2. It consists of three parts: feature extraction, feature enhancement, and classification. In the feature extraction, D-CNN extracts the high-level features of the input remote sensing scene images. In the feature enhancement, the spatial information in the CNN feature maps is enhanced by the spatial attention enhancement mechanism; then, the channel information in the spatial attention feature maps is enhanced by the channel attention enhancement mechanism; finally, the spatial-channel attention feature maps are obtained. In the classification, the spatial-channel attention feature maps are classified.

Feature Extraction
Extracting the features of remote sensing images using CNN is an important step of remote sensing scene classification methods based on deep learning, and the quality of feature extraction directly affects the classification result. The traditional CNN is limited by the shape of the convolution kernels and cannot adapt to remote sensing scenes with complex structure and spatial distribution. Generally, there are two methods to solve such a problem. One method is data augmentation, which constructs a dataset with sufficient transformations by enlarging, reducing, and rotating the original remote sensing images. The other method introduces additional features, such as scale-invariant or rotation-invariant features, to make the extracted features more adaptive. However, both methods bring a computational burden and make the classification algorithm complex.
By contrast, the deformable convolution [34] enhances its adaptability to complex remote sensing scenes by adding two offset parameters to each sampling position of the standard convolution. In this way, the sampling grid of the convolution can be shifted both horizontally and vertically. The comparison of standard convolution and deformable convolution is shown in Figure 3. The standard convolution is calculated as follows:
$$y(p_0) = \sum_{p_i \in R} W(p_i) \cdot x(p_0 + p_i)$$
where $p_0$ is a position on the output feature map $y$; $p_i$ is a position of the regular grid $R$ on the input feature map $x$, and $W$ is the weight. After the offset $\Delta p_i$ is added to $p_i$, $p_i + \Delta p_i$ represents the shifted sampling position on the feature map. The standard convolution is converted to deformable convolution as follows:
$$y(p_0) = \sum_{p_i \in R} W(p_i) \cdot x(p_0 + p_i + \Delta p_i)$$
The corresponding deformable pooling can be expressed as:
$$y(p_0) = \frac{1}{n(R)} \sum_{p_i \in R} x(p_0 + p_i + \Delta p_i)$$
where $n(R)$ is the number of positions in the regular grid $R$.
Based on deformable convolution and deformable pooling, this study constructs D-CNN, whose framework is shown in Table 1. D-CNN is composed of a deformable convolution layer, a deformable pooling layer, and four deformable convolution blocks, where ×n means that the block is repeated n times. Specifically, the first layer is a deformable convolution layer with 64 convolution filters of size 7 × 7. The feature is first extracted extensively by this larger deformable convolution, so that the information of the original image is preserved as much as possible and the feature can be extracted in detail by the subsequent deformable convolution blocks. The second layer is a deformable pooling layer with a pooling filter size of 3 × 3. The third layer consists of three instances of deformable convolution block 1. In deformable convolution block 1, 64 1 × 1 convolutions, 64 3 × 3 deformable convolutions, and 256 1 × 1 convolutions are stacked sequentially. Each block is connected internally through a shortcut connection to avoid network degradation caused by the increase of network depth.
Other deformable convolution blocks are similar to deformable convolution block 1. In a deformable convolution block, stacking multiple 3 × 3 deformable convolutions increases the number of sampling locations and improves the expressiveness of the features while significantly reducing the number of parameters. For example, compare three stacked 3 × 3 deformable convolutions with one 7 × 7 deformable convolution: (1) the number of sampling positions for three stacked 3 × 3 deformable convolutions is $(3 \times 3)^3 = 729$, while that for one 7 × 7 deformable convolution is 7 × 7 = 49; (2) the number of parameters for three stacked 3 × 3 deformable convolutions is $3 \times 3 \times 3 \times C_{out} \times C_{in} = 27 C_{out} C_{in}$, while that for one 7 × 7 deformable convolution is $7 \times 7 \times C_{out} \times C_{in} = 49 C_{out} C_{in}$, where $C_{out}$ and $C_{in}$ represent the numbers of output and input channels, respectively; (3) three stacked 3 × 3 deformable convolutions have two more activation functions than one 7 × 7 deformable convolution.
As shown in Figure 4, combining multiple deformable convolutions greatly improves the capability of deformable convolution. The small squares indicate the sampling points of the network, and the red arrows indicate the correspondence between the feature maps of two adjacent layers. From left to right, the feature maps are presented from low level to high level. The filter size of each layer is 3 × 3. The highlighted positions correspond to the highlighted units on the previous layer. It can be seen that the sampling positions of the standard CNN are fixed regardless of the object, while the deformable CNN can adapt to the shape of the object. The sampling points of the deformable CNN have a higher correlation with the object, which enhances the feature extraction of remote sensing scenes with complex structure and diverse distribution.
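The offset sampling described in this section can be illustrated with a minimal NumPy sketch. It is not the D-CNN implementation: it assumes a single-channel input, stride 1, zero padding, and bilinear interpolation at fractional positions (as in [34]), and the offsets are passed in as an array rather than predicted by a learned layer. The function names `bilinear_sample` and `deformable_conv2d` are hypothetical.

```python
import numpy as np


def bilinear_sample(x, py, px):
    """Bilinearly interpolate the 2-D array x at the fractional position (py, px).

    Positions outside the array contribute zero (zero padding).
    """
    h, w = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1.0 - abs(py - yy)) * (1.0 - abs(px - xx))
                value += weight * x[yy, xx]
    return value


def deformable_conv2d(x, weight, offsets):
    """Single-channel deformable convolution (illustrative only).

    x:       (H, W) input feature map
    weight:  (k, k) kernel, k odd
    offsets: (H, W, k*k, 2) per-output-position offsets (dy, dx), one pair
             for each position of the regular k x k grid
    """
    h, w = x.shape
    k = weight.shape[0]
    r = k // 2
    # Regular grid R, e.g. {(-1, -1), ..., (1, 1)} for a 3 x 3 kernel.
    grid = [(gy, gx) for gy in range(-r, r + 1) for gx in range(-r, r + 1)]
    y = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for n, (gy, gx) in enumerate(grid):
                dy, dx = offsets[i, j, n]
                # Sample at p0 + p_i + delta p_i instead of p0 + p_i.
                acc += weight[gy + r, gx + r] * bilinear_sample(
                    x, i + gy + dy, j + gx + dx
                )
            y[i, j] = acc
    return y
```

With all offsets set to zero, the function reduces to a standard zero-padded convolution (cross-correlation), which makes the relationship between the two operations easy to check.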

Feature Enhancement
Feature enhancement with attention mechanisms is a common and effective approach to improve deep learning methods. The attention mechanism in deep learning is similar to the human selective visual attention mechanism, which aims to select critical information from a multitude of information. In this study, a spatial-channel attention (SCA) mechanism is proposed for remote sensing scenes. Spatial attention information and channel attention information are extracted by the spatial attention module and the channel attention module, respectively. Based on this, comprehensive attention information can be obtained.

Spatial Attention Module
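The spatial attention module extracts the relationship between the spatial locations of the feature maps, as shown in Figure 5. Suppose the input feature maps are $P \in \mathbb{R}^{C \times H \times W}$, where C, H, and W represent the channel, height, and width of the feature maps, respectively. First, P is converted to $\{A, B, D\} \in \mathbb{R}^{C \times H \times W}$ by a convolution operation, and A and B are reshaped to $\mathbb{R}^{C \times N}$, where N = H × W. Then, matrix multiplication is performed between the transpose of A and B, and the spatial attention matrix $S \in \mathbb{R}^{(H \times W) \times (H \times W)}$ is obtained by softmax:
$$s_{ij} = \frac{\exp(A_i \cdot B_j)}{\sum_{k=1}^{N} \exp(A_k \cdot B_j)}$$
where $A_i$ and $B_j$ are the feature vectors of A and B at positions i and j, and $s_{ij}$ denotes the influence of position i on position j. The more similar the features of two locations, the greater the correlation between them. After this, the spatial attention matrix is multiplied with the feature map D to obtain the spatial location-enhanced feature map $F_s$:
$$F_s^{j} = \sum_{i=1}^{N} s_{ij} D_i$$
where $D_i$ is the feature vector of D at position i and $F_s^{j}$ is the feature vector of $F_s$ at position j.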
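The spatial attention computation can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the convolutions that produce A, B, and D are modeled as 1 × 1 convolutions with hypothetical weight matrices `w_a`, `w_b`, and `w_d` (the kernel size is not specified in the text), the softmax is normalized over the source position i for each target position j, and no residual connection or learnable scale is included.

```python
import numpy as np


def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def spatial_attention(p, w_a, w_b, w_d):
    """Spatial attention over a (C, H, W) feature map.

    w_a, w_b, w_d: (C, C) weights of assumed 1 x 1 convolutions.
    Returns the enhanced map F_s with shape (C, H, W) and the attention
    matrix S with shape (N, N), N = H * W.
    """
    c, h, w = p.shape
    flat = p.reshape(c, h * w)        # (C, N)
    a = w_a @ flat                    # (C, N)
    b = w_b @ flat                    # (C, N)
    d = w_d @ flat                    # (C, N)
    energy = a.T @ b                  # (N, N); energy[i, j] = A_i . B_j
    s = softmax(energy, axis=0)       # normalize over i for every j
    f_s = d @ s                       # F_s[:, j] = sum_i s[i, j] * D[:, i]
    return f_s.reshape(c, h, w), s
```

Each column of `s` sums to one, so every output position is a convex combination of the feature vectors of D over all positions.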

Channel Attention Module
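The channel attention module extracts the relationship between the individual channels of the feature maps, as shown in Figure 6. Unlike the spatial attention module, matrix multiplication is performed directly between P (reshaped to $\mathbb{R}^{C \times N}$) and the transpose of P, and the channel attention matrix $X \in \mathbb{R}^{C \times C}$ is obtained by softmax:
$$x_{ji} = \frac{\exp(P_i \cdot P_j)}{\sum_{k=1}^{C} \exp(P_k \cdot P_j)}$$
where $P_i$ and $P_j$ are the i-th and j-th channels of P, and $x_{ji}$ denotes the influence of channel i on channel j. After this, the channel attention matrix is multiplied with the feature map P to obtain the channel-enhanced feature map $F_c$:
$$F_c^{j} = \sum_{i=1}^{C} x_{ji} P_i$$
where $F_c^{j}$ is the j-th channel of $F_c$.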
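A corresponding NumPy sketch of the channel attention computation is given below. It assumes the softmax is normalized over the source channel i for each target channel j and omits any residual connection or learnable scale; the function name `channel_attention` is hypothetical.

```python
import numpy as np


def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def channel_attention(p):
    """Channel attention over a (C, H, W) feature map.

    Returns the enhanced map F_c with shape (C, H, W) and the attention
    matrix X with shape (C, C), where X[j, i] is the influence of
    channel i on channel j.
    """
    c, h, w = p.shape
    flat = p.reshape(c, h * w)        # (C, N)
    energy = flat @ flat.T            # (C, C); energy[j, i] = P_j . P_i
    x = softmax(energy, axis=1)       # normalize over i for every j
    f_c = x @ flat                    # F_c[j] = sum_i X[j, i] * P[i]
    return f_c.reshape(c, h, w), x
```

Because no convolution is applied before the matrix product, the module adds no parameters in this sketch; the attention matrix is computed from the feature maps alone.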

Classification
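In the classification, global average pooling is employed to reduce the dimension of the spatial-channel attention feature maps F from $\mathbb{R}^{C \times H \times W}$ to $\mathbb{R}^{C \times 1 \times 1}$, which greatly reduces the number of parameters. Then, the softmax function is used to achieve the final scene classification, with the loss defined as:
$$L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} I(y_i = j) \log \frac{\exp(\theta_j^{T} x_i)}{\sum_{l=1}^{C} \exp(\theta_l^{T} x_i)}$$
where $x_i$ is the result of feature concatenation for the i-th training sample; $y_i$ is the scene label; $\theta$ is the classifier parameter; C is the number of scene categories; N is the number of training samples, and $I(\cdot)$ is the indicator function.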
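The classification step can be sketched as global average pooling followed by a softmax over linear scores. The weight matrix `theta` and the function name `classify` are hypothetical, and a single linear layer without bias is assumed for illustration.

```python
import numpy as np


def classify(f, theta):
    """Global average pooling followed by a softmax classifier.

    f:     (C, H, W) attention-enhanced feature maps
    theta: (num_classes, C) classifier weights (assumed linear, no bias)
    Returns a probability vector of length num_classes.
    """
    pooled = f.mean(axis=(1, 2))          # (C,): one value per channel
    logits = theta @ pooled               # (num_classes,)
    logits = logits - logits.max()        # stabilize the exponentials
    exp = np.exp(logits)
    return exp / exp.sum()
```

The predicted scene is then the index of the largest probability, for example `int(np.argmax(classify(f, theta)))`.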

Datasets
To evaluate the effectiveness of the proposed method on remote sensing scenes, experiments are conducted on three remote sensing scene image datasets, and the comparison of the three datasets is shown in Table 2.  Figure 9. Among the current remote sensing scene datasets, AID has the largest image size, which provides richer information for scene classification.

Evaluating Indexes
Overall accuracy (OA) and confusion matrix (CM) are used as evaluation indexes for remote sensing scene classification.
(1) OA: It is defined as the proportion of correctly classified samples to the total number of samples in the test set. It simply and effectively represents the prediction capacity of the model on the overall dataset. OA is calculated as follows:
$$OA = \frac{1}{T} \sum_{i=1}^{m} \sum_{j=1}^{n} I\left(f(x_{ij}) = y_{ij}\right)$$
where T is the total number of samples in the test set; m and n are the total number of categories and the number of samples of each category, respectively; $f(\cdot)$ is a classification function that predicts the category of a single sample $x_{ij}$ (the j-th sample of the i-th category) in the test set; $y_{ij}$ is the sample label indicating the real category of the sample; $I(\cdot)$ is the indicator function, which takes the value of 1 when its argument is true and 0 when it is false.
(2) CM: It uses a matrix of N rows and N columns to represent the classification result, where each row represents the actual category and each column represents the predicted category. It can indicate the categories that are prone to confusion, thus more intuitively representing the performance of the algorithm.
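Both evaluation indexes can be computed from the true and predicted labels, as in the following NumPy sketch. The function names are hypothetical; labels are assumed to be integers in the range 0 to num_classes − 1.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are actual categories, columns are predicted categories."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    cm = np.zeros((num_classes, num_classes), dtype=int)
    # Accumulate one count per (actual, predicted) pair.
    np.add.at(cm, (y_true, y_pred), 1)
    return cm


def overall_accuracy(cm):
    """Correctly classified samples (the diagonal) over all samples."""
    return np.trace(cm) / cm.sum()
```

Computing OA from the diagonal of the confusion matrix is equivalent to counting the samples for which the predicted label equals the true label and dividing by T.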

Implementation Details
The experiments are conducted on an AI Studio platform equipped with Tesla V100 (32GB memory). The initial learning rate is 0.01. Every 20 epochs, the learning rate is divided by 10. Besides, the momentum is set to 0.9.
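The learning rate schedule described above can be written as a simple step function. The sketch below assumes epochs are counted from 0, so the first reduction takes effect at epoch 20; the function name `learning_rate` and its parameter names are hypothetical.

```python
def learning_rate(epoch, base_lr=0.01, step=20, factor=10.0):
    """Step decay: divide the learning rate by `factor` every `step` epochs."""
    return base_lr / (factor ** (epoch // step))
```

Under this assumption, epochs 0 to 19 use 0.01, epochs 20 to 39 use 0.001, and so on, while the momentum stays fixed at 0.9.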

Results on UCM
To demonstrate the superiority of our proposed method, it is compared with other methods on UCM, including Bidirectional adaptive feature fusion method (BDFF method) [25], Multiscale CNN (MCNN) [37], ResNet with weighted spatial pyramid matching collaborative representation-based classification (ResNet with WSPM-CRC) [38], VGG16 with multi-layer stacked covariance pooling (VGG16 with MSCP) [26], Gated bidirectional network (GBNet) [29], Feature aggregation CNN (FACNN) [39], Scale-free CNN (SF-CNN) [40], Deep discriminative representation learning with attention map method (DDRL-AM method) [41], and CNN based on attention-oriented multi-branch feature fusion (AMB-CNN) [42]. The training ratio of 80% is used on this dataset, and OA is taken as the evaluation index. The results are shown in Figure 10. Our method achieves the best OA of up to 99.62%. Then, CM is adopted to analyze the performance of our proposed method in detail, and the results are shown in Figure 11. The vast majority of results are correct. The error only occurs between confusing categories such as dense residential, medium residential, and mobile home park, while the classification results of other categories are correct.
The above experiments show that our method achieves good performance on UCM, which is a dataset with few scene types and a small sample size.

Results on NWPU
Compared with UCM, NWPU has 45 scene classes and 700 images per class. Therefore, NWPU can better reflect the performance of scene classification methods. On NWPU, our proposed method is compared with other methods, including CNN-CapsNet [43], Discriminative CNN with VGG16 (D-CNN with VGG16) [44], VGG16 with MSCP [26], Skip-connected covariance network (SCCov Network) [45], and AMB-CNN [42]. The training ratios of 10% and 20% are used on this dataset, respectively, and the results are shown in Figure 12. Regardless of the training ratio, our method achieves the best classification accuracy. Then, CM is adopted to analyze the performance of our proposed method in detail, and the results are shown in Figures 13 and 14. Our method achieves a good classification accuracy on each scene. When the training ratio is 10%, it achieves an OA of more than 90% on 26 of the 45 scenes and more than 80% on 44 scenes. When the training ratio is 20%, it achieves an OA of more than 90% on 43 of the 45 scenes and 100% on 9 scenes.
The above experiments show that our method still achieves a good classification accuracy on a dataset with many scene types and a large sample size.

Results on AID
Different from UCM and NWPU, the image size in AID reaches 600 × 600. To test the classification performance of our proposed method for large-scale remote sensing images, it is compared with other methods on AID, including CNN-CapsNet [43], D-CNN with VGG16 [44], VGG16 with MSCP [26], GBNet [29], SCCov Network [45], and AMB-CNN [42]. The training ratios of 20% and 50% are used on this dataset, respectively, and the results are shown in Figure 15. When the training ratio is 20%, our method achieves the best classification accuracy. When the training ratio is 50%, the difference between our method and the best classification method is 0.45%. Then, CM is adopted to analyze the performance of our proposed method in detail, and the results are shown in Figures 16 and 17. Our method achieves a good classification accuracy on each scene. When the training ratio is 20%, it achieves an OA of more than 90% on 24 of the 30 scenes, and more than 80% on all scenes, and even 100% on 9 scenes. When the training ratio is 50%, it achieves an OA of more than 90% on all scenes and even 100% for 12 scenes.

Analysis of D-CNN
In Section 2.2, the principle of D-CNN is described in detail. In the experiment, more tests are conducted to show the effectiveness of D-CNN. The test results are shown in Figure 18, where the yellow point indicates the activation unit and the red points indicate the sampling locations. In D-CNN, three deformable convolutional layers are stacked, and the size of the deformable filters in each layer is 3 × 3. Therefore, each activation unit corresponds to $(3 \times 3)^3 = 729$ sampling locations. The results clearly show that: if the activation unit is in the green space, the sampling locations are adjusted to the shape of the green space; if the activation unit is in the basketball court, the sampling locations are adjusted to the shape of the basketball court; if the activation unit is in the island, the sampling locations are adjusted to the shape of the island; if the activation unit is in the sea, the sampling locations are adjusted to the shape of the sea. In other words, the sampling locations are adaptively adjusted to the shape of objects in D-CNN.
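The counts quoted for stacked deformable convolutions can be verified with a few lines of Python. The helper names are hypothetical, and the parameter count covers only the kernel weights, expressed in units of $C_{out} C_{in}$ as in the comparison in this paper; any parameters used to predict the offsets are not counted.

```python
def sampling_locations(kernel, layers):
    """Sampling locations reached by one unit after `layers` stacked k x k layers."""
    return (kernel * kernel) ** layers


def kernel_weights_per_channel_pair(kernel, layers):
    """Kernel weights in units of C_out * C_in (offset predictors excluded)."""
    return layers * kernel * kernel


stacked_locations = sampling_locations(3, 3)            # three 3 x 3 layers
single_locations = sampling_locations(7, 1)             # one 7 x 7 layer
stacked_weights = kernel_weights_per_channel_pair(3, 3)
single_weights = kernel_weights_per_channel_pair(7, 1)
```

The stacked configuration therefore reaches 729 sampling locations with 27 weights per channel pair, against 49 locations and 49 weights for the single 7 × 7 layer.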

Analysis of SCA
To evaluate the feature enhancement ability of SCA, SCA is added to other classical CNNs, and experiments are conducted on the AID dataset; 20% of the data in the AID dataset is randomly selected for training and the remaining data is used for testing. Meanwhile, OA is taken as the evaluation index. The experimental results are shown in Figure 19. It can be seen that SCA is applicable to a variety of classic CNNs and improves the classification accuracy. Especially for GoogLeNet, the classification accuracy is improved by 5.35%.
As each attention module has a different function, the arrangement of the attention modules affects the overall performance. Table 3 summarizes the experimental results for different arrangements of the attention modules. Note that using the spatial attention module outperforms using the channel attention module. In addition, the spatial-channel attention module performs better than the channel-spatial module. This is because the deformable convolution in D-CNN changes the sampling positions of the convolution kernel, so the feature maps have discriminative spatial features, which helps the spatial attention module to enhance the spatial features. Moreover, the channel attention module associates scene types with the channels of the feature maps, which enhances the effectiveness of the overall approach. A reasonable arrangement of the attention modules improves the classification accuracy.

Visualization
In addition to using specific indices to evaluate the performance of the proposed method, this study also uses Gradient-weighted Class Activation Mapping (Grad-CAM) [45] to visualize the proposed method and analyze what the model focuses on. The results are shown in Figure 20. Grad-CAM shows, through a heat map, the distribution of the areas of the scene to which the proposed method is sensitive. The greater the contribution of an area to the classification result, the redder its color on the heat map. It is obvious that our method focuses well on discriminative positions.
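For reference, the core Grad-CAM computation can be sketched in NumPy. The sketch assumes that the activations of a chosen convolutional layer and the gradients of the target class score with respect to them have already been obtained from a deep learning framework; the function name `grad_cam` and the final normalization to [0, 1] are illustrative choices.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Grad-CAM heat map from layer activations and class-score gradients.

    activations: (K, H, W) feature maps of the chosen layer
    gradients:   (K, H, W) gradients of the class score w.r.t. activations
    Returns an (H, W) map scaled to [0, 1] (all zeros if nothing is positive).
    """
    # Channel weights: global average of the gradients over space.
    alpha = gradients.mean(axis=(1, 2))                 # (K,)
    # Weighted sum of the feature maps, keeping only positive evidence.
    cam = np.tensordot(alpha, activations, axes=1)      # (H, W)
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam
```

The resulting map is usually upsampled to the input image size and overlaid as a color heat map, with high values rendered in red.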

Conclusions
The complex shape and diverse distribution of remote sensing scenes bring challenges to remote sensing scene classification. To address this problem, this paper proposed a new feature extraction network called D-CNN and a new feature enhancement method called SCA. D-CNN uses deformable convolution to change the convolution sampling position.
Based on this, the applicability of the network to irregular remote sensing scene images and its feature extraction capability are improved. As for SCA, it first extracts key spatial information and then extracts key channels to enhance effective features while suppressing invalid features. Extensive experiments have been conducted on three datasets (UCM, NWPU, and AID). The experimental results indicate that our method achieves good classification performance under various scene types and sizes and training ratios.
Author Contributions: D.W. and J.L. contributed equally to the study and are co-first authors. All authors have read and agreed to the published version of the manuscript.