1. Introduction
With the rapid development of remote sensing technology and the growing number of satellites, ever more reliable data sources have become available to support land use investigation [1,2,3]. Images used in remote sensing image processing and interpretation tasks contain more complex structures and have higher resolutions than before. Traditional methods based on single pixels and their physical information are no longer suitable for such complex remote sensing image classification. In recent years, methods based on the entire image content have been used to extract high-level semantic information from remote sensing images [4,5].
A remote sensing image classification method commonly consists of two parts: a feature extractor and a classifier. According to how the features are extracted, these methods can generally be divided into two groups: traditional methods and deep learning (DL) methods. Traditional methods usually rely on hand-crafted features, which capture low-level surface information such as color and texture. In contrast, DL methods can extract more robust features through convolution operations and hierarchical structures, and thus obtain high-level features that encode abstract information, such as the objects in the image. In recent years, DL methods have gradually become mainstream in many artificial intelligence tasks such as speech recognition [6,7], semantic segmentation [8,9], and image retrieval [10].
In traditional methods, sparse representations [11,12] and morphological profiles (MPs) [13] are typically used as the fundamental feature extractors [14,15,16]. For example, Yin et al. [12] proposed a remote sensing image fusion algorithm based on sparse representations and color matching instead of the intensity hue saturation (IHS) color model and the Brovey transform. Logistic regression [17,18], decision trees [19], random forests [20], extreme learning machines (ELM), probabilistic neural networks (PNN), and support vector machines (SVM) [21,22] are usually adopted as the classifiers. For example, Thaseen et al. [23] constructed a multiclass SVM to decrease the training and testing time and increase the individual classification accuracy. Han et al. [24] compared random forest with ELM, PNN, and SVM for the intelligent diagnosis of rotating machinery, showing that random forest outperforms the other classifiers in terms of recognition accuracy, stability, and robustness to features, especially with a small training set. In general, traditional methods capture only shallow information, so it is difficult for them to achieve excellent performance. Therefore, DL methods are more suitable for classifying remote sensing images with rich semantic information.
As mentioned above, feature extractors and classifiers are usually separate in traditional methods. DL classification models, in contrast, are generally end-to-end, so the feature extractor and the classifier are trained and used for prediction simultaneously. Owing to their good performance, DL methods have also been widely used in remote sensing image processing and interpretation, such as hyperspectral pixel classification [25,26,27,28] and scene classification [29,30]. To deal with different problems flexibly, the basic convolution operation also has many variants, such as dilated convolution [31], deformable convolution [32], and transposed convolution [33]. For example, Rizaldy et al. [34] introduced a multiscale fully convolutional network (MS-FCN) with dilated convolution operations and a fully convolutional network (FCN) [35] to minimize the loss of information during the point-to-image conversion. Körez et al. [36] proposed a multiscale Faster R-CNN method with deformable convolution for single/low GPU systems, which compensates for the limitations of fixed anchor shapes. Wang et al. [33] proposed an anchor-free object detection method with a transposed convolution operation. These developments show that the convolution operation is a powerful and flexible feature extractor.
In image classification, many classical models have been proposed, such as AlexNet [37], GoogLeNet [38], and VGG [39]. In addition, there are many improved classification methods. He et al. [40] proposed ResNet by adding skip connections between specific layers, which alleviates the vanishing gradient problem during backpropagation. Xie et al. [41] proposed ResNeXt by introducing the split–transform–merge operation of Inception [38] into ResNet. Huang et al. [42] proposed DenseNet, aiming to make the output of a block contain the information of all layers in that block. DenseNet consists of many dense blocks and a classifying layer, like ResNet, and within each dense block the input of each layer is the concatenation of the outputs of all the previous layers. Chen et al. [43] took advantage of both ResNet and DenseNet and proposed the Dual-Path Network (DPN), which consists of a residual block and a dense block in parallel. Hu et al. [44] introduced Squeeze-and-Excitation (SE) into ResNet and proposed SE-Net. In the SE block, the output features are first squeezed by global average pooling and then transformed into a group of weights by two fully connected layers, and these weights are multiplied with the original features channel by channel.
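To make this mechanism concrete, the following is a minimal PyTorch-style sketch of an SE block; the class name, reduction ratio, and tensor sizes are illustrative assumptions rather than the exact SE-Net configuration.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: global pooling -> two FC layers -> channel-wise rescaling."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global spatial pooling
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                            # excitation: per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.pool(x).view(b, c)                  # (B, C)
        w = self.fc(w).view(b, c, 1, 1)              # (B, C, 1, 1)
        return x * w                                 # reweight each channel of the input

# Example: rescale a 64-channel feature map (shapes chosen for illustration)
features = torch.randn(2, 64, 32, 32)
print(SEBlock(64)(features).shape)                   # torch.Size([2, 64, 32, 32])
```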
The above models, based on convolutional neural networks, have been used in many remote sensing tasks. In the past few years, modeling higher-order statistics for more discriminative image representations has attracted great interest in deep ConvNets. In 2019, many new methods based on second-order features in convolutional neural networks were proposed [45,46,47,48], and they achieved good results in many image processing tasks, such as image classification [45], semantic segmentation [46], image super-resolution [47], and pedestrian re-identification [48]. Gao et al. [45] proposed the GSoP-Net model, introducing Global Second-order Pooling (GSoP) from lower to higher layers to exploit holistic image information. After a nonlinear transformation, the covariance matrix obtained by the GSoP layer is used for tensor scaling along the channel dimension, so that GSoP-Net can make full use of the second-order statistics of the holistic image.
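As a rough illustration of this second-order pooling idea, the sketch below turns a channel covariance matrix into channel weights; the class name, channel reduction, flattening of the covariance matrix, and the single linear layer are simplifying assumptions and do not reproduce the exact GSoP-Net design.

```python
import torch
import torch.nn as nn

class SecondOrderChannelAttention(nn.Module):
    """Simplified second-order pooling: a channel covariance matrix drives channel rescaling."""
    def __init__(self, channels: int, reduced: int = 32):
        super().__init__()
        self.reduce = nn.Conv2d(channels, reduced, kernel_size=1)    # lower the channel dimension
        self.fc = nn.Sequential(
            nn.Linear(reduced * reduced, channels),
            nn.Sigmoid(),                                            # per-channel weights
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        z = self.reduce(x).flatten(2)                         # (B, C', H*W)
        z = z - z.mean(dim=2, keepdim=True)                   # center each channel
        cov = torch.bmm(z, z.transpose(1, 2)) / (h * w - 1)   # (B, C', C') covariance
        weights = self.fc(cov.flatten(1)).view(b, c, 1, 1)
        return x * weights                                    # scale channels by second-order statistics
```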
At the same time, the attention mechanism in convolutional neural networks has also developed in many directions in image processing tasks. Wang et al. [49] proposed ECA-Net with an Efficient Channel Attention (ECA) module, which involves only k (k < 9) parameters yet brings a clear performance gain.
However, the features in the models mentioned above are all of a single scale, and the scale variation of the objects in an image has a great influence on these models. As shown in Figure 1, it is difficult to classify remote sensing scenes when the size of the objects varies greatly. Some studies aim to solve this problem. In [50], the original images were cropped into different sizes and rescaled to the original size; the resulting multiscale images were then used as inputs to obtain a scale-aware model. Zeng et al. [51] proposed a novel end-to-end CNN that integrates global-context features (GCFs) and local object-level features (LOFs), which makes the method more discriminative in scene classification. Liu et al. [30] trained a CNN with samples of multiple sizes and used spatial pyramid pooling (SPP) [52] so that inputs of different sizes produce outputs of the same size.
The scene classification methods mentioned above mostly use a single receptive field and thus do not take full advantage of the flexibility of the convolution operation. Traditional image multiscale methods need to crop or resize the inputs, which leads to some loss of detail. In addition, if the input size of a convolutional neural network is not fixed, additional methods, such as SPP, need to be introduced at prediction time. Owing to the hierarchical structure of convolutional neural networks, fusing features from different layers has become another multiscale strategy and has achieved good results in object detection. However, shallow low-level features are not robust, so the multiscale features obtained in this way are not well suited to classification tasks. A new method is therefore needed to address the strong influence that object scale variation has on the models. Dilated convolution can expand the receptive field of the convolutional kernel without introducing additional parameters, so that the kernel can focus on objects of different sizes. Therefore, using this property of dilated convolution, a novel multiscale feature extraction module is proposed in this paper to extract features of objects of different sizes in the image.
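To make this idea concrete, the sketch below shows two convolution branches that share one weight tensor but use different dilation rates, so their receptive fields differ while no extra parameters are added; the class name, dilation rates, and channel sizes are illustrative assumptions, not the exact configuration of the proposed module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedDilatedBranches(nn.Module):
    """Two branches sharing one 3x3 weight tensor: dilation 1 gives a 3x3 receptive field,
    dilation 2 gives a 5x5 receptive field, with no additional parameters."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_ch, in_ch, 3, 3))
        nn.init.kaiming_normal_(self.weight)

    def forward(self, x: torch.Tensor):
        small = F.conv2d(x, self.weight, padding=1, dilation=1)  # 3x3 receptive field
        large = F.conv2d(x, self.weight, padding=2, dilation=2)  # 5x5 receptive field, same weights
        return small, large

# Example: both branches keep the spatial size and share parameters
x = torch.randn(2, 3, 64, 64)
f_small, f_large = SharedDilatedBranches(3, 16)(x)
print(f_small.shape, f_large.shape)   # both torch.Size([2, 16, 64, 64])
```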
In addition, directly adding or concatenating two feature maps implies that they contribute equally to the model. In [53], features from different layers are combined by element-wise addition; the local information in the shallow layers and the semantic information in the deep layers are fused without introducing additional parameters. In [38], the Inception structure combines features from different kernels by concatenating them along the channel dimension, which introduces additional parameters in the next layer. In practice, these fusion schemes are often too rigid and cannot fully exploit the information contained in the two input feature maps. The channel attention mechanism can assign a different weight to each channel and thus effectively select which channel information is more important. Therefore, a channel attention mechanism with a squeeze-and-excitation operation can be used to adaptively weight the corresponding channels of the two feature maps according to global information. Taking advantage of the squeeze-and-excitation operation, a multiscale feature-fusion module is proposed to fuse the two input feature maps according to their contributions to the model.
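The following sketch shows one possible form of such adaptive fusion, where per-channel weights for the two inputs are learned from pooled global information; the class name, pooling choice, hidden width, and softmax over the two branches are assumptions made for illustration, not the exact design of the proposed fusion module.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Fuse two feature maps with per-channel weights derived from their global statistics."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, 2 * channels),    # one weight per channel per branch
        )

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        bsz, c, _, _ = a.shape
        g = self.pool(a + b).view(bsz, c)                      # squeeze: global information
        w = self.fc(g).view(bsz, 2, c, 1, 1).softmax(dim=1)    # excitation: branch-wise weights
        return w[:, 0] * a + w[:, 1] * b                       # weighted channel-wise fusion
```

In this form the two branch weights sum to one for each channel, so the fusion reduces to simple averaging only when neither branch is preferred; otherwise the more informative branch dominates that channel.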
Following the above analysis of multiscale feature extraction and feature fusion, MSAA-Net is proposed to address the problem that object sizes vary greatly within images of the same category. Unlike the methods mentioned above, which use samples of different sizes as inputs, a low-level feature extraction module with different receptive fields is adopted to capture multiscale features. The multiscale features are then fed into a feature-fusion module and merged adaptively. Finally, a deep feature extraction module is designed by combining the classic residual block structure with the attention mechanism. In this module, the skip connection in the residual block alleviates the vanishing gradient problem, and the attention mechanism sparsifies and enhances the features extracted by the convolutions. The deep classification module therefore enhances the fused features via a self-attention convolution block and outputs the final classification results.
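A minimal sketch of a residual block combined with channel attention of this kind is given below; the class name, layer sizes, and the placement of the attention operation are illustrative assumptions rather than the exact structure of the deep classification module.

```python
import torch
import torch.nn as nn

class AttentiveResidualBlock(nn.Module):
    """Residual block whose output is rescaled by channel attention before the skip connection."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.attention = nn.Sequential(                # squeeze-and-excitation style channel weights
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.body(x)
        y = y * self.attention(y)          # enhance informative channels, suppress the rest
        return self.relu(x + y)            # skip connection eases gradient flow
```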
The main contributions of this paper are as follows.
A novel multiscale feature extraction module containing two convolution block branches is proposed to extract features at different scales. The two branches have the same structure but different receptive fields. Convolutional kernels with different receptive field sizes can capture features at different scales, and no additional parameters are introduced because the parameters of the two branches are shared.
A multiscale feature-fusion module is designed for the proposed network. In this module, a squeeze process is used to obtain global information and an excitation process is used to learn the weights of the different channels. With this global information, the proposed method can select the more useful information from the two feature maps for adaptive fusion.
A deep classification module with an attention mechanism is proposed to extract high-level semantic features and generate the final classification results. In this module, the skip connection alleviates the vanishing gradient problem, and the attention mechanism sparsifies and enhances the features.
The rest of the paper is organized as follows. Section 2 presents the proposed method in detail. Section 3 presents the data and experimental results. A discussion is provided in Section 4. Finally, the conclusions are provided in Section 5.