Attention-Based Pyramid Network for Segmentation and Classification of High-Resolution and Hyperspectral Remote Sensing Images

Abstract: Unlike conventional natural (RGB) images, the inherent large scale and complex structures of remote sensing images pose major challenges, such as spatial object distribution diversity and spectral information extraction, when existing models are directly applied to image classification. In this study, we develop an attention-based pyramid network for the segmentation and classification of remote sensing datasets. Attention mechanisms are used to develop the following modules: (i) a novel and robust attention-based multiscale fusion method that effectively fuses useful spatial or spectral information at different and the same scales; (ii) a region pyramid attention mechanism using region-based attention that addresses the target geometric size diversity in large-scale remote sensing images; and (iii) cross-scale attention in our adaptive atrous spatial pyramid pooling network that adapts to varied contents in a feature-embedded space. Different forms of feature fusion pyramid frameworks are established by combining these attention-based modules. First, a novel segmentation framework, called the heavy-weight spatial feature fusion pyramid network (FFPNet), is proposed to address the spatial problem of high-resolution remote sensing images. Second, an end-to-end spatial-spectral FFPNet is presented for classifying hyperspectral images. Experiments conducted on the ISPRS Vaihingen and ISPRS Potsdam high-resolution datasets demonstrate the competitive segmentation accuracy achieved by the proposed heavy-weight spatial FFPNet. Furthermore, experiments on the Indian Pines and University of Pavia hyperspectral datasets indicate that the proposed spatial-spectral FFPNet outperforms the current state-of-the-art methods in hyperspectral image classification.


Introduction
Supervised segmentation and classification are important processes in remote sensing image perception. Many socioeconomic and environmental applications, including urban and regional planning, hazard detection and avoidance, land use and land cover, as well as target mapping and tracking, can be handled by using suitable remote sensing data and effective classifiers [1,2]. With the development of modern remote sensing technology, a great deal of data with different spectral and spatial resolutions is currently available for different applications. Among these massive remote sensing data, high-resolution and hyperspectral images are two important types. High-resolution remote sensing images usually have rich spatial distribution information and a few spectral bands, which capture the detailed shape and appearance of objects [3]. Semantic segmentation is a powerful and promising scheme for assigning class labels to pixels in high-resolution images [4,5]. Hyperspectral images can capture hundreds of narrow spectral channels with an extremely fine spectral resolution, allowing accurate characterization of the electromagnetic spectrum of an object and facilitating a precise analysis of soils and materials [6]. Because each pixel can be considered a high-dimensional vector surrounded by a local spatial neighborhood, supervised spatial-spectral classification methods are suitable for hyperspectral images.
However, the segmentation or classification of different types of remote sensing images is an exceedingly difficult process, which includes the major challenges of spatial object distribution diversity (Figure 1) and spectral information extraction. Specifically, the challenges in the segmentation and classification of remote sensing images are as follows:

• Missing pixels or occlusion of objects: different from traditional (RGB) imaging methods, remote sensing examines an area from a significantly long distance and gathers information and images remotely. Due to the large areas contained in one sample and the effects of the atmosphere, clouds, and shadows, missing pixels or occlusion of objects are inevitable problems in remote sensing images.

• Geometric size diversity: the geometric sizes of different objects may vary greatly, and some objects are small and crowded in remote sensing imagery because of the large area covered comprising different objects (e.g., cars, trees, buildings, and roads in Figure 1).

• High intra-class variance and low inter-class variance: this is a unique problem in remote sensing images, and it inspires us to study superior methods that effectively fuse multiscale features. For example, in Figure 1, buildings commonly vary in shape, style, and scale; low vegetation and impervious surfaces are similar in appearance.

• Spectral information extraction: hyperspectral image datasets contain hundreds of spectral bands, and it is challenging to extract spectral information because of the similarity between the spectral bands of different classes and the complexity of the spectral structure, leading to the Hughes phenomenon or curse of dimensionality [7]. More importantly, hyperspectral datasets usually contain a limited number of labeled samples, thus making it difficult to extract effective spectral information from hyperspectral images.
First, to solve the problem of spatial object distribution diversity in high-resolution images, it is necessary to effectively extract and fuse features in multiple scales. Recently, deep-learning methods have shown excellent performance in remote sensing image processing, especially deep convolutional neural networks (DCNNs), which have strong ability to express multiscale features (such as FCNs [8], S-RA-FCN [9], DeepLabv3 [10], and DeepLabv3+ [11]). To date, many models based on DCNNs for semantic segmentation of remote sensing images have been proposed. Sun and Wang [4] established a semantic segmentation scheme based on fully convolutional networks [8]. Wang et al. [12] proposed a gated network based on the information entropy of feature maps. This method can effectively integrate local details with contextual information. The cascaded convolutional neural networks [5,13] were utilized for the segmentation of remote sensing images by successively aggregating contexts. Most recently, many multiscale context-augmented models [9,14,15] have been proposed to exploit contextual information in remote sensing images. Remote sensing target segmentation problems such as object occlusion, geometric size diversity, and small objects have attracted increasing research attention [16][17][18][19].
Further analysis of these multiscale/contextual feature fusion models reveals that their common objective is to establish an effective feature attention weight fusion method. Attention mechanisms are widely used for various tasks such as machine translation [20], scene classification, and semantic segmentation. The non-local network [21] first adopts a self-attention mechanism as a submodule for computer vision tasks. Recently, many attention-reinforced mechanisms [9,22,23] have been proposed on the basis of non-local operation in semantic segmentation. Attention U-Net [24] learns to suppress irrelevant areas in an input image while highlighting useful features for a specific task on the basis of cross-layer self-attention. CCNet [25] harvests the contextual information of all the positions in one image by stacking two serial criss-cross attention modules. ACFNet [26] is a coarse-to-fine segmentation network based on the attention class feature module, which can be embedded in any base network. Most recently, various self-attention mechanisms have proven to be effective for solving the problem of multiscale feature fusion in feature pyramid-based models [27][28][29][30].
In summary, the above-mentioned multiscale feature fusion models based on attention mechanisms apply convolutional neural networks (CNNs) to three-band data and have achieved significant breakthroughs in semantic segmentation. However, these models still cannot effectively solve the problem of spatial distribution diversity in remote sensing for the following reasons: (1) Most models only consider the fusion of two or three adjacent scales and do not further consider how to achieve feature fusion across more, or even all, of the different scale layers. Improved classification accuracy can be achieved by combining useful features at more scales. (2) Although a small number of attention mechanisms (such as GFF [31]) consider the fusion of more layers, they do not successfully bridge the semantic gaps between high- and low-level features. A detailed analysis of the different feature layers is discussed in Section 2.1.
The novel attention mechanisms based on self-attention mainly focus on spatial and channel relations for semantic segmentation (such as the non-local network [21]). Regional relations are not considered for the remote sensing images, and thus the relationship between object regions cannot be deepened.
Review of Spatial-Spectral Classification for Hyperspectral Images by Multiscale Feature Processing
To solve the problem of spectral information extraction in hyperspectral images and enhance classification performance, spatial-spectral classification methods have gained prominent application in hyperspectral image processing, mainly including handcrafted feature-based approaches [32][33][34][35] and deep learning methods. Since deep learning methods (especially DCNNs) have proven to be more advantageous in feature extraction and representation compared with traditional shallow learning methods, this paper mainly focuses on deep spatial-spectral feature extraction and representation by multiscale feature processing in DCNNs. A review of DCNN-based classification methods for spatial-spectral approaches is given in [6], including 1D or 2D CNNs [36,37], 2D + 1D CNNs [38], and 3D CNNs [39][40][41]. However, although these methods achieve promising performance for hyperspectral classification, they cannot fully extract and represent features, because they utilize the features of only the last convolutional layer for classification without considering the multiscale features obtained by the previous convolutional layers. To this end, Zhao et al. [42] proposed a multiple convolutional layer fusion framework to fuse features extracted from different convolutional layers. The fusion process mainly involves majority voting or direct concatenation after applying a fully connected layer to each convolutional layer. CNNs with multiscale convolutions (MS-CNNs) [43] were proposed to address the limited number of training samples and differences in class variance for hyperspectral images by extracting deep multiscale features. By conducting experiments on three popular hyperspectral images, Imani and Ghassemian [44] demonstrated that although feature fusion methods are time-consuming, they can provide superior classification accuracy compared to other methods.
Imani and Ghassemian [44] also showed that multiscale feature fusion has developed into one of the main trends in hyperspectral image classification. Furthermore, attention mechanisms are used to extract and fuse contextual features. Haut et al. [45] were the first to develop a visual attention-driven mechanism for spatial-spectral hyperspectral image classification, which applies the attention mechanism to residual neural networks. Mei et al. [46] proposed a spatial-spectral attention network for hyperspectral image classification using an RNN and a CNN, both equipped with attention mechanisms. However, these methods are only initial applications of multiscale fusion and the attention mechanism to hyperspectral datasets. There is still room for improvement in the following aspects of hyperspectral image classification: (1) When dealing with the hyperspectral spatial neighborhood of the considered pixel, the semantic gap in multiscale convolutional layers is not considered, and simple fusion is not the most effective strategy. (2) The spectral redundancy problem is not sufficiently considered in existing hyperspectral classification models. With regard to such a complex spectral distribution, there is exceedingly little work on extracting spectral information from coarse to fine (multiscale) processing through different channel dimensions.
Bearing the above challenges in mind, in this study, we propose an attention-based pyramid network by using a self-attention mechanism flexibly. Our model utilizes attention mechanisms in the following three areas: (1) We propose attention-based multiscale fusion to fuse useful features at different and the same scales, achieving the effective extraction and fusion of spatial multiscale information and the extraction of spectral information from coarse to fine scales. (2) We propose cross-scale attention in our adaptive atrous spatial pyramid pooling (adaptive-ASPP) network to adapt to varied contents in a feature-embedded space, leading to effective extraction of context features.
(3) We propose a region pyramid attention module based on region-based attention to address the target geometric size diversity in large-scale remote sensing images.
Through different combinations of these attention modules, different forms of feature fusion pyramid frameworks (two-layer and three-layer pyramids) are established. First, a novel and practical segmentation model, called the heavy-weight spatial feature fusion pyramid network (FFPNet), is proposed to solve the spatial object distribution diversity problem in high-resolution remote sensing images. The heavy-weight spatial FFPNet is a three-level feature fusion pyramid built on the basis of region pyramid attention and attention-based multiscale fusion modules. Furthermore, boundary-aware (BA) loss [47] is used to train the heavy-weight spatial FFPNet in an end-to-end manner. Second, a spatial-spectral FFPNet is developed to extract and integrate multiscale spatial features and multi-dimensional spectral features of hyperspectral images using the attention-based multiscale fusion module. The spatial-spectral FFPNet mainly consists of two modules: a light-weight spatial feature fusion pyramid (FFP) and a spectral FFP. The light-weight spatial FFP is a two-level pyramid, whose trainable parameters are less than one-third those of the heavy-weight spatial FFPNet. Thus, the light-weight module is suitable for a small number of labeled samples of the hyperspectral dataset. In addition, the spectral FFP, which is also a two-level pyramid, is proposed to better extract the spectral features from hyperspectral datasets by compressing spectral information from coarse to fine scales.
To evaluate the accuracy and efficiency of the proposed models, first, extensive experiments are conducted on two challenging high-resolution semantic segmentation benchmark datasets, namely the ISPRS (International Society for Photogrammetry and Remote Sensing) Vaihingen dataset and the ISPRS Potsdam dataset. The experimental results demonstrate that the heavy-weight spatial FFPNet outperforms other predominant DCNN-based models (with DeepLabv3+ [11] as the baseline). In addition, the effectiveness and practicability of these novel attention-based mechanisms are demonstrated through an ablation study. Furthermore, we apply the spatial-spectral FFPNet to two popular hyperspectral datasets, namely the Indian Pines dataset and the University of Pavia dataset. The experimental results (with the well-known CNN model [40] as the baseline) indicate that the spatial-spectral FFPNet is more robust to the small number of training samples in hyperspectral datasets and obtains state-of-the-art results under different numbers of training samples. Our proposed spatial-spectral FFPNet has an excellent ability to extract and express multiscale spatial and spectral information. It is worth noting that the spatial-spectral FFPNet with data enhancement is a better choice for hyperspectral image classification when the sample size is extremely small.

Overview
In this study, we focus on the challenge of the spatial and spectral distribution of remote sensing images in "encoder-decoder" frameworks [9,11,12,[48][49][50]. The encoder part is based on a convolutional model that generates a feature pyramid with different spatial levels or spectral dimensions. Then, the decoder fuses the multiscale contextual features. The interaction of adjacent scales can be formulated as F_l = H(X_l, F_l+1), where F_l is the fused feature at the lth level, X_l is the encoder feature at the lth level, and H represents a combination of multiplication [49,51], weighted sum [31], concatenation [50], attention mechanisms [27,48,52], and other operations [12]. However, these operations cannot solve the problem of multiscale feature fusion of objects in remote sensing images. The main reason is that the feature maps from the lower layers are of high resolution and may contain excessive noise, resulting in insufficient spatial details for the high-level features. Further, these integrated operations may suppress necessary details in the low-level features, and most of these fusion methods do not consider the large semantic gaps between the feature pyramid levels generated by the encoder. Furthermore, these operations do not consider the effective extraction and fusion of multiscale spectral information in hyperspectral images.
Therefore, we propose a multi-feature fusion model based on attention mechanisms in this paper. Current attention mechanisms [9,22,24,53] are based on the non-local operation [21] and usually deal with spatial pixel and channel selections. These mechanisms cannot capture the regional relations of objects and cannot effectively extract and integrate multiscale features in remote sensing images. To address these issues, three novel attention modules are proposed: (1) A region pyramid attention (RePyAtt) module is proposed to effectively establish relations between different region features of objects and relationships between local region features by using a self-attention mechanism on different feature pyramid regions; (2) An adaptive-ASPP module aims to adaptively select different spatial receptive fields to tackle large appearance variations in remote sensing images by adding an adaptive attention mechanism to the ASPP [10,11]; (3) A multiscale attention fusion (MuAttFusion) module is proposed to effectively fuse useful features at different scales and the same scales.
As shown in Figure 2, the segmentation and classification schemes for remote sensing images are achieved through different combinations of the proposed attention modules. First, for high-resolution images, most of the information is concentrated in the spatial dimensions. The proposed heavy-weight spatial FFPNet segmentation model solves the spatial object distribution diversity problem in remote sensing images. We adopt ResNet-101 [54] pretrained on ImageNet [55] as the backbone of the segmentation model. A three-level feature fusion pyramid is designed as shown in Figure 2. In addition, the residual convolution (ResConv) module is used as the basic processing unit, while the adaptive-ASPP module is used to adaptively combine the context features generated from ResNet-101 and ResConv. For the heavy-weight spatial FFPNet, given a high-resolution (3-band) image, ResNet-101 pretrained on ImageNet [55] is used as the backbone for feature extraction (middle, where W denotes the height or width of the image). The heavy-weight spatial FFPNet is a three-level feature fusion pyramid. The detailed configurations of the heavy-weight spatial FFPNet are described in Section 2.5. Furthermore, BA loss is used to train the heavy-weight FFPNet in an end-to-end manner. For the spatial-spectral FFPNet, given a hyperspectral image of size p × H × W, where p is the number of spectral bands, the image is sent to the light-weight spatial FFP and the spectral FFP modules simultaneously. The light-weight spatial FFP is a two-level pyramid, and VGGNet-16 pretrained on ImageNet [55] is used as the backbone. Notably, the initial parameters of the first convolutional layer in the pretrained network are copied until the p-channel inputs are attained. Furthermore, fully connected layers are used to effectively merge the multiscale spatial features obtained by the light-weight spatial FFP with the spectral features obtained by the spectral FFP and to predict the class of all pixels.
The detailed description of the spatial-spectral FFPNet is presented in Section 2.6.
Second, for hyperspectral images, the proposed spatial-spectral FFPNet extracts and integrates multiscale spatial and spectral features. Recalling Figure 2, the spatial-spectral FFPNet includes three parts: (1) multiscale spatial feature extraction with the light-weight spatial FFP; (2) multiscale spectral feature extraction with the spectral FFP; (3) fusion of the spatial and spectral features as well as classification prediction with fully connected layers. Specifically, the light-weight spatial FFP module is a shallow classification framework that uses the blocks of VGGNet-16 [56]. It has only a two-level feature fusion pyramid based on MuAttFusion. In comparison, the trainable parameters of the light-weight spatial FFP module are less than one-third those of the heavy-weight spatial FFPNet. This is because of the small number of labeled samples in hyperspectral datasets: the more parameters a model has, the greater its capacity, but also the more labeled data is needed to prevent overfitting. Similarly, the spectral FFP module has a two-level feature fusion pyramid based on MuAttFusion, which reduces the number of parameters while capturing as much spectral information as possible.

Region Pyramid Attention Module
Currently, the soft attention-based methods mainly aim to capture long-range contextual dependencies on the basis of the non-local mechanism and its variants. However, the geometric size of different objects in remote sensing images varies significantly, so it is challenging to achieve regional dependencies of objects using existing models. Inspired by the ideas of the feature pyramid, we propose the region pyramid to address the target scale diversity. After this, we combine the region pyramid and self-attention to effectively establish dependencies between different object region features and relationships between local region features. We illustrate our approach via a simple schematic in Figure 3. We first generate a region pyramid by partitioning the input feature maps (left) into four groups and employing the self-attention mechanism to extract the regional dependence. Finally, the output of the RePyAtt module is obtained by summation ('SUM') of different region groups.

Region Pyramid
We partition the input feature maps into different regions via a chunk operation. The region block sizes defined in this article are {single-pixel level, 8 × 8 level, 4 × 4 level, 2 × 2 level, and 1 × 1 level}. In addition, we conduct an ablation study on different combinations, as detailed in Section 3.3.2. For each group of the pyramid, we first feed the region blocks into a global pooling layer to obtain the regional representations. Then, we concatenate the representations of the region blocks to generate a regional representation of the whole input feature. It is worth noting that the single-pixel level is sent directly to the self-attention module without the global pooling operation.
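The partition-and-pool step above can be sketched in a few lines of PyTorch (a minimal illustration under our own assumptions, not the authors' code; pooling each chunk of a G × G partition is equivalent to adaptive average pooling to a G × G grid):

```python
# Sketch of region-pyramid pooling: partition the feature maps into G x G
# blocks and reduce each block to one regional descriptor by average pooling.
import torch
import torch.nn.functional as F


def regional_representation(x: torch.Tensor, grid: int) -> torch.Tensor:
    """x: (B, C, H, W) -> (B, C, grid*grid) regional descriptors."""
    # Adaptive average pooling yields one statistic per region block,
    # equivalent to chunking the map and globally pooling each chunk.
    pooled = F.adaptive_avg_pool2d(x, output_size=(grid, grid))
    return pooled.flatten(2)  # concatenate block descriptors: (B, C, G*G)


x = torch.randn(1, 256, 32, 32)
# One representation per pyramid group (grid sizes follow the text):
reps = [regional_representation(x, g) for g in (8, 4, 2, 1)]
# The single-pixel level skips pooling and is used directly:
pixel_level = x.flatten(2)  # (B, C, H*W)
```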

Self-Attention on The Regional Representation
To exploit more explicit regional dependencies of objects, we compute self-attention representations within the regional representation. Self-attention consists of one 3 × 3 convolution and one 1 × 1 convolution, with F/2 and 1 channels, respectively, where F denotes the number of channels of the input feature maps. Further experiments show that the parallel form of attention-weighted representations of different region groups enhances the dependencies across different region features and the relationships between local region features more effectively than pixel-wise and channel-wise self-attention operators.
As illustrated in Group 3 of Figure 3, we first divide the input feature X into G (2 × 2) partitions. Then, we concatenate the point statistics after global pooling to obtain the regional representation X_m3 ∈ R^(F×G). We apply self-attention on X_m3 as follows:

A_m3 = f(W_1 * X_m3), Z_m3 = X_m3 ⊗ A_m3,

where A_m3 ∈ R^(1×G) is an attention matrix based on the global information across all spectral bands, and Z_m3 ∈ R^(F×G) is the weighted output features. W_1 denotes the combined operation of one 3 × 3 convolution and one 1 × 1 convolution, f(·) represents a 1 × 1 convolution, * denotes convolution, and ⊗ denotes region-wise multiplication. Finally, the output of the RePyAtt module is obtained by the weighted sum of the different region groups, which is formulated as

Y = Σ_{m=1}^{M} Up(Z_m),

where M represents the total number of groups in the region pyramid and Up(·) is the upsampling layer using nearest interpolation.
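A hedged sketch of the per-group regional self-attention is given below (realized here as 1-D convolutions over the region axis of the (B, F, G) representation; the sigmoid gate and exact wiring are our assumptions, not details stated in the paper):

```python
# Sketch of self-attention over G regional descriptors: a 3x3 conv with F/2
# channels followed by a 1x1 conv with 1 channel yields one weight per region.
import torch
import torch.nn as nn


class RegionAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # W_1: 3x3 conv reducing to F/2 channels (1-D over the region axis).
        self.w1 = nn.Conv1d(channels, channels // 2, kernel_size=3, padding=1)
        # f: 1x1 conv producing a single attention channel.
        self.f = nn.Conv1d(channels // 2, 1, kernel_size=1)

    def forward(self, x_m: torch.Tensor) -> torch.Tensor:
        # x_m: (B, F, G) regional representation of one pyramid group.
        a = torch.sigmoid(self.f(torch.relu(self.w1(x_m))))  # (B, 1, G)
        return x_m * a  # region-wise weighted features Z_m: (B, F, G)
```

In the full module, each group's weighted output would be upsampled back to the input resolution and the groups summed, per the formulation above.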

Multi-Scale Attention Fusion
The main task of the proposed MuAttFusion module is to effectively integrate multiscale spatial and spectral features of different objects in remote sensing images. MuAttFusion selectively fuses same-layer, high-layer, and low-layer features by an adaptive attention method as shown in Figure 4.

Higher-and Lower-scales
The lower-layer branch propagates spatial information from the lower layers (T < t) to the current layer (t) by the downsampling aggregation module (DAM). As shown in Figure 4a, to minimize memory consumption, we first use a 1 × 1 convolutional layer to compress the incoming feature maps. To achieve a consistent size for all feature maps, low-level features are downsampled to the feature size of the current layer by using bilinear interpolation. To fully use the entire feature information, all lower-layer feature maps are concatenated. Introducing the lower layers into the current layer inadvertently passes noise as well. To tackle this, high-level (T > t) contextual information is simultaneously propagated into the current layer by the upsampling aggregation module (UAM). The UAM structure is similar to that of the DAM, as shown in Figure 4b.
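The DAM's compress-resize-concatenate steps can be sketched as follows (a minimal PyTorch sketch; the class name, channel widths, and exact wiring are our assumptions rather than the authors' implementation):

```python
# Sketch of the downsampling aggregation module (DAM): compress each
# lower-layer map with a 1x1 conv, resize all maps to the current layer's
# spatial size with bilinear interpolation, then concatenate them.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DownsamplingAggregation(nn.Module):
    def __init__(self, in_channels: list, out_channels: int):
        super().__init__()
        # One 1x1 compression conv per incoming lower-layer feature map.
        self.compress = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)

    def forward(self, lower_feats, target_size):
        resized = [
            F.interpolate(conv(f), size=target_size,
                          mode="bilinear", align_corners=False)
            for conv, f in zip(self.compress, lower_feats)]
        # Concatenate all resized lower-layer maps along channels.
        return torch.cat(resized, dim=1)
```

The UAM would mirror this structure for high-level (T > t) features, as noted above.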

Attention Fuse Module
The lower-layer features, although refined, may contain some unnecessary background clutter, whereas in the higher-layer features, the detailed information may be oversuppressed in the current layer. To address these issues, we introduce an attention fuse (AttFuse) module, shown in Figure 4c. This module combines the features of the two branches by adaptive attention weights. Consider the two feature maps F_LL and F_HL; the attention module concatenates them and feeds them through a set of convolution layers (3 × 3 conv and 1 × 1 conv) and a sigmoid layer to produce an attention map with two channels, with each channel specifying the importance of the corresponding feature map. The attention map is calculated as follows:

A_f = sigmoid(W_2 * (W_1 * Concat[F_LL, F_HL])),

where W_1 and W_2 denote the 3 × 3 and 1 × 1 convolutions, respectively. The two channels of the attention map are then multiplied element-wise with the corresponding feature maps to produce the final higher- and lower-layer fusion feature map:

F_f = A_f^(1) ⊙ F_LL + A_f^(2) ⊙ F_HL,

where A_f^(1) and A_f^(2) are the two channels of A_f and ⊙ denotes element-wise multiplication. Finally, the output feature F̂_t of the MuAttFusion module is fused with the same-layer features F_SL produced by the RePyAtt module: F̂_t = Concat[F_SL, F_f]. It is worth noting that for the light-weight model, the output feature and the same-layer features are directly fused to reduce the model parameters.
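The AttFuse step can be sketched as below (a hedged illustration; the intermediate width `hidden` and the ReLU between the two convolutions are our assumptions):

```python
# Sketch of AttFuse: a 3x3 conv, a 1x1 conv, and a sigmoid turn the
# concatenated branches into a two-channel attention map; each channel
# weights its branch, and the weighted branches are summed.
import torch
import torch.nn as nn


class AttFuse(nn.Module):
    def __init__(self, channels: int, hidden: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2 * channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 2, kernel_size=1))

    def forward(self, f_ll: torch.Tensor, f_hl: torch.Tensor) -> torch.Tensor:
        # A_f: (B, 2, H, W), one importance channel per branch.
        a = torch.sigmoid(self.conv(torch.cat([f_ll, f_hl], dim=1)))
        # Element-wise weighting and summation of the two branches.
        return a[:, 0:1] * f_ll + a[:, 1:2] * f_hl
```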
To further refine the features and reduce the network parameters, the ResConv block shown in Figure 5 is introduced. The ResConv block consists of one 1 × 1 convolution and two 3 × 3 dilated convolutions, with dilation rates of 1 and 3. The 1 × 1 convolution reduces the number of channels, thereby reducing the network parameters. The two 3 × 3 dilated convolutions deepen the network to enhance its ability to capture sophisticated features.
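The ResConv block can be sketched as follows (channel counts are illustrative; the residual connection is our reading of the block's name rather than a detail stated above):

```python
# Sketch of ResConv: a 1x1 conv reduces channels, then two 3x3 dilated
# convolutions (rates 1 and 3) deepen the block without changing resolution.
import torch
import torch.nn as nn


class ResConv(nn.Module):
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # 1x1 conv: channel reduction to cut parameter count.
        self.reduce = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.body = nn.Sequential(
            nn.Conv2d(out_channels, out_channels, 3, padding=1, dilation=1),
            nn.ReLU(inplace=True),
            # dilation=3 with padding=3 keeps the spatial size unchanged.
            nn.Conv2d(out_channels, out_channels, 3, padding=3, dilation=3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.reduce(x)
        # Residual connection (our assumption) around the dilated body.
        return torch.relu(x + self.body(x))
```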

Adaptive-ASPP Module
Objects within a remote sensing image typically have different sizes. Existing multi-branch structures such as ASPP [10,11] and DenseASPP [57] learn features using filters with different dilation rates in order to adapt to these scales. However, these approaches share a common problem: different local regions may correspond to objects with different scales and geometric variations. Thus, spatially adaptive filters are desired at different scales to tackle the large feature variations in remote sensing images.
Toward this end, inspired by the MuAttFusion module described in Section 2.3, an adaptive-ASPP is designed to adapt to varied contents. The core of adaptive-ASPP is to adjust the combination weights for different contents in a feature-embedded space. CASINet [30] was proposed to solve this problem; it first uses a non-local operation to achieve adaptive information interaction across scales. However, the non-local operation [21] was designed to exploit long-range contexts for feature refinement, and its computational cost is high. The non-local operation is thus not well suited to cross-scale attention problems, which is also verified by our experiments. Different from the non-local operation, we propose a novel cross-scale attention (CrsAtt) module based on the self-attention mechanism.

Cross-Scale Attention Module
The structure of the proposed CrsAtt module is shown at the top of Figure 6. CrsAtt first uses two different scales to obtain the attention coefficients; then, it adaptively adjusts the feature weights of the different scales by element-wise multiplication of the input scale feature maps and the attention coefficients. As depicted in Figure 6, consider five intermediate feature maps, {X_1, X_2, X_3, X_4, X_5}, obtained from the five branches of the ASPP, with each X_i ∈ R^(H×W×C) (except X_5, which is obtained by image pooling of the features). Information interaction is performed across each of the four scale features {X_1, X_2, X_3, X_4}, with each scale being a feature node. Then, CrsAtt operations are performed on the four features. The feature of the ith scale is calculated as

X̂_i = X_i * σ_2(f_2(σ_1(f_1(Concat[X_i, X_j])))), j ≠ i,

where f_1(·) and f_2(·) denote convolution layers, σ_1 = max(0, x) and σ_2 = 1/(1 + e^(−x)) correspond to the ReLU and sigmoid activation functions, respectively, and * denotes channel-wise multiplication. Finally, the output of the adaptive-ASPP is obtained by concatenating {X̂_1, X̂_2, X̂_3, X̂_4, X_5}.
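A minimal sketch of the cross-scale re-weighting between one pair of ASPP branches is given below (the 1 × 1 convolution shapes `f1`/`f2` and the pairwise wiring are our assumptions; only the ReLU-then-sigmoid gating follows the description above):

```python
# Sketch of CrsAtt between two ASPP branches: ReLU (sigma_1) then sigmoid
# (sigma_2) produce per-channel coefficients from the concatenated pair,
# which then re-weight the i-th scale feature channel-wise.
import torch
import torch.nn as nn


class CrossScaleAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.f1 = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.f2 = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x_i: torch.Tensor, x_j: torch.Tensor) -> torch.Tensor:
        # Attention coefficients derived from the two interacting scales.
        a = torch.sigmoid(
            self.f2(torch.relu(self.f1(torch.cat([x_i, x_j], dim=1)))))
        return x_i * a  # channel-wise re-weighting of the i-th scale
```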

Heavy-Weight Spatial FFPNet Model
The heavy-weight spatial FFPNet model is achieved through the combination of the three attention-based modules introduced in the previous sections. The configurations of the three-level heavy-weight FFPNet are shown in Table 1. Concretely, consider an input image X ∈ R^(C×H×W), in which C, H, and W denote the number of channels, height, and width of the image, respectively. First, the image is fed into the ResNet-101 [54] pretrained on the ImageNet dataset [55] to generate different scale feature maps. In the first level of the pyramid, the features from the four stages of the backbone are fed into ResConv to generate different scale feature maps x_1, x_2, x_3, and x_4, each with 256 channels. In addition, the output of the backbone is fed into the adaptive-ASPP module to generate the feature map x_5, which adaptively combines these context features. In the second level of the pyramid, the intermediate features x_2 and x_3 are sent to RePyAtt based on region-based attention; x_6 and x_7 are then generated after MuAttFusion. In the third level of the pyramid, the final predicted segmentation map is generated after using MuAttFusion for x_6, x_7, and x_5 again. Furthermore, BA loss [47] is utilized to train the heavy-weight spatial FFPNet in an end-to-end manner to optimize the model parameters. Through a simple modification of the cross-entropy loss, the BA loss addresses the issue that the pixels surrounding object boundaries are hard to predict. Table 1. Three-level heavy-weight spatial FFPNet configurations. The module parameters are denoted as "module name(receptive fields of different convolutions)-number of modules-number of module output channels". Note that for some complex modules, only the module name is given.

Level — Detailed Configurations
First-level pyramid — parameters: 78.8 million
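To make the idea of a boundary-aware modification of cross entropy concrete, a hedged sketch is given below (the up-weighting scheme and the precomputed boundary mask are our assumptions for illustration, not the exact BA loss of [47]):

```python
# Sketch of a boundary-aware cross entropy: per-pixel CE is up-weighted on
# a precomputed boundary mask so that hard boundary pixels contribute more.
import torch
import torch.nn.functional as F


def boundary_aware_ce(logits: torch.Tensor, target: torch.Tensor,
                      boundary_mask: torch.Tensor,
                      boundary_weight: float = 2.0) -> torch.Tensor:
    """logits: (B, K, H, W); target: (B, H, W); boundary_mask: (B, H, W)."""
    ce = F.cross_entropy(logits, target, reduction="none")  # (B, H, W)
    # Pixels on the boundary mask get boundary_weight, others get 1.0.
    weights = 1.0 + (boundary_weight - 1.0) * boundary_mask.float()
    return (ce * weights).mean()
```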

Spatial-Spectral FFPNet Model
To maximize the use of hyperspectral spatial and spectral information, instead of dealing with the hypercube as a whole, the proposed spatial-spectral FFPNet model includes two CNN modules: the light-weight spatial FFP module for learning multiscale spatial features and the spectral FFP module for extracting spectral features along multiple dimensions. The features from the two modules are then concatenated and fed to a fully connected classifier to perform spatial-spectral classification.

Light-Weight Spatial Feature Fusion Pyramid Module
The light-weight spatial FFP module is a relatively shallow spatial feature extraction framework, which uses only VGGNet-16 [56] as the backbone; the configurations of the two-level light-weight spatial FFP module are shown in Table 2. Compared with the heavy-weight spatial FFPNet, the light-weight module has only 24.8 million trainable parameters, owing to the small number of labeled hyperspectral samples. Furthermore, MuAttFusion is utilized to fuse the useful features from x1, x2, and x3, which are generated by the backbone after ResConv. Table 2. Two-level light-weight spatial feature fusion pyramid configurations. The module parameters are denoted as "module name (receptive fields)-number of modules-number of module output channels".

Spectral Feature Fusion Pyramid Module
As shown in Figure 2, the spectral module can use multiple convolutional kernels to automatically extract features from fine to coarse scales as the convolutional layers progress. Similarly, to address the spectral redundancy of hyperspectral images, the spectral information can be compressed across different channel dimensions from coarse to fine scales, and the useful features at multiple scales are selected and merged by the attention mechanism. Thus, the spectral FFP module extracts the spectral features of hyperspectral data more effectively. The configurations of the two-level spectral FFP module are presented in Table 3. Specifically, the multiscale features can be divided into three stages by different channel depths of 64, 32, and 16. Every stage contains a 3 × 3 convolutional layer to reduce the feature dimension and a 1 × 1 convolutional layer to further enhance the expressive ability of the spectral features. MuAttFusion is then harnessed to extract and combine the useful features generated from the three stages. Table 3. Two-level spectral feature fusion pyramid configurations. The module parameters are denoted as "module name (receptive fields)-number of modules-number of module output channels".
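The three-stage channel compression described above can be sketched in PyTorch as follows (the 200-band input, padding, and activation placement are assumptions; the channel depths 64, 32, and 16 follow the text):

```python
import torch
import torch.nn as nn

def spectral_stage(in_ch, out_ch):
    """One spectral FFP stage (sketch): a 3x3 conv compresses the channel
    dimension, then a 1x1 conv further refines the spectral features."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=1),
        nn.ReLU(inplace=True),
    )

# coarse-to-fine channel compression: 200 bands -> 64 -> 32 -> 16
stages = nn.ModuleList([spectral_stage(200, 64),  # 200 bands (IP dataset)
                        spectral_stage(64, 32),
                        spectral_stage(32, 16)])

x = torch.randn(1, 200, 9, 9)   # one 9x9 hyperspectral patch
multi_scale = []
for stage in stages:
    x = stage(x)
    multi_scale.append(x)        # stage outputs kept for MuAttFusion
print([f.shape[1] for f in multi_scale])  # [64, 32, 16]
```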

In the spatial-spectral FFPNet, the last step is the combination of the output features of the light-weight spatial FFP and spectral FFP modules. The overall framework is shown in Figure 2. To effectively merge the spatial and spectral features and express the fused spatial-spectral features, the multiscale spatial features generated by the light-weight spatial FFP module and the multi-dimensional spectral features extracted by the spectral FFP module are first converted into one-dimensional tensors by a fully connected layer with the ReLU activation function. Then, the two types of features are merged directly through concatenation. Finally, another fully connected layer with the ReLU activation function further refines and represents the combined spectral-spatial features, and a softmax activation layer predicts the probability distribution over the classes.
Furthermore, to prevent the model from overfitting on limited hyperspectral datasets, the dropout method [58] is applied to the fully connected layers. Specifically, dropout randomly sets hidden neurons to zero with a probability of 0.5; the dropped neurons play no role in the forward and backward passes of the model.
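A minimal sketch of the fusion head described in the two paragraphs above (the hidden width, class count, and exact layer ordering are illustrative assumptions):

```python
import torch
import torch.nn as nn

class SpatialSpectralHead(nn.Module):
    """Fusion head sketch: flatten spatial and spectral features through
    FC+ReLU layers, concatenate, refine with another FC+ReLU, and classify
    with softmax. Dropout (p=0.5) regularizes the fully connected layers."""
    def __init__(self, spatial_dim, spectral_dim, hidden=256, num_classes=16):
        super().__init__()
        self.fc_spatial = nn.Sequential(nn.Linear(spatial_dim, hidden), nn.ReLU())
        self.fc_spectral = nn.Sequential(nn.Linear(spectral_dim, hidden), nn.ReLU())
        self.drop = nn.Dropout(p=0.5)  # dropout on fully connected layers [58]
        self.fc_fuse = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, spatial_feat, spectral_feat):
        a = self.drop(self.fc_spatial(spatial_feat.flatten(1)))
        b = self.drop(self.fc_spectral(spectral_feat.flatten(1)))
        fused = self.fc_fuse(torch.cat([a, b], dim=1))  # direct concatenation
        return torch.softmax(self.classifier(fused), dim=1)

# toy usage: a batch of 4 samples with assumed feature dimensions
head = SpatialSpectralHead(spatial_dim=512, spectral_dim=128)
probs = head(torch.randn(4, 512), torch.randn(4, 128))
print(probs.shape)  # torch.Size([4, 16])
```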

High-Resolution Datasets
The Vaihingen dataset consists of 3-band IRRG (infrared, red, and green) image data acquired by airborne sensors. There are 33 images with a spatial resolution of 9 cm, and the average size of each image is 2494 × 2064 pixels. All images are labeled with five foreground classes (impervious surfaces, buildings, low vegetation, trees, and cars) and one background class (see Figure 7). Following the online test setup, 16 images were used as the training set, while the remaining 17 images (Image IDs: 2, 4, 6, 8, 10, 12, 14, 16, 20, 22, 24, 27, 29, 31, 33, 35, 38) were used as the testing set. We randomly sampled 512 × 512 patches from the 33 images, and the patches were processed at the training stage with normalization, random horizontal flips, and Gaussian blur. Finally, 6200 patches were generated for the training set and 1000 patches for the testing set.
The Potsdam dataset is composed of 38 images with a spatial resolution of 5 cm and consists of IR, R, G, and B channels; the IRRG and RGB images are utilized for the segmentation model. All images are 6000 × 6000 pixels and are annotated with pixel-level labels of six classes corresponding to those of the Vaihingen dataset. To train and evaluate the heavy-weight spatial FFPNet, we also followed the data partition used in the benchmark methods: 24 images were selected as the training set and 14 images (Image IDs: 02_13, 02_14, 03_13, 03_14, 04_13, 04_14, 04_15, 05_13, 05_14, 05_15, 06_13, 06_14, 06_15, 07_13) as the testing set. We randomly sampled 512 × 512 patches from the original images, generating 14,000 patches for the training set and 3000 patches for the testing set. As with the Vaihingen dataset, the patches were processed at the training stage with normalization, random horizontal flips, and Gaussian blur.

Hyperspectral Datasets
The Indian Pines (IP) dataset was gathered by the AVIRIS sensor. The image contains 224 spectral channels covering the 400-2500 nm region of the visible and infrared spectra. Following the conventional setup, 24 spectral bands were removed owing to noise, and the remaining 200 bands were used in the experiments. The image is of size 145 × 145 pixels with a spatial resolution of 20 m per pixel, and its ground truth contains sixteen different land-cover classes, as shown in Figure 7. In total, 10,249 pixels were manually labeled according to the ground-truth map. The University of Pavia (UP) dataset was recorded by the ROSIS-03 sensor. The image consists of 610 × 340 pixels and 115 bands with a spectral coverage ranging from 0.43 to 0.86 µm. After removing noisy bands, 103 bands were used.
Nine classes of land covers were considered in the ground truth of this image, which are shown in Figure 7.

Baselines
In order to evaluate the heavy-weight spatial FFPNet segmentation model, we chose DeepLabv3+ [11] as our baseline. DeepLabv3+ fuses multiscale features by introducing low-level features to refine high-level features and thus achieves state-of-the-art performance on many public datasets. Furthermore, for the spatial-spectral classification model, the widely recognized deep convolutional neural network proposed by Paoletti et al. [40] was utilized as the baseline for hyperspectral image classification. This CNN is a 3-D network using spatial and spectral information, which performs accurately and efficiently on hyperspectral datasets.

Evaluation Metrics
To evaluate the performance of the proposed models for segmentation and classification of remote sensing images, the F1 score, Overall Accuracy (OA), Average Accuracy (AA), mean Intersection over Union (mIoU), and Kappa coefficient were used. First, to measure the heavy-weight spatial FFPNet performance, the F1 score was calculated for the foreground object classes; for a comprehensive comparison of the heavy-weight spatial FFPNet with different models, OA over all categories (including background) and mIoU were adopted. In addition, following a previous study [28], the ground truth with eroded boundaries was used for evaluation to reduce the impact of uncertain border definitions. Second, AA, OA, and the Kappa coefficient were used to measure the spatial-spectral FFPNet classification performance. Importantly, all reported results are the averages of three runs of each experiment on the training and testing sets.
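For reference, all five metrics can be computed from a single confusion matrix; a sketch with a toy 3-class matrix:

```python
import numpy as np

def metrics_from_confusion(cm):
    """Compute OA, AA, per-class F1, mIoU, and Kappa from a confusion
    matrix cm, where cm[i, j] counts pixels of true class i predicted as j."""
    cm = cm.astype(float)
    total = cm.sum()
    tp = np.diag(cm)
    oa = tp.sum() / total                        # overall accuracy
    recall = tp / cm.sum(axis=1)                 # per-class accuracy
    aa = recall.mean()                           # average accuracy
    precision = tp / cm.sum(axis=0)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (cm.sum(axis=1) + cm.sum(axis=0) - tp)
    miou = iou.mean()
    pe = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / total ** 2
    kappa = (oa - pe) / (1 - pe)                 # chance-corrected agreement
    return oa, aa, f1, miou, kappa

# toy 3-class confusion matrix (illustrative numbers, not paper results)
cm = np.array([[50, 2, 0],
               [3, 45, 2],
               [0, 1, 47]])
oa, aa, f1, miou, kappa = metrics_from_confusion(cm)
print(round(oa, 4))  # 0.9467
```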

Implementation Details
Stochastic gradient descent (SGD) was employed as the optimizer for the heavy-weight spatial FFPNet, with momentum = 0.9 and weight decay = 5 × 10−4. The initial learning rate was 2.5 × 10−4, and a poly learning rate policy was used for the optimizer. The mini-batch size was set to 4, and the maximum number of epochs was 10. In addition, batch normalization and the ReLU function were used in all layers, except the output layers, where softmax units were used. Furthermore, inspired by the baseline model (DeepLabv3+), the dropout method (with probabilities 0.5 and 0.1) was employed in the last layers of the decoder module to effectively avoid overfitting. We used PyTorch for the implementation on a high-performance computing cluster with one NVIDIA Titan RTX 24 GB GPU. The average training time for each experiment was approximately 20 h. Code will be made publicly available.
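The poly learning rate policy mentioned above is commonly implemented as lr = base_lr · (1 − iter/max_iter)^power; a minimal sketch (the power value 0.9 and the iteration counts are assumptions, as the paper does not state them):

```python
def poly_lr(base_lr, iteration, max_iter, power=0.9):
    """Poly learning-rate policy: decay from base_lr toward 0 over
    max_iter iterations. power=0.9 is a common choice (assumed here)."""
    return base_lr * (1 - iteration / max_iter) ** power

base_lr = 2.5e-4  # initial learning rate from the paper
print(poly_lr(base_lr, 0, 10000))      # 0.00025
print(poly_lr(base_lr, 5000, 10000))   # ~0.000134
```

In practice the returned value would be written into each optimizer parameter group before every iteration.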

Experiments on Vaihingen Dataset
Ablation study. In the proposed heavy-weight spatial FFPNet, three novel attention-based modules extract and integrate multiscale features adaptively and effectively in remote sensing images. To verify the effectiveness of these attention-based modules, extensive experiments under different settings were conducted, and the results are listed in Table 4. In addition, to study the adaptability of different combinations of region sizes to object features, we investigated different combinations of groups in the RePyAtt module, and the results are presented in Table 5.
As can be seen in Table 4, the three novel attention modules result in significant improvement over the baseline (DeepLabv3+ with ResNet-101). Specifically, the feature fusion pyramid framework yields an OA of 90.64% and an mIoU of 80.37%, improvements of 2.55% and 7.54%, respectively, over the baseline. However, employing the plain ASPP module in the framework leads to a slight decline in model performance, mainly because the ASPP module cannot handle context feature fusion well under the geometric variations in remote sensing images. By contrast, the adaptive-ASPP adapts to varied contents through the CrsAtt module, improving OA and mIoU over the baseline by 2.82% and 8.51%, respectively. This suggests that the adaptive-ASPP can be widely applied in other related models when appearance variations are large. Furthermore, introducing the BA loss improves the performance by approximately 0.20% in OA and 0.51% in mIoU compared with the CE loss. Overall, the novel heavy-weight spatial FFPNet is highly effective at handling the spatial object distribution diversity in remote sensing images.
We further studied the effects of different combinations of groups in the RePyAtt module. Table 5 shows that the performance is optimal and robust when the combination is set to {4 × 4 level, 2 × 2 level, and 1 × 1 level}, achieving the best OA and mIoU of 90.91% and 81.33%, respectively. In addition, more combinations of groups do not necessarily result in better model performance; an optimal combination captures the region-wise dependencies of objects more effectively and thus improves model performance.

Comparison with existing methods. To evaluate the effectiveness of the segmentation model, we compare our model with other leading benchmark models; the results are shown in Table 6. Specifically, FCNs [8] connect multiscale features through a skip architecture. DeepLabv3 [10] adopts the ASPP module with a global pooling operation to capture contextual features. UZ_1 [59] is a CNN model based on an encoder-decoder. Attention U-Net [24] fuses adjacent-layer features using attention mechanisms. DeepLabv3+ [11] fuses multiscale features by introducing low-level features to refine high-level features, building on DeepLabv3 [10]. RefineNet [60] recursively refines low-resolution semantic features with fine-grained low-level features to generate high-resolution semantic feature maps. S-RA-FCN [9] produces relation-augmented feature representations through spatial and channel relation modules. ONE_7 [61] fuses the outputs of two multiscale SegNets [62]. DANet [22] adaptively integrates local features with their global dependencies through two types of attention modules. GSN5 [12] utilizes entropy as a gate function to select features. DLR_10 [63] combines boundary detection with SegNet and FCN. PSPNet [64] exploits global context information through its pyramid pooling module. Importantly, most of these models adopt ResNet-101 as the backbone.
Table 6 indicates that the heavy-weight spatial FFPNet outperforms the other context aggregation and attention-based models in terms of OA. Specifically, the heavy-weight spatial FFPNet outperforms the baseline model (DeepLabv3+ [11]), with increases of 2.82% in OA and 8.51% in mIoU. Qualitative comparisons between our proposed model and the baseline model are provided in Figure 8. The quantitative and qualitative analyses indicate that our method outperforms DeepLabv3+ [11] by a large margin. Furthermore, compared with PSPNet [64], the heavy-weight spatial FFPNet achieves a 0.06% improvement in OA but slightly inferior results in some categories, such as low vegetation, trees, and cars. Compared with other high-performance models on the Vaihingen dataset (such as HMANet [28]), the performance of the heavy-weight spatial FFPNet could be further improved by adopting strategies such as data augmentation and left-right flipped counterparts during inference.

Experiments on Potsdam Dataset
In order to further validate the effectiveness of the segmentation model, we conducted experiments on the Potsdam dataset. Table 7 compares the proposed model with other excellent models, including DAT_6 [65], the end-to-end self-cascaded network CASIA3 [5], and the other methods used for the comparison on the Vaihingen dataset. Notably, the heavy-weight spatial FFPNet achieves an OA of 92.44% and an mIoU of 86.20%, improvements of 0.02% and 1.32%, respectively, over PSPNet. In addition, our F1 score for cars is much higher than that achieved by PSPNet and exceeds the second-best value, achieved by CCNet, by 1.03%. This demonstrates the effectiveness of our proposed model in handling multiscale feature fusion for the segmentation of remote sensing images. The qualitative comparison results are shown in Figure 9, where the third and fourth columns represent the results of the baseline and the proposed models, respectively; the visualization results in the fourth column are clearly better than those in the third column. Moreover, as Table 7 indicates, the proposed model shows improvements of 1.5% in OA and 4.51% in mIoU over DeepLabv3+ [11]. This further demonstrates that the heavy-weight spatial FFPNet can effectively extract and fuse the spatial features of remote sensing images, thereby improving the segmentation performance on high-resolution images.

Comparison between IRRG and RGB. High-resolution remote sensing images are generally RGB band data. Therefore, it is necessary to compare the two available input modes for the heavy-weight spatial FFPNet, namely IRRG and RGB. The last two rows of Table 7 show that the overall results of the two modes differ only slightly; specifically, the IRRG images improve the average performance by about 0.5% compared with the RGB images.
Again, the visualization results in Figure 10 show that the IRRG images can obtain better segmentation maps.

Implementation Details
Data preprocessing. When the hyperspectral images are divided into training and testing sets, the imbalance between categories complicates model training. For example, in the IP dataset, the "Oats" class has 20 labeled pixels, while the "Soybean-mintill" class has 2455 labeled pixels. To ensure the comparability of the data as much as possible, the same data processing strategy as the baseline [40] was used to deal with the category imbalance, that is, selecting the number of samples of each category appropriately. Importantly, some setups with fewer samples were added to highlight the superiority of the spatial-spectral FFPNet. Specifically, a maximum number of samples per category (50, 100, 150, or 200) was set as a threshold. For the richer classes, we simply selected samples up to the threshold; conversely, when the number of class samples was less than twice the threshold, 50% of the samples of the corresponding class were selected. Detailed training sample schemes for the IP dataset are listed in Table 8, and the same schemes are adopted for the UP dataset, as shown in Table 9. It is worth noting that the number of samples in both datasets is less than or equal to that in the baseline [40]; we conducted more sampling schemes for the subsequent experiments.
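The threshold-based sampling rule described above can be sketched as follows (the function name and the integer-division rounding are assumptions):

```python
def training_counts(class_sizes, threshold):
    """Per-category training-sample counts (sketch of the strategy above):
    richer classes are capped at the threshold; classes with fewer than
    twice the threshold contribute 50% of their samples."""
    counts = {}
    for name, n in class_sizes.items():
        if n < 2 * threshold:
            counts[name] = n // 2     # 50% of a scarce class
        else:
            counts[name] = threshold  # cap a rich class at the threshold
    return counts

# counts quoted in the text for two IP classes, threshold = 100
ip_sizes = {"Oats": 20, "Soybean-mintill": 2455}
print(training_counts(ip_sizes, 100))  # {'Oats': 10, 'Soybean-mintill': 100}
```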
Model setup. The proposed spatial-spectral FFPNet was initialized with two strategies. The backbone of the light-weight spatial FFP module was initialized with VGGNet-16 pretrained on ImageNet [55]. Importantly, the parameters (weights and biases) of the first convolutional layer in the pretrained network cover only three input channels, while the hyperspectral classification task requires p-channel inputs (for example, 200 channels for the IP dataset and 103 channels for the UP dataset). Therefore, similarly to CoinNet [66], we replicated the initialization parameters of the first convolutional layer of the pretrained network until the p input channels were covered. In contrast, the spectral FFP module was initialized with the Kaiming uniform distribution. In addition, unlike in the high-resolution experiments, the cross-entropy loss function was used to optimize the spatial-spectral FFPNet parameters because of the limited quantity of labeled hyperspectral samples. Batch normalization and ReLU were used in all layers, except the classifier layer. Adam [67] was employed as the optimizer with a learning rate of 0.001. The mini-batch size was set to 24 for both datasets, and the maximum epoch was 200. The experiments were conducted on a high-performance computing cluster with one NVIDIA Titan RTX 24 GB GPU.
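The first-layer weight replication can be sketched as follows (the exact replication scheme used in the paper and in CoinNet may differ; this sketch tiles the 3-channel weights and truncates to p channels):

```python
import torch

def expand_first_conv(weight, in_channels):
    """Replicate pretrained 3-channel first-layer conv weights along the
    input-channel axis until p channels are covered (sketch; the precise
    copy scheme is an assumption)."""
    reps = -(-in_channels // weight.shape[1])          # ceil division
    return weight.repeat(1, reps, 1, 1)[:, :in_channels]

w_rgb = torch.randn(64, 3, 3, 3)        # stand-in for pretrained conv1 weights
w_hsi = expand_first_conv(w_rgb, 200)   # 200 bands for the IP dataset
print(w_hsi.shape)  # torch.Size([64, 200, 3, 3])
```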

Ablation Study
In order to analyze the effectiveness of the proposed spatial-spectral FFPNet in hyperspectral image classification, two aspects are mainly considered. First, to analyze how the training set, including the number of samples and sample patch size, and data augmentation affect the performance of hyperspectral image classification, we conducted different ablation studies. Second, in order to analyze the impact of spatial and spectral models on the performance of hyperspectral image classification, the ablation experiments of the spatial FFPNet, spectral FFPNet, and spatial-spectral FFPNet were conducted for the IP and UP hyperspectral image datasets.
(1) Sample patch size. We conducted an ablation study for different sample patch sizes. For the IP dataset, patch sizes d = 9, 15, 19, and 29 were considered, and for the UP dataset, d = 9, 15, 21, and 27 were tested. The patch size determines the amount of spatial information that can be utilized for hyperspectral image classification. Table 10 shows the total training times and accuracy results with different patch sizes for a fixed number of samples (100) per category. On both datasets, as more pixels are added, more useful contextual spatial information can be utilized by the spatial-spectral FFPNet model. Thus, the model achieves better performance with more spatial information, while also requiring more training time. However, as the patch size is increased further, such as d = 27 for the UP dataset, the model performance slightly decreases, because a patch containing too many pixels of other classes can interfere with the target pixel. Specifically, for the IP data, d = 29 obtains the best performance, with an OA of 98.76%. However, the average training time for d = 29 is considerably longer than that for the other groups (almost twice that for d = 19). In terms of the accuracy-to-time ratio, d = 19 yields the best performance for the IP dataset, with an OA of 98.50% for an average training time of 14.66 min. For the UP dataset, d = 21 achieves the best accuracy-to-time ratio, with an OA of 98.82% for an average training time of 8.16 min. In addition, d = 15 requires the minimum training time (7.06 min) to achieve an acceptable accuracy (96.41%).
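Extracting the d × d neighborhood of a target pixel can be sketched as follows (reflect padding at the image border is an assumption; the paper does not state its border handling):

```python
import numpy as np

def extract_patch(image, row, col, d):
    """Cut a d x d spatial patch centered on the target pixel (row, col).
    The (H, W, bands) cube is reflect-padded so that border pixels also
    receive full patches."""
    half = d // 2
    padded = np.pad(image, ((half, half), (half, half), (0, 0)),
                    mode="reflect")
    return padded[row:row + d, col:col + d, :]

img = np.random.rand(145, 145, 200)     # IP-sized cube: 145x145, 200 bands
patch = extract_patch(img, 0, 0, 19)    # d = 19 neighborhood of pixel (0, 0)
print(patch.shape)  # (19, 19, 200)
```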
(2) Samples per category. To evaluate the impact of the number of samples per category on the model performance, experiments with different patch sizes and different numbers of training samples per category (50, 100, 150, and 200) were conducted.
The classification accuracy results in terms of OA, AA, and kappa coefficient obtained for the IP dataset are presented in Table 11. According to the results for each patch size (d = 9, 15, 19, and 29), as more training samples per category are added, the classification accuracy of the proposed model increases, and the training time also increases. Concretely, even when the number of samples is small, the model achieves superior classification results; with d = 9, 15, 19, and 29 and 50 samples per category, OA values of 82.12%, 86.44%, 89.79%, and 94.64%, respectively, are achieved. This confirms that the spatial-spectral FFPNet model can fully use multiscale spatial and spectral information to achieve robust and accurate end-to-end hyperspectral image classification with a small number of training samples. Furthermore, with 200 samples per category, the best OA value of 99.84% among all groups is achieved for d = 29, and the OA values for d = 9, 15, and 19 vary by no more than 0.8%. All groups with 200 samples per category attain values above 99%, which further shows the spatial-spectral FFPNet's robustness and its ability to express and extract multiscale features.

The qualitative results obtained for the IP dataset for patch sizes of d = 9, 15, 19, and 29 with 50, 100, 150, and 200 samples per category are provided in Figures 11-14. First, the visualization results of the confusion matrix for each category indicate that as more training samples are added, the color of the diagonal area becomes brighter, while the other areas become more uniformly blue; that is, the classification results of each class improve. In addition, as the patch size increases, the accuracy of each class increases.
However, when relatively adequate training samples are used in the network, the accuracy of each class is relatively similar (e.g., d = 9, 15, 19, and 29 with 200 samples per category, and d = 29 with 150 samples per category). Second, according to the classification maps acquired from each experiment, shown in Figures 11-14, the best results are achieved with 200 samples per category, especially for d = 19 and 29; these are the most similar to the ground-truth map of the IP image. Specifically, when the number of spatial pixels is small (d = 9, 15, and 19 with 50 samples per category), a small part of the middle pixels of some category areas can be misclassified, especially for d = 9 with 50 samples per category. However, when the number of training samples per category increases to 100, these middle pixels are accurately classified. Furthermore, when the number of training samples is smaller, a small number of pixels near the edges are easily misclassified; we call this the "boundary error effect". With increasing training samples, the boundary error effect gradually weakens and even disappears at 150 training samples per category. More importantly, the overall classification result is generally excellent even when the sample size is extremely small (i.e., for d = 9, 15, 19, and 29 with 50 samples per category), demonstrating that the proposed model can better address the problem of overfitting when fewer hyperspectral samples are available.

Table 12 lists the results for the UP dataset. For every patch size, as the training samples increase, the model performance gradually improves. Notably, the model performance is more sensitive when the sample size is exceedingly small. For example, for d = 9 and 15 with 50 samples per category, the OA for the UP dataset is less than 80%, while it improves to more than 90% when the samples per category are increased to 100.
Furthermore, the sensitivity of the model to small amounts of training data combined with a large number of model parameters can be alleviated by data enhancement methods such as random rotation and the addition of random noise. The effectiveness of data augmentation is discussed in the third ablation analysis.

(3) Data enhancement. We used random horizontal and vertical flips and random rotations (with angles 90°, 180°, and 270°) to enhance the small-scale hyperspectral datasets. To further test the effectiveness of the spatial-spectral FFPNet with data enhancement when training samples are severely limited, 50 samples per category were used for the IP and UP datasets. It is worth noting that, to highlight the superiority of the proposed model itself, data enhancement techniques were not used in the other experiments in this paper, because the baseline model [40] does not use data enhancement.
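The flip-and-rotate enhancement can be sketched as follows (the 50% flip probabilities are assumptions):

```python
import numpy as np

def augment_patch(patch, rng):
    """Randomly flip and rotate one hyperspectral patch (H, W, bands),
    using the transforms listed above: horizontal/vertical flips and
    rotations by 0, 90, 180, or 270 degrees."""
    if rng.random() < 0.5:
        patch = patch[:, ::-1, :]        # horizontal flip
    if rng.random() < 0.5:
        patch = patch[::-1, :, :]        # vertical flip
    k = rng.integers(0, 4)               # number of 90-degree rotations
    return np.rot90(patch, k, axes=(0, 1))

rng = np.random.default_rng(0)
patch = np.random.rand(9, 9, 200)        # one 9x9 patch with 200 bands
aug = augment_patch(patch, rng)
print(aug.shape)  # (9, 9, 200)
```

Since flips and 90-degree rotations only permute pixel positions, the spectral content of the patch is unchanged.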
The results of the data enhancement ablation study, with sample patch sizes of d = 9, 15, 19, and 29 and 50 samples per category on the IP and UP datasets, are shown in Tables 13 and 14, respectively. Clearly, the classification accuracy of the spatial-spectral FFPNet with data enhancement is significantly higher than that of the spatial-spectral FFPNet without it; the accuracy difference between the two models reaches 2-20% in terms of OA. The advantage of data enhancement is more pronounced for the UP dataset, owing to its lower ratio of training samples to total samples. Specifically, with d = 9, the difference between the spatial-spectral FFPNet with and without data enhancement is 8.20% for the IP dataset, while the corresponding difference for the UP dataset is 19.93%. As the spatial information increases (i.e., as the patch size increases), the performance advantage of the data-enhanced model gradually decreases. For example, in terms of OA, the data-enhanced model is 3.22% higher than the spatial-spectral FFPNet for the IP dataset with d = 29, and data enhancement improves the spatial-spectral FFPNet by 2.71% for the UP dataset with d = 27. Thus, when the labeled hyperspectral dataset is extremely small, the spatial-spectral FFPNet with data enhancement may be the best choice.
(4) Spatial FFPNet, spectral FFPNet, and spatial-spectral FFPNet. As mentioned in the method section, the spectral FFPNet focuses on the extraction and fusion of multiscale spectral features, while the spatial FFPNet focuses on the effective extraction and integration of contextual spatial features through the attention-based modules. The effects of spectral-only and spatial-only models on hyperspectral classification performance, as well as the effectiveness of the spatial-spectral FFPNet model, require further analysis. Thus, we conducted ablation studies on the spatial-only, spectral-only, and spatial-spectral FFPNet models. The spatial-only and spectral-only models correspond to the light-weight spatial FFP module and the spectral FFP module in Figure 2, each with a fully connected classifier.

Tables 15 and 16, respectively, present the results of a comparison of the different models on the IP and UP datasets with sample patch sizes of d = 9, 15, 21, and 27 and 100 samples per category. On both datasets, the performance of the spatial-spectral FFPNet model is significantly better than that of the spatial-only and spectral-only models, especially when a small amount of spatial information is considered. Specifically, when d = 9, for the IP dataset, the OA of the spatial-spectral model shows improvements of 3.15% and 6.77% over the spatial-only and spectral-only models, respectively. In addition, the spatial-only and spectral-only models are more unstable and have limited accuracy on the UP dataset when spatial information is restricted (when d = 9, the spatial-spectral model shows improvements of 12.39% and 33.26% over the spatial-only and spectral-only models, respectively).
Therefore, it is demonstrated that spatial information (the neighboring pixels) and spectral information should be simultaneously considered in the model to obtain an excellent classification result. Furthermore, the proposed spatial-spectral FFPNet model can effectively extract and fuse multiscale spatial and spectral features to achieve high classification accuracy even with a few training samples. However, Tables 15 and 16 indicate that when the number of training samples is sufficient, the performance gap between the three models is not very large, especially for the IP dataset.

Comparison with Existing CNN Methods
To further verify the effectiveness and superiority of the spatial-spectral FFPNet model, we compare it with some of the state-of-the-art and well-known CNN models developed in recent years for hyperspectral classification. The configurations and training settings of these models are briefly described as follows.

(1) CNNs by Chen et al. [41]: First, the model configuration includes 1-D, 2-D, and 3-D CNNs. The 1-D CNN consists of five convolutional layers with ReLU and five pooling layers for the IP dataset, and three convolutional layers with ReLU and three pooling layers for the UP dataset; it extracts only spectral information. The 2-D CNN contains three 2-D convolutional layers and two pooling layers; the latter two convolutional layers use the dropout strategy to prevent overfitting. The 3-D CNN is designed to effectively extract spatial and spectral information; it includes three 3-D convolution layers with ReLU nonlinear activation, and dropout is also used to prevent overfitting. Overall, this model represents an early application of DCNN methods to hyperspectral image classification. However, although the models are simple and effective, the spatial and spectral distribution diversities of hyperspectral datasets are not considered. Second, in the training settings, 1765 labeled samples were used as the training set for the IP dataset and 3930 samples for the UP dataset.

(2) CNN by Paoletti et al. [40]: The convolution layers are each followed by a ReLU function. To reduce the spatial resolution, the first two convolution layers are followed by 2 × 2 max pooling. In addition, to prevent overfitting, dropout is applied in the first two convolution layers of the model, with probability 0.1 for the IP dataset and 0.2 for the UP dataset. A four-layer fully connected network then classifies the extracted features.
Although the 3-D model requires fewer parameters and layers, it cannot address the diversity of spatial object distribution in hyperspectral data and cannot make full use of the spectral information. In addition, 3-D convolution treats hyperspectral data as uniform volumetric data, whereas the actual object distribution is asymmetrical.

(3) Attention networks [45]: A visual attention-driven mechanism applied to residual neural networks (ResNet) facilitates spatial-spectral hyperspectral image classification. Specifically, the attention mechanism is integrated into the residual part of ResNet and mainly comprises two parts, the trunk and the mask. The trunk consists of residual blocks that extract features from the data, while the mask consists of a symmetrical downsampler-upsampler structure that extracts useful features from the current layer. Although the attention mechanism is successfully applied to ResNet, this attention method does not solve the problems of spatial distribution (the different geometric shapes of objects) and spectral redundancy in hyperspectral data. Second, the network was optimized using 1537 training samples with 300 epochs for the IP dataset and 4278 training samples with 300 epochs for the UP dataset.

(4) Multiple CNN fusion [42]: Compared with other models, the multiscale spectral and spatial feature fusion model, although time-consuming, can achieve superior classification accuracy and has hence been gaining prominence in hyperspectral image classification. For example, Zhao et al. [42] presented a multiple convolutional layer fusion framework, which fuses features extracted from different convolutional layers for hyperspectral image classification. This multiple CNN model considers only the fusion of spatial features at different scales, not the effective extraction of spatial and spectral features at multiple scales.
Specifically, the multiscale spectral and spatial feature fusion model is divided into two types according to the fusion mechanism. The first is the side output decision fusion network (SODFN), which applies majority voting to the side classification maps generated by each convolutional layer. The other is the fully convolutional layer fusion network (FCLFN), which combines all features generated by each convolutional layer. Second, the SODFN and FCLFN parameters were tuned using 1029 training samples for the IP dataset and 436 training samples for the UP dataset.
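The SODFN-style decision fusion can be sketched in a few lines (a hypothetical NumPy illustration of per-pixel majority voting, not the code of [42]): each side output contributes one label map, and the fused map takes the most frequent label at every pixel:

```python
import numpy as np

def side_output_decision_fusion(side_maps):
    """Fuse per-layer classification maps by per-pixel majority voting
    (SODFN-style). side_maps: list of (H, W) integer label maps."""
    stacked = np.stack(side_maps)          # (n_sides, H, W)
    n_classes = stacked.max() + 1
    # For each pixel, count the votes across side outputs and keep the winner.
    return np.apply_along_axis(
        lambda v: np.bincount(v, minlength=n_classes).argmax(), 0, stacked)

maps = [np.array([[0, 1], [2, 2]]),
        np.array([[0, 1], [1, 2]]),
        np.array([[0, 0], [2, 2]])]
fused = side_output_decision_fusion(maps)
print(fused)  # [[0 1]
              #  [2 2]]
```

FCLFN, by contrast, would concatenate the feature tensors of all layers before a single classifier, fusing information at the feature level rather than the decision level.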
Indian Pines dataset benchmark evaluation. The classification results for the IP dataset obtained by the different CNN models and our proposed model are shown in Table 17. Clearly, the spatial-spectral FFPNet without data enhancement generates the highest OA, AA, and kappa coefficient and thus presents the best performance among all benchmark models. The proposed model also shows excellent performance in each class. The best result of the spatial-spectral FFPNet exceeds the best result of the baseline model [40] by 1.47% in terms of OA. Notably, the spatial-spectral FFPNet shows superior performance in all experimental configurations compared with the baseline model. In particular, when spatial information is limited (i.e., patch size d = 9), the proposed model outperforms the same configuration of the baseline model by 8.96% and even exceeds the best configuration of the baseline model (d = 29, training samples = 2466) by 0.7%. The results in Table 17 further demonstrate the superiority of the proposed spatial-spectral FFPNet and its robustness with a small number of training samples.
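For reference, the OA, AA, and kappa coefficient used throughout this evaluation can all be derived from a confusion matrix. The sketch below is a generic implementation under the standard definitions, with a made-up two-class matrix (not data from our experiments):

```python
import numpy as np

def classification_scores(conf):
    """OA, AA, and Cohen's kappa from a confusion matrix
    (rows = true classes, columns = predicted classes)."""
    conf = np.asarray(conf, dtype=float)
    n = conf.sum()
    oa = np.trace(conf) / n                           # overall accuracy
    aa = np.mean(np.diag(conf) / conf.sum(axis=1))    # mean per-class accuracy
    # Expected chance agreement from the row/column marginals.
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / n ** 2
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa

conf = np.array([[40, 10],
                 [5, 45]])
oa, aa, kappa = classification_scores(conf)
print(round(oa, 2), round(aa, 2), round(kappa, 2))  # 0.85 0.85 0.7
```

OA weights every pixel equally and therefore favors large classes, whereas AA averages the per-class accuracies; kappa additionally discounts agreement expected by chance, which is why all three are reported together.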
A graphical comparison of the OA values for the IP dataset obtained by the different CNN models is shown in Figure 15. The OA results obtained by our proposed model are presented in red, while those obtained by the benchmark models are presented in black. Clearly, our model shows more promising performance under different numbers of training samples compared with the well-known hyperspectral classification CNN models. Specifically, when the number of training samples is relatively small (600-800), the spatial-spectral FFPNet with data enhancement achieves outstanding performance. As the number of samples increases, the dominance of the spatial-spectral FFPNet over the other methods becomes even more pronounced. In addition, note that the attention networks [45] and multiple CNN fusion [42] perform better than the CNN models of [40,41]; this is in line with the current development trend of hyperspectral classification, that is, the application of multiscale feature fusion and attention mechanisms in the spatial and spectral dimensions.

Figure 15. Comparison of the OA obtained by different CNN models for the IP dataset. The abscissa represents the total number of training samples (600-2500), and the ordinate represents the OA (%) of the CNN models. The figure mainly compares the accuracy of the existing CNN methods and the proposed spatial-spectral FFPNet under different numbers of training samples. In the legend, different shapes represent different methods for hyperspectral classification; within the same shape, different colors indicate different configurations of the same method. The OA results obtained by our proposed model are presented in red, and those obtained by the CNNs [41], the CNN of [40], the attention networks [45], and the multiple CNN fusion [42] are presented in black.

University of Pavia dataset benchmark evaluation. The spatial-spectral FFPNet also shows a strong performance in terms of OA compared with the baseline [40] under the same experimental configuration.
Furthermore, the performance of the spatial-spectral FFPNet with a smaller patch size (d = 9) differs only slightly from that of the baseline model [40] with d = 15. Notably, the attention networks [45] perform the best among all models in terms of the OA and AA results because of their sufficient training samples (4278), while our proposed model performs the best in terms of the kappa coefficient. Figure 16 provides a graphical comparison of the OA of the different CNN models for the UP dataset. Again, our model attains more homogeneous and favorable classification results. As Figure 16 shows, the results of our proposed model are concentrated around 400-1900 training samples. However, the local OA comparison chart indicates that the multiple CNN fusion [42] is superior (98.17%) even for an extremely small number of training samples (approximately 400). As the training samples increase, the superiority of the spatial-spectral FFPNet gradually gains prominence. In addition, for a large number of training samples (>3000), the attention networks [45] perform considerably better than the traditional CNN models [41].

Figure 16. Comparison of the OA obtained by different CNN models for the UP dataset. The abscissa represents the total number of training samples (400-4400), and the ordinate represents the OA (%) of the CNN models. The figure mainly compares the accuracy of the different models (the existing CNN methods and the proposed spatial-spectral FFPNet) under different numbers of training samples. In the legend, different shapes represent different methods for hyperspectral classification; within the same shape, different colors indicate different configurations of the same method. The OA results obtained by our proposed model are presented in red, and those obtained by the existing CNNs [40,41], attention networks [45], and multiple CNN fusion [42] are presented in black.

Conclusions
In this study, we mainly focused on spatial object distribution diversity and spectral information extraction, which are the major challenges of high-resolution and hyperspectral remote sensing images. To address the spatial and spectral problems, three novel and practical attention-based modules were proposed: attention-based multiscale fusion, region pyramid attention, and adaptive-ASPP. We constructed different forms of feature fusion pyramid frameworks (two-layer or three-layer pyramids) by combining these attention-based modules. First, we developed a new semantic segmentation framework for high-resolution images, called the heavy-weight spatial FFPNet. Second, for the classification of hyperspectral images, an end-to-end spatial-spectral FFPNet was presented to extract and fuse multiscale spatial and spectral features. The experiments conducted on two high-resolution datasets demonstrated that the proposed heavy-weight spatial FFPNet achieves excellent segmentation accuracy. Detailed ablation studies further revealed the superiority of the three attention-based modules in processing the spatial distribution diversity of remote sensing images. Furthermore, a detailed training parameter analysis and a comparison with other state-of-the-art CNNs (such as [40]) were performed on the two hyperspectral datasets. The results demonstrated that the spatial-spectral FFPNet is more robust and achieves greater accuracy when the number of training samples in the hyperspectral dataset is small, and that it can obtain state-of-the-art results under different numbers of training samples. Overall, the proposed methods can serve as a new baseline for remote sensing image segmentation and classification. In future work, we will focus on few-shot or zero-shot segmentation and classification in high-resolution and hyperspectral remote sensing data to promote the practical application of deep learning in remote sensing image perception.

Code and Model Availability
The algorithms and trained models (including the heavy-weight spatial FFPNet for segmentation of high-resolution remote sensing images and the spatial-spectral FFPNet for classification of hyperspectral images) are publicly available on GitHub under a GNU General Public License (https://github.com/xupine/FFPNet).