A New CBAM-P-Net Model for Few-Shot Forest Species Classification Using Airborne Hyperspectral Images

Abstract: High-precision automatic identification and mapping of forest tree species composition is an important part of forest resource survey and monitoring. Airborne hyperspectral images contain rich spectral and spatial information, which makes high-precision classification and mapping of forest tree species possible. Few-shot learning, as an application of deep learning, has become an effective method of image classification. Prototypical networks (P-Net) is a simple and practical deep learning network with significant advantages in solving few-shot classification problems. Considering the high band correlation and large data volume of airborne hyperspectral images, fully extracting effective features while filtering or reducing redundant ones is the key to improving the classification accuracy of P-Net and to obtaining a high-precision forest tree species classification model with limited samples. In this research, we embedded the convolutional block attention module (CBAM) between the convolution blocks of P-Net to construct CBAM-P-Net, proposing a method that improves the feature extraction efficiency of P-Net, although it makes the network more complex and increases the computational cost to a certain extent. The results show that the Channel First combination strategy for CBAM greatly improves the feature extraction efficiency of the model. Across different sample windows, CBAM-P-Net achieves an average increase of 1.17% in testing overall accuracy (OA) and 0.0129 in kappa coefficient (Kappa). The optimal classification window is 17 × 17, where OA reaches 97.28% and Kappa reaches 0.97, an increase of 1.95% and 0.0214, respectively, compared with P-Net, at a cost of just 49 s more training time.
Therefore, using a suitable sample window and applying the proposed CBAM-P-Net to classify airborne hyperspectral images can achieve high-precision classification and mapping of forest tree species.


Introduction
Fine-grained tree species classification is the basis of forest management planning and interference monitoring, and is conducive to the scientific management and effective use of forest resources. The numerous continuous narrow spectral bands and high spatial resolution of hyperspectral images (HSI) provide a wealth of available spectral information for each pixel in land cover mapping [1].
Attention plays an important role in human perception. A significant characteristic of the human visual system is that it does not try to process the entire scene immediately, but selectively focuses on the salient parts in order to better capture the visual structure. Attention can be directed to the focal point, and the expression ability can be improved by using the attention mechanism, that is, by focusing on important features and suppressing unnecessary ones. The convolutional block attention module (CBAM) is a simple and effective attention module for feedforward convolutional neural networks [32]. Given an intermediate feature map, CBAM sequentially infers attention maps along two separate dimensions (channel and spatial) [32] and then multiplies the attention map by the input feature map for adaptive feature refinement [33]. Since CBAM is a lightweight general-purpose module, it can be seamlessly integrated into any convolutional neural network (CNN) architecture [25,34] with negligible overhead, and it can be trained end-to-end together with the base CNN [35][36][37]. Applying CBAM to tree species classification of hyperspectral images aims to overcome the dimensional dilemma, adaptively reduce the impact of redundant bands on classification, and achieve precise and efficient feature extraction so as to improve classification performance.
Matching networks use recent advances in attention to achieve fast learning. A matching network is a weighted nearest-neighbor classifier applied within an embedding space. During training, the model imitates the test scenario of the few-shot task by subsampling the class labels and samples [19]. Training establishes the relationship, or mapping, between labels and samples in the training set, which is then applied directly to the test set in the same way. Prototypical networks is a special case of matching networks with a simple network structure that does not require complex hyper-parameters, instead guiding the learning of new tasks using prior knowledge and experience [38]. Moreover, it has great potential for solving few-shot classification problems. Compared with matching networks, it has fewer parameters and is more convenient to train. However, for the classification of hyperspectral images, the general prototypical networks structure is simple, and weak model generalization is prone to occur [39].
Considering the large and fine-grained spatial and spectral characteristics of airborne hyperspectral images, we are faced with two major challenges:

1. Extracting effective features for classification based on a large amount of spatial and spectral information of hyperspectral images.
2. Obtaining a high-precision forest tree species classification model with limited samples.
In this study, we proposed a CBAM-P-Net model by embedding a CBAM module into the prototypical networks. We analyzed the influence of the convolutional attention module on network efficiency and results, optimized the prototypical networks structure and tuning parameters, proposed a training sample size and method suitable for tree species classification based on airborne hyperspectral data, and discussed the classification performance of CBAM-P-Net on hyperspectral images under few-shot conditions. The results show that our method improved the efficiency of feature extraction, although it makes the network more complex and increases the computational cost to a certain extent.

Study Area
The study area is a sub-area of the Jiepai branch of Gaofeng Forest Farm in Nanning City, Guangxi Province, China (108°22′1″~108°22′30″ E, 22°57′42″~22°58′13″ N), belonging to the subtropical monsoon climate, with an area of 74.1 hm². The average temperature is 21.6 °C, the average annual precipitation is 1200-1500 mm, and the average relative humidity is 79%. It is a hilly landform with an altitude of 149-263 m and a slope of 6~35°. The forest composition and structure in the study area have the typical characteristics of subtropical forests, with diverse tree species, fragmented and irregular distribution, complex tree structure, and a varied and luxuriant understory vegetation, which brings challenges to the classification of tree species. This paper classifies 11 categories in the study area, including 9 tree species, cutover land, and road. Of these, the coniferous species include Cunninghamia lanceolata (C. lanceolata, CL), Pinus elliottii (P. elliottii, PE), and Pinus massoniana (P. massoniana, PM), and the broadleaf species include Eucalyptus urophylla (E. urophylla, EU), Eucalyptus grandis (E. grandis, EG), Castanopsis hystrix (C. hystrix, CH), Acacia melanoxylon (A. melanoxylon, AM), Mytilaria laosensis (M. laosensis, ML), and other soft broadleaf species (SB). C. lanceolata, P. elliottii, P. massoniana, A. melanoxylon, and M. laosensis occur as mixed forests, and the remainder are pure forests. Exploring a high-precision classification method for tree species in the study area has important guiding significance for the classification and mapping of forest stands with complex structures and composition.

Airborne Hyperspectral Data
The hyperspectral data acquisition was conducted on 13 January and 30 January 2019, under cloudless conditions at noon, using the CAF's (Chinese Academy of Forestry) LiCHy (LiDAR, CCD and Hyperspectral) system integrated by the German company IGI, which includes a LiDAR sensor (LMS-Q680i, produced by RIEGL), a CCD camera (DigiCAM-60), an AISA Eagle II hyperspectral sensor (produced by SPECIM, Finland), and an inertial measurement unit (IMU). The aircraft had a flying speed of 180 km/h, a relative altitude of 750 m, an absolute altitude of 1000 m, and a course spacing of 400 m. The hyperspectral data are radiance data after radiometric calibration and geometric correction, containing 125 bands with a wavelength range of 400-987 nm, a spectral resolution of 3.3 nm, and a spatial resolution of 1 m. Table 2 summarizes the detailed parameters of the hyperspectral sensor. The Quick Atmospheric Correction (QUAC) method was used to perform atmospheric correction on the hyperspectral images to eliminate the interference of illumination and the atmosphere on the reflectance of ground objects. Due to the complex terrain of the study area, the brightness values of the images were uneven, so the hyperspectral image was terrain-corrected with a DEM derived from the synchronously acquired LiDAR data, which eliminated the changes in the image radiance values caused by the undulation of the terrain. Savitzky-Golay (SG) filtering [40] was used to smooth the spectral data and effectively remove the noise caused by various factors.
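SciPy's `savgol_filter` offers a comparable implementation of this smoothing step; the window length and polynomial order below are illustrative assumptions, not the paper's (unreported) settings.

```python
import numpy as np
from scipy.signal import savgol_filter

def smooth_spectra(cube, window_length=7, polyorder=2):
    """Smooth each pixel's spectrum along the band axis of a
    hyperspectral cube of shape (H, W, B), B = number of bands
    (125 here). window_length/polyorder are illustrative values."""
    return savgol_filter(cube, window_length, polyorder, axis=-1)

# Example: a noisy linear spectrum is pulled back toward its trend.
rng = np.random.default_rng(0)
trend = np.linspace(0.0, 1.0, 125)[None, None, :]
cube = trend + rng.normal(0.0, 0.05, (2, 2, 125))
smoothed = smooth_spectra(cube)
```

Because SG filtering fits a low-order polynomial in each window, it suppresses high-frequency noise while preserving the local shape of absorption features better than a plain moving average.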

Field Survey Data
The field data survey was conducted at the Jiepai branch of Gaofeng Forest Farm from 16 January to 5 February 2019. First, through visual interpretation of GF-2 satellite images with a resolution of 1 m, sample plots were set up with a uniform distribution. Ten plots of 25 m × 25 m and nine plots of 25 m × 50 m were laid out, of which seven were C. lanceolata pure forests, three were E. urophylla pure forests, three were E. grandis pure forests, and the remaining six were other forest stands and mixed forests. The tree species mainly included C. lanceolata, P. massoniana, E. urophylla, E. grandis, C. hystrix, etc., with a total of 1657 trees. In each plot, every tree was positioned using the China Sanding STS-752 Series Total Station, and attributes including tree species, tree height, crown width, branch height, and diameter at breast height were measured. At the same time, for areas where plots could not be set up due to the complex terrain, a field positioning survey was conducted with a handheld GPS, with 10-20 points for each tree species.
In order to keep the number of sample points of each feature category consistent and evenly distributed, for categories with too many sample points, the points located at plot edges and that were too densely distributed were deleted. For categories where the number of sample points was too few, based on field survey GPS location points and sample site survey data, combined with a 0.2 m resolution digital orthophoto map (DOM) and forest sub-compartments survey data, the sample points were manually marked on the image of the study area. In this way, 112 sample points were obtained for each category, a total of 1232 sample points (Figure 1).

Sample Data and Prototypical Networks Construction
In previous research, we produced a complete set of sample data and constructed the classification framework of the prototypical networks [39]. The sample data set takes the hyperspectral images as the data source, is centered on the image coordinates corresponding to each measured point's latitude and longitude, and is clipped with different window sizes through the open-source framework GDAL. The window size starts from 5 × 5 and increases with a step of 2 pixels until 31 × 31, at which point the clipping area exceeds the study area. Finally, a sample data set (11 classes, 112 samples in each class, a total of 1232 samples) consistent with the number of sample points was obtained for each window size. The sample data were divided into training samples and test samples according to a ratio of 80% to 20%.
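A minimal numpy sketch of the clipping step: the paper clips georeferenced imagery with GDAL, whereas this version assumes the sample points have already been converted to (row, col) pixel coordinates.

```python
import numpy as np

def clip_windows(image, points, window):
    """Clip square patches of size (window, window) centered on each
    sample point. `image` is (H, W, B); `points` is a list of (row, col)
    pixel coordinates. Points whose window would extend past the image
    edge are skipped."""
    half = window // 2
    H, W, _ = image.shape
    patches = []
    for r, c in points:
        if half <= r < H - half and half <= c < W - half:
            patches.append(image[r - half:r + half + 1, c - half:c + half + 1])
    return np.stack(patches)

image = np.zeros((100, 100, 125))           # stand-in for the 125-band image
points = [(50, 50), (10, 90), (0, 0)]       # (0, 0) falls outside a 17x17 window
patches = clip_windows(image, points, 17)
```

For the odd window sizes used here (5 × 5 up to 31 × 31), the measured point always sits on the exact center pixel of its patch.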
The classification principle of the prototypical networks is that the points of each class cluster around a prototype. Specifically, the neural network learns a nonlinear mapping of the input into an embedding space and uses the average of the support set as the prototype of its class in that space. Next, the nearest class prototype is found to classify the embedded query points [39]. The classification framework of the prototypical networks is shown in Figure 2, which mainly includes three parts: sample data input, image feature extraction, and distance measurement and classification. In the prototypical networks, the sample data are divided into a support set and a query set. The support set is used to calculate the prototype, and the query set is used to optimize the prototype. If there are A classes with B samples per class in the support set, the setting is A-way-B-shot. The image feature extraction part constructs the embedding function (f_φ: R^D → R^M, where φ is the learning parameter) to calculate the M-dimensional representation of each sample, that is, the image feature, and each class prototype (c_k ∈ R^M) is the mean of the feature vectors obtained by applying the embedding function to the support set samples of its class. The square of the Euclidean distance is used to construct a linear classifier. After projecting the samples into the embedding space, the prototypical networks use the distance function to calculate the distance from each query point x to the prototypes and then use softmax to calculate the probability of belonging to each category.
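The prototype-and-distance rule above can be sketched in a few lines of numpy. This is a sketch of the classification rule only, not the paper's network; the embedding vectors stand in for outputs of the embedding function f_φ.

```python
import numpy as np

def prototypes(support_emb, support_lab, n_way):
    """Class prototype c_k = mean of the support embeddings of class k."""
    return np.stack([support_emb[support_lab == k].mean(axis=0)
                     for k in range(n_way)])

def classify(query_emb, protos):
    """Softmax over negative squared Euclidean distances to prototypes."""
    d2 = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    logits = -d2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy 2-way-2-shot episode in a 2-D embedding space.
emb = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
lab = np.array([0, 0, 1, 1])
protos = prototypes(emb, lab, 2)                 # [[0, 1], [4, 1]]
probs = classify(np.array([[1.0, 1.0]]), protos)
```

The query at (1, 1) lies much closer to the class-0 prototype, so nearly all probability mass lands on class 0; the negative log-likelihood of such probabilities is exactly the loss minimized during episodic training.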
Remote Sens. 2021, 13, 1269
This research uses slice data of H × W × C (Height × Width × Channels) as the input of the prototypical networks. The image feature extraction architecture is composed of different numbers of convolution blocks (Conv Block 1 ... Conv Block N, Last Conv Block) according to the size of the clipped data window. Each convolution block includes a convolution layer (Conv2d, output dimension F = 64, convolution kernel 3 × 3), a batch normalization layer (Batch_norm), a non-linear activation function (ReLU), and a maximum pooling layer (Max_pool2d, pooling kernel 2 × 2). At the same time, in order to avoid model overfitting, L2 regularization (α = 0.001) of the convolution kernel is added to the convolution layer, and Dropout (Keep_prob = 0.7) is added after the maximum pooling layer. Feature values are processed through the fully connected layer (Flatten) and softmax as the basis for classification. The same embedding function is applied to both the support set and the query set, and their outputs serve as input parameters for the loss and accuracy calculations. All models are trained through Adam-SGD.
The initial learning rate is 10^-4, and the learning rate is halved every 2000 training iterations. Euclidean distance is used as the measurement function, and the negative log-likelihood loss function is used to train the prototypical networks.
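The per-block computation described above (3 × 3 convolution with F = 64 filters, batch normalization, ReLU, 2 × 2 max pooling) can be sketched as a minimal numpy forward pass. The training-time pieces (L2 weight penalty, dropout, Adam-SGD) are omitted, and the step-decay helper assumes "every 2000" refers to training iterations.

```python
import numpy as np

def conv_block(x, kernels, eps=1e-5):
    """One embedding-network block: 3x3 'same' convolution -> batch
    norm -> ReLU -> 2x2 max pooling. x: (H, W, C_in);
    kernels: (3, 3, C_in, F) with F = 64 in the paper."""
    H, W, _ = x.shape
    F = kernels.shape[-1]
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, F))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + 3, j:j + 3, :]            # 3x3 receptive field
            out[i, j] = np.tensordot(patch, kernels, axes=3)
    out = (out - out.mean((0, 1))) / np.sqrt(out.var((0, 1)) + eps)   # batch norm
    out = np.maximum(out, 0)                                          # ReLU
    H2, W2 = H // 2, W // 2
    return out[:H2 * 2, :W2 * 2].reshape(H2, 2, W2, 2, F).max((1, 3))  # 2x2 max pool

def learning_rate(step, base=1e-4, halve_every=2000):
    """Step decay: halve the base rate every `halve_every` iterations."""
    return base * 0.5 ** (step // halve_every)

x = np.random.default_rng(1).normal(size=(17, 17, 125))   # one 17x17 sample
k = np.random.default_rng(2).normal(size=(3, 3, 125, 64))
y = conv_block(x, k)
```

Each block halves the spatial size, so a 17 × 17 × 125 input becomes 8 × 8 × 64 after one block; stacking blocks until the spatial size collapses yields the M-dimensional embedding.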

Convolutional Block Attention Module

Convolutional Block Attention Module (CBAM) is an attention module that combines the spatial and channel dimensions (Figure 3). CBAM can achieve better results than SENet [41] because the attention mechanism adopted by the latter only focuses on channels. In addition, MaxPool is added to the network structure of CBAM, which compensates to a certain extent for the information lost by AvgPool. As shown in Figure 3, taking the feature map (F ∈ R^(H×W×C)) extracted by the CNN as input, CBAM sequentially obtains a one-dimensional channel attention feature map (M_c ∈ R^(1×1×C)) and a two-dimensional spatial attention feature map (M_s ∈ R^(H×W×1)). The entire attention process can be summarized as:

F′ = M_c(F) ⊗ F,  F″ = M_s(F′) ⊗ F′

The symbol ⊗ denotes element-wise multiplication, and F′ represents the new feature obtained through channel attention, which is used as the input of spatial attention. Finally, the feature F″ of the entire CBAM output is obtained. The soft mask mechanism proposed by Wang et al. [42] can guarantee a better performance of the attention module. The equations are modified as:

F′ = (1 + M_c(F)) ⊗ F,  F″ = (1 + M_s(F′)) ⊗ F′

Channel Attention
Producing a channel attention map by exploiting the inter-channel relationship of the features (Figure 4). In order to effectively calculate channel attention, the spatial size of the input feature map needs to be compressed, and average pooling and maximum pooling are commonly used. It can be seen from Figure 4 that the module takes the feature map as input and obtains the features (F^c_max ∈ R^(1×1×C), F^c_avg ∈ R^(1×1×C), where C represents the number of channels) through spatial-based global maximum pooling and global average pooling. Then, the features are passed through a multi-layer perceptron (MLP) composed of two dense layers, the MLP outputs are summed element-wise, and the sigmoid activation function generates the channel attention feature map (M_c ∈ R^(1×1×C)). The attention map is multiplied element-wise with the input feature to obtain a new feature (F′). The calculation is indicated by Formula (5):

M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F^c_avg)) + W_1(W_0(F^c_max)))   (5)
where σ represents the sigmoid activation function, W_0 ∈ R^(C/r×C) is the weight of the first hidden layer in the MLP, r is the feature compression rate, and W_1 ∈ R^(C×C/r) is the weight of the second hidden layer in the MLP.
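A minimal numpy sketch of Formula (5)'s pipeline (global pooling, shared two-layer MLP, element-wise sum, sigmoid). For convenient right-multiplication, the weights here are stored transposed relative to the W_0 ∈ R^(C/r×C), W_1 ∈ R^(C×C/r) notation above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(F, W0, W1):
    """Channel attention step: squeeze the spatial dimensions with
    global max and average pooling, feed both vectors through a shared
    two-layer MLP, sum element-wise, apply a sigmoid, and re-weight the
    input channels. F: (H, W, C); W0: (C, C/r); W1: (C/r, C)."""
    f_max = F.max(axis=(0, 1))                     # F^c_max, shape (C,)
    f_avg = F.mean(axis=(0, 1))                    # F^c_avg, shape (C,)
    mlp = lambda v: np.maximum(v @ W0, 0) @ W1     # two dense layers, ReLU between
    Mc = sigmoid(mlp(f_max) + mlp(f_avg))          # M_c, shape (C,)
    return F * Mc[None, None, :]                   # F' = M_c(F) ⊗ F

rng = np.random.default_rng(0)
F = rng.normal(size=(5, 5, 8))
W0 = rng.normal(size=(8, 2))                       # compression rate r = 4
W1 = rng.normal(size=(2, 8))
F1 = channel_attention(F, W0, W1)
```

Because the sigmoid keeps every channel weight in (0, 1), the output never amplifies a channel, only attenuates the less informative ones.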


Spatial Attention
Generating a spatial attention map by utilizing the inter-spatial relationship of the features (Figure 5). First, average pooling and maximum pooling operations along the channel axis are applied to generate the corresponding feature maps, which are concatenated along the channel axis to form an effective feature descriptor. On this basis, a convolutional layer is applied to generate the spatial attention feature map (M_s ∈ R^(H×W×1)), which is multiplied by the input feature to obtain a new feature (F″). The calculation is indicated in Formula (6):

M_s(F′) = σ(f^(7×7)([AvgPool(F′); MaxPool(F′)]))   (6)
where σ represents the sigmoid activation function, and f^(7×7) is a convolution operation with a kernel size of 7 × 7.
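Correspondingly, a numpy sketch of the spatial attention step: channel-wise average and max pooling, a single 7 × 7 convolution over the stacked maps, and a sigmoid. `convolve2d` flips the kernel (true convolution rather than cross-correlation), which is immaterial for a learned kernel.

```python
import numpy as np
from scipy.signal import convolve2d

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(F, kernel):
    """Spatial attention step: pool along the channel axis to get two
    (H, W) maps, combine them through one 7x7 convolution (kernel:
    (7, 7, 2), one slice per pooled map), apply a sigmoid, and scale
    the input feature element-wise. F: (H, W, C)."""
    f_avg = F.mean(axis=2)                                   # (H, W)
    f_max = F.max(axis=2)                                    # (H, W)
    conv = (convolve2d(f_avg, kernel[:, :, 0], mode="same")
            + convolve2d(f_max, kernel[:, :, 1], mode="same"))
    Ms = sigmoid(conv)                                       # M_s, shape (H, W)
    return F * Ms[:, :, None]                                # F'' = M_s(F') ⊗ F'

rng = np.random.default_rng(0)
F = rng.normal(size=(9, 9, 8))
kernel = rng.normal(size=(7, 7, 2)) * 0.1
F2 = spatial_attention(F, kernel)
```

The large 7 × 7 receptive field lets the map weigh each pixel by its spatial context, complementing the purely per-channel weighting of the previous step.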
As shown in Figure 6, this study inserts CBAM between the convolution blocks of the prototypical networks to construct CBAM-P-Net. CBAM focuses on channel and spatial features. In addition, the soft mask mechanism is used in each convolution block attention sub-module to ensure the performance of the model.
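The soft mask mechanism used in each attention sub-module can be sketched as a residual wrapper, assuming the residual form of Wang et al. [42]; `mask_fn` is a placeholder for either attention map.

```python
import numpy as np

def soft_mask(mask_fn, F):
    """Residual ('soft mask') form of an attention step: output
    = (1 + M(F)) * F element-wise, so attention re-weights features
    without ever zeroing them out entirely. `mask_fn` returns an
    attention map M(F) broadcastable to F."""
    return (1.0 + mask_fn(F)) * F

# A constant mask of 0.5 scales every feature by exactly 1.5.
out = soft_mask(lambda F: np.full_like(F, 0.5), np.ones((2, 2)))
```

With sigmoid-valued masks, the multiplier stays in (1, 2), so the identity signal always passes through and repeated attention blocks cannot progressively erase features.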

Accuracy Verification
The classification accuracy of CBAM-P-Net includes training accuracy and testing accuracy. The training accuracy is expressed by last epoch accuracy (LEA). The testing accuracy is expressed by average accuracy (AA, Equation (7)), overall accuracy (OA, Equation (8)), and Kappa coefficient (Kappa, Equation (9)).
where n is the number of categories, X_ii is the number of correct classifications of a category in the error matrix, X_+i is the total number of true reference samples of that category, X_i+ is the total number of samples classified into that category, and M is the total number of samples.
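A minimal sketch of the three measures computed from an n × n confusion matrix under the standard definitions, matching the symbols above (rows as classified results, columns as reference samples).

```python
import numpy as np

def oa_aa_kappa(cm):
    """OA, AA, and Kappa from an n x n confusion matrix `cm`
    (rows = classified, columns = reference): X_ii is the diagonal,
    X_i+ the row sums, X_+i the column sums, M the grand total."""
    cm = np.asarray(cm, dtype=float)
    M = cm.sum()
    diag = np.diag(cm)
    oa = diag.sum() / M                                   # overall accuracy
    aa = (diag / cm.sum(axis=0)).mean()                   # mean per-class accuracy
    pe = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / M**2   # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

# Toy 2-class matrix: 45 + 45 correct out of 100 samples.
cm = [[45, 5], [5, 45]]
oa, aa, kappa = oa_aa_kappa(cm)
```

For this balanced example, OA = AA = 0.9 while chance agreement is 0.5, giving Kappa = 0.8; Kappa discounts the agreement that random assignment would already achieve.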

Experiments Design
In this study, we designed four experimental schemes, as shown in Table 3.

Table 3. Experimental schemes.

A. Classification using the prototypical networks in different windows: rotating (the image of each band is rotated 90, 180, and 270 degrees with the center as the axis) and flipping (the image of each band is flipped up-down and left-right, respectively) the training samples in different windows (from 5 × 5 to 29 × 29); then, using 11-way-5-shot, the optimal number of iterations is sought to train the prototypical networks.

B. CBAM combination strategy selection: B1: channel attention prior to spatial attention. B2: spatial attention prior to channel attention. B3: channel attention parallel with spatial attention.

C. The effect of training set ratio on CBAM-P-Net classification accuracy: divide the sample into proportions, setting the training set to 20%, 40%, 60%, and 80%, with the rest used as the test set.

D. Comparative experiment: using the same sample and computer configuration, train the 2D CNN, 3D CNN, and 3D-1D CNN matching networks and combine the matching networks with CBAM to compare the classification accuracy with CBAM-P-Net.

Classification Using Prototypical Networks in Different Windows
Taking the sample data from the 5 × 5 to the 29 × 29 window as input, the samples were rotated (the image of each band is rotated 90, 180, and 270 degrees with the center point as the axis) and flipped (the image of each band is flipped up-down and left-right, respectively), and the prototypical networks were applied for classification; the results are shown in Table 4. As the sample window increases, the model training time gradually increases. The model training accuracy (LEA), testing accuracy OA, and Kappa are above 95.89%, 79.39%, and 0.7733, respectively. Figure 7 illustrates the changes in classification accuracy across different windows, showing an overall upward trend. The testing accuracy improves markedly from the 5 × 5 to the 17 × 17 window, increases slightly in the 19 × 19 window compared to the 17 × 17 window, decreases from the 21 × 21 to the 23 × 23 window, increases again from the 25 × 25 to the 29 × 29 window, and reaches its maximum in the 29 × 29 window.
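The rotation-and-flip augmentation described above can be sketched with numpy; whether the original patch is kept alongside the five transformed variants is an assumption here.

```python
import numpy as np

def augment(patch):
    """Return the original patch (H, W, B) plus three rotations
    (90/180/270 degrees about the center) and up-down / left-right
    flips, each applied band-wise across the spatial axes."""
    variants = [patch]
    variants += [np.rot90(patch, k, axes=(0, 1)) for k in (1, 2, 3)]
    variants += [np.flipud(patch), np.fliplr(patch)]
    return variants

patch = np.arange(2 * 2 * 1).reshape(2, 2, 1)
out = augment(patch)
```

All six variants share the same center pixel and label, so the augmentation multiplies the training set without introducing label noise.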

Channel Attention Prior to Spatial Attention
Taking channel attention prior to spatial attention (Channel First), the CBAM is inserted between the convolutional blocks of the prototypical networks to classify samples in different windows, and the results are shown in Table 5. The testing accuracy (OA, Kappa) of the model has been improved to a certain extent, with an average increase of 1.17% for OA and 0.0129 for Kappa. Figure 8 illustrates the change trend of the classification accuracy in different windows. It can be seen that the testing accuracy of samples from the 5 × 5 to the 17 × 17 window increases tremendously, decreases slightly from the 19 × 19 to the 23 × 23 window, and successively improves from the 25 × 25 to the 29 × 29 window. The testing accuracy reaches the highest value, and the model training time is short when the sample window size is 17 × 17.

Spatial Attention Prior to Channel Attention
The CBAM with spatial attention prior to channel attention (Spatial First) is inserted between the convolutional blocks of the prototypical networks to classify samples in different windows, and the results are shown in Table 6. The testing accuracy (OA, Kappa) of the model is improved, with OA and Kappa increasing by 0.86% and 0.0096 on average, which is lower than with Channel First. Figure 9 shows the trend of the classification accuracy in different windows. We found that the variation trend of the classification accuracy across windows is consistent with the Channel First results. The testing accuracy reaches its highest value at 29 × 29, followed by 17 × 17. The OA and Kappa at 17 × 17 are about 0.09% and 0.0025 lower than those of the 29 × 29 window, but the training time is nearly halved. Therefore, we consider 17 × 17 to be the best window size for classification, with high testing accuracy and short model training time.

Channel Attention Parallel with Spatial Attention
CBAM with channel attention and spatial attention in parallel (Parallel) is inserted between the convolutional blocks of the prototypical networks to classify samples in different windows. The results are shown in Table 7. The testing accuracy (OA, Kappa) of the model improves to a certain degree, with OA and Kappa increasing by 0.84% and 0.0091 on average, slightly lower than with Spatial First. The trend of classification accuracy across windows under Parallel can be seen in Figure 10. Consistent with Spatial First, the OA and Kappa of the 17 × 17 window are about 0.07% and 0.0014 lower than those of the 29 × 29 window, but the training time is nearly halved, so 17 × 17 is clearly the best window size for classification, with high testing accuracy and a short training time. Therefore, adding CBAM to the prototypical networks plays a positive role in the classification results. From a spatial perspective, channel attention is applied globally and represents feature information, whereas spatial attention is applied locally and represents location information, so the weight of local information is determined on the basis of the global channel information, which favors the sequential Channel First arrangement.
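The three arrangements compared in Tables 5-7 differ only in how the two attention maps are composed. A toy NumPy sketch, with the learned MLP and convolution weights replaced by simple pooling purely for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_map(x):
    """Per-channel weights from globally pooled descriptors (toy version)."""
    return sigmoid(x.mean(axis=(1, 2)))[:, None, None]      # (C, 1, 1)

def spatial_map(x):
    """Per-pixel weights from channel-pooled descriptors (toy version)."""
    return sigmoid(x.mean(axis=0))[None, :, :]              # (1, H, W)

def channel_first(x):
    y = x * channel_map(x)          # refine channels first ...
    return y * spatial_map(y)       # ... then locations (sequential)

def spatial_first(x):
    y = x * spatial_map(x)
    return y * channel_map(y)

def parallel(x):
    # Both maps are computed from the SAME input, then applied jointly
    return x * channel_map(x) * spatial_map(x)

x = np.random.default_rng(1).standard_normal((8, 5, 5))
for f in (channel_first, spatial_first, parallel):
    print(f.__name__, f(x).shape)
```

In the sequential modes, the second attention map is computed from the already-refined features, whereas in Parallel both maps see the raw input, which is why the three arrangements generally produce different outputs.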

The Effect of Training Set Ratio on CBAM-P-Net Classification Accuracy
We inserted CBAM between the convolution blocks of the prototypical networks to construct CBAM-P-Net with Channel First, using the best sample window of 17 × 17. This study classified 11 categories, with 5, 10, and 15 samples per category selected as the support set, so three scenarios (11-way-5-shot, 11-way-10-shot, and 11-way-15-shot) were tested. At the same time, to verify the impact of the number of training samples on the classification accuracy of the prototypical networks, we conducted classification tests using 80%, 60%, 40%, and 20% of all ground-measured samples. The results are shown in Table 8.
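The n-way-k-shot episodes described above can be sampled as follows; this is a minimal sketch with hypothetical class labels and patch identifiers, not the authors' sampling code:

```python
import random

def sample_episode(dataset, n_way=11, k_shot=5, q_query=5):
    """Build one few-shot episode: for each of n_way classes, draw
    k_shot support samples and q_query query samples without overlap.
    `dataset` maps class label -> list of sample patches."""
    classes = random.sample(sorted(dataset), n_way)
    support, query = {}, {}
    for c in classes:
        picked = random.sample(dataset[c], k_shot + q_query)
        support[c] = picked[:k_shot]
        query[c] = picked[k_shot:]
    return support, query

# Hypothetical dataset: 11 tree-species classes, 96 training patches each
data = {f"class_{i}": [f"patch_{i}_{j}" for j in range(96)]
        for i in range(11)}
sup, qry = sample_episode(data, n_way=11, k_shot=5, q_query=5)
print(len(sup), len(sup["class_0"]))   # 11 classes, 5 support samples each
```

Changing `k_shot` to 10 or 15 reproduces the other two scenarios; keeping support and query sets disjoint within an episode prevents trivial matches.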
It can be seen from Table 8 that, as the number of training samples drops from 80% of the measured samples to 20%, the testing accuracy of CBAM-P-Net decreases accordingly. Yet even when only 20% of the samples are used for training, the testing accuracy of CBAM-P-Net still exceeds 90%. From the per-species testing accuracy, it can be seen that, among the coniferous species, C. lanceolata has the highest testing accuracy and P. elliottii the lowest. Among the broad-leaved species, when 80% or 60% of the samples are used for training, the testing accuracy is above 90%; when the training proportion drops to 40% or 20%, the testing accuracy of A. melanoxylon and the soft broadleaved species remains above 90%, but it decreases for the other tree species.

Comparative Experiments
The optimal window of the sample is 17 × 17. Under the same experimental conditions, we compared different models for hyperspectral image classification tasks.
Below are the methods included in our comparison.
3D CNN: Five convolution blocks are used, each including a 3D convolutional layer with a 3 × 3 × 3 kernel, a nonlinear activation function (ReLU), and a batch normalization layer. The first and last convolution blocks contain a max pooling layer, used to rapidly reduce the data dimension. The features output by the last 3D convolutional layer are flattened, and the 3D feature cube is transformed into a 1 × 128 feature vector by the first dense layer. These feature vectors pass through a linear activation function into the second dense layer, yielding features of dimension 1 × 11. A Softmax activation function calculates the probability of each category as the basis for classification [16,39].
3D-1D CNN: This network converts the joint spatial-spectral feature extracted by the last 3D convolutional layer into a 1D feature, thereby reducing the number of training parameters [16].
Matching networks: A weighted nearest-neighbor classifier applied within an embedding space. Training establishes the relationship, or mapping, between labels and samples in the training set and applies it directly to the test set in the same way. We trained the matching networks, and the matching networks combined with CBAM, following the classification scheme of [19].
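The 3D CNN baseline above can be summarized by tracing tensor shapes through its layers. In this sketch the 'same' padding mode, the 2 × 2 × 2 pooling size, and the per-block filter count are assumptions, since the text does not specify them:

```python
import math

def pool(shape, k=2):
    """2 x 2 x 2 max pooling halves each dimension (floor division)."""
    return tuple(s // k for s in shape)

# Trace tensor shapes through the described 3D CNN.
shape = (125, 17, 17)              # (bands, height, width) input patch
filters = 8                        # hypothetical number of filters per block
for block in range(1, 6):
    # A 'same'-padded 3x3x3 conv + ReLU + batch norm leaves the shape
    # unchanged; max pooling is applied only in the first and last blocks.
    if block in (1, 5):
        shape = pool(shape)
    print(f"block {block}: {shape}")

flat = filters * math.prod(shape)  # flatten before the first dense layer
print(f"flattened: {flat} -> dense 128 -> dense 11 -> softmax")
```

The trace makes explicit why pooling is placed in the first and last blocks: it is what keeps the flattened feature vector small enough for the 128-unit dense layer.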
The results are shown in Tables 9 and 10. Among the CNNs, the 3D CNN outperforms the 2D CNN, whereas the 3D-1D CNN reduces training time at the cost of about a 1% loss in accuracy compared with the 3D CNN. However, the few-shot learning methods, i.e., matching networks and prototypical networks, obtained higher testing accuracy than the CNNs when samples were scarce. A comparison of the few-shot methods shows that the training time of the matching networks is much longer than that of the prototypical networks, and their testing accuracy is lower. In the different training scenarios, both the matching networks and the prototypical networks achieve high testing accuracy with 5 and 10 shots, while the testing accuracy is lower with 15 shots. After inserting CBAM into the matching networks and the prototypical networks, the testing accuracy of both models improves markedly, and CBAM-P-Net remains clearly better than CBAM-M-Net in training time and classification performance. In terms of network structure, the trainable parameters of CBAM-P-Net number in the hundreds of thousands, far fewer than the tens of millions in CBAM-M-Net. The per-species testing accuracies likewise show that CBAM-P-Net is higher than CBAM-M-Net.

Classification Results
Taking each pixel as the sample center point and using the best sample window size (17 × 17), classification maps were generated for CBAM-P-Net trained with 20%, 40%, 60%, and 80% of the training samples, and for P-Net trained with 80% of the samples (Figure 11).
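Producing a wall-to-wall map in this way requires a pixel-centered patch for every pixel in the image. A NumPy sketch follows, where reflect-padding at the image border is an assumption (the text does not state how edge pixels are handled):

```python
import numpy as np

def extract_patch(image, row, col, window=17):
    """Extract a window x window patch centred on (row, col) from a
    hyperspectral cube of shape (bands, height, width); edge pixels
    are handled by reflecting the image border."""
    half = window // 2
    padded = np.pad(image, ((0, 0), (half, half), (half, half)),
                    mode="reflect")
    return padded[:, row:row + window, col:col + window]

cube = np.random.rand(125, 100, 100)      # toy 125-band image
patch = extract_patch(cube, 0, 0)         # works even at the image corner
print(patch.shape)                        # (125, 17, 17)
```

The padding keeps the patch shape constant across the scene, so every pixel, including those on the boundary, receives a prediction.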

From the classification maps, it can be seen that when CBAM-P-Net is trained with 20% (Figure 11a), 40% (Figure 11b), and 60% (Figure 11c) of the samples, and when P-Net is trained with 80% of the samples (Figure 11d), the boundaries of the coniferous species C. lanceolata and P. elliottii are not clear, there is confusion between C. hystrix and A. melanoxylon, and E. urophylla and E. grandis are also mixed together. When CBAM-P-Net is trained with 80% of the samples (Figure 11e), C. lanceolata, P. elliottii, E. urophylla, and E. grandis are well distinguished. With fewer training samples, the elliptical box area in Figure 11a,b is classified as C. hystrix; as the number of training samples increases (elliptical boxes in Figure 11c-e), the area is classified as soft broadleaved with high accuracy, which better matches the actual distribution of the tree species.

The Size of the Sample Windows on the Classification of Prototypical Networks
When prototypical networks are used to classify airborne hyperspectral images, different window sizes show significant differences in classification accuracy (Figure 7). Appropriately increasing the window size helps to improve the classification performance of prototypical networks, but an excessively large window may introduce extra noise. In addition, a larger window also increases the time required for network training and prediction. In this study, the classification accuracy improved markedly as the window size increased up to 17 × 17, indicating that the spatial and channel feature extraction of the prototypical networks at this window size basically meets the requirements of high-precision classification. When the window size increased to 25 × 25, the classification accuracy fluctuated, because the data noise introduced by the enlarged window affects classification performance. The classification accuracy then rises from 25 × 25 to 29 × 29, reaching its maximum at 29 × 29. On the one hand, this shows that the spatial characteristics of samples at this window size are sufficient to offset the data noise, reflecting the classification potential of prototypical networks; on the other hand, it shows that the current prototypical networks model still has room for improvement in feature extraction. Improving the feature extraction method so that the feature information of samples in small windows is fully mined is a direction for further study.

The Influence of Convolutional Attention Module on Prototypical Networks
In the field of image classification based on deep learning, the output of each layer in the network can be represented as a three-dimensional feature map. To improve the effectiveness of the image features, the attention mechanism [43] has been applied to few-shot image classification algorithms as a form of image feature enhancement. Previous research has implemented single attention mechanisms [44] and multiple (hybrid) attention mechanisms [45] and compared the two; the experimental results show that models based on a hybrid attention mechanism extract image information more fully and achieve better classification performance [35,36,45]. Therefore, our research focuses on the hybrid attention mechanism. In this experiment, we compared three arrangements of the channel attention and spatial attention submodules in CBAM-P-Net: Channel First, Spatial First, and Parallel. All three combinations improved classification performance, and the best performance was obtained in the 17 × 17 window, which is consistent with our expectations. Since each submodule has a different function, we speculate that the performance improvement comes from accurate attention and the suppression of irrelevant clutter. From a spatial perspective, channel attention is applied globally, while spatial attention is applied locally. When the two attention outputs are combined into a three-dimensional attention map, the sequential mode can be expected to outperform the parallel mode, and Channel First matches the expected optimal arrangement. The average gains of Channel First and Spatial First in OA and Kappa are indeed higher than those of Parallel, and the best classification performance, obtained by Channel First, demonstrates the interpretability of CBAM.

The Effect of the Number of Training Samples on Classification Accuracy
In deep learning algorithms, a large number of training samples covering a variety of scenarios is the guarantee for obtaining an optimal model with sufficient robustness [46]. Therefore, more effective sample information is of great significance for training a high-precision classification model. In this study, we re-divided the sample ratio, and the overall classification performance declined steadily with the reduction of training samples, demonstrating the importance of training samples for the CBAM-P-Net model. When we used 80% (96 samples per class, 1056 samples in total) of the samples for training, we obtained the highest OA of 97.28% and the highest Kappa of 0.9701, whereas with 40% (48 samples per class, 528 samples in total), the highest OA was 94.83% and the highest Kappa 0.9432. In the same research area, Zhang et al. [16] constructed a 3D-CNN model for 12 categories using 5342 training samples, with an OA of 95.74% and a Kappa of 0.9705. Another study [39] used training samples consistent with this paper (96 samples per class, 1056 samples in total), but over-fitting occurred when the original hyperspectral image was classified, giving an OA of 71.08% and a Kappa of 0.6819. This confirms the advantages of the proposed CBAM-P-Net for classifying few-shot data sets. Because the application scenarios differ, this research aims to keep the sample categories balanced, so the number of categories and the number of samples per category differ from the data sets constructed in previous studies. When the training samples are reduced to 20% (24 per class), the classification accuracy still reaches up to 92%, which shows that CBAM-P-Net is practical and that the model still has large room for improvement.

Comparison of Prototypical Networks and Matching Networks
The comparison of the proposed method with those in the literature is presented in Tables 9 and 10. Generally speaking, our method outperforms the others by significant margins in both training time and testing accuracy, mainly owing to the effectiveness of the proposed model. Furthermore, the two few-shot learning methods are compared. Prototypical networks and matching networks have the same structure in the feature extraction part, so CBAM is inserted in the same way in both. In this study, under the same experimental conditions, we constructed prototypical networks and matching networks, as well as CBAM-P-Net and CBAM-M-Net after adding CBAM. Although adding CBAM consumes some additional training time, the accuracy advantages of CBAM-P-Net and CBAM-M-Net clearly show the positive effect of CBAM on feature extraction. In terms of network structure, prototypical networks unify the encoding and classification layers and therefore have fewer parameters than matching networks. This is particularly evident for 125-band hyperspectral images, and it is directly reflected in the large difference in training time between our model and the matching networks. CBAM-P-Net is superior to CBAM-M-Net in the classification accuracy of the different tree species, which not only illustrates the advantage of prototypical networks over matching networks in classification performance, but also shows that the Euclidean distance, which satisfies the Bregman divergence property, is better suited to clustering than the cosine distance.
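The distance comparison above can be made concrete with a minimal NumPy sketch: prototypes are the class means of the support embeddings, scored by negative squared Euclidean distance (a Bregman divergence), with matching-networks-style cosine similarity shown for contrast. The embeddings here are random stand-ins for the CNN encoder output:

```python
import numpy as np

def prototypes(support, labels, n_class):
    """Class prototype = mean of the support embeddings of that class."""
    return np.stack([support[labels == c].mean(axis=0)
                     for c in range(n_class)])

def euclidean_logits(queries, protos):
    """Prototypical networks: negative squared Euclidean distance
    to each prototype; argmax gives the predicted class."""
    d = ((queries[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    return -d

def cosine_logits(queries, protos):
    """Matching-networks-style cosine similarity, for comparison."""
    qn = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    pn = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    return qn @ pn.T

rng = np.random.default_rng(2)
emb_dim, n_class, k_shot = 64, 11, 5
support = rng.standard_normal((n_class * k_shot, emb_dim))
labels = np.repeat(np.arange(n_class), k_shot)
protos = prototypes(support, labels, n_class)
queries = rng.standard_normal((3, emb_dim))
pred = euclidean_logits(queries, protos).argmax(axis=1)
print(protos.shape, pred.shape)            # (11, 64) (3,)
```

Averaging each class into a single prototype is also what keeps the classifier head parameter-free, in contrast to the attention-weighted nearest-neighbor scoring of matching networks.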

Conclusions
The CBAM-P-Net proposed in this paper performs well in forest tree species classification. When the optimized prototypical networks were used to classify nine complex forest tree species, cutover land, and roads in the 74.1 hm2 study area from airborne hyperspectral images, training took only 1209 s, the highest testing OA reached 97.28%, and Kappa reached 0.9701, so the model can be used for regional fine-grained classification and mapping of tree species.

1.
When samples of different window sizes are input into prototypical networks, there is an optimal window for classification. The window size should be determined according to the area size and forest stand distribution pattern.

2.
For the classification of hyperspectral remote sensing images with hundreds of bands, the feature extraction method of conventional prototypical networks is slightly insufficient. Adding CBAM between the convolutional blocks of prototypical networks and configuring channel attention prior to spatial attention (Channel First) can improve the feature extraction efficiency. Thus, the proposed CBAM-P-Net can effectively solve the few-shot classification problem.

3.
Compared with matching networks, prototypical networks have a shorter training time and higher testing accuracy for tree species classification using hyperspectral images. With the number of trainable parameters reduced from tens of millions to about one hundred thousand, the training time of the prototypical networks is shortened by thousands of times, and their classification accuracy is higher. Compared with CBAM-M-Net, CBAM-P-Net shows higher classification accuracy on the different tree species. Therefore, using CBAM-P-Net to classify and map tree species distribution from airborne hyperspectral images can achieve better results.