Triple-Attention-Based Parallel Network for Hyperspectral Image Classification

Abstract: Convolutional neural networks have been highly successful in hyperspectral image (HSI) classification owing to their unique feature expression ability. However, the traditional data partitioning strategy in tandem with patch-wise classification may lead to information leakage and result in overoptimistic experimental insights. In this paper, we propose a novel data partitioning scheme and a triple-attention parallel network (TAP-Net) to enhance the performance of HSI classification without information leakage. The dataset partitioning strategy is simple yet effective in avoiding overfitting, and allows fair comparison of various algorithms, particularly in the case of limited annotated data. In contrast to classical encoder–decoder models, the proposed TAP-Net utilizes parallel subnetworks with the same spatial resolution and repeatedly reuses high-level feature maps of preceding subnetworks to refine the segmentation map. In addition, a channel–spectral–spatial-attention module is proposed to optimize the information transmission between different subnetworks. Experiments were conducted on three benchmark hyperspectral datasets, and the results demonstrate that the proposed method outperforms state-of-the-art methods, with overall accuracies of 90.31%, 91.64%, and 81.35% and average accuracies of 93.18%, 87.45%, and 78.85% on the Salinas Valley, Pavia University, and Indian Pines datasets, respectively. These results illustrate that the proposed TAP-Net is able to effectively exploit the spatial–spectral information to ensure high performance.


Introduction
With the rapid development of hyperspectral imaging technologies, it is feasible to collect hundreds of contiguous narrow spectral bands for each pixel in a scene [1,2]. This abundant spectral and spatial information in hyperspectral remote sensing data has been widely used in a broad range of applications with unprecedented accuracy [3]. Among these applications, hyperspectral image (HSI) classification (or semantic segmentation), which aims at assigning a unique label to each pixel of HSI, is a critical enabling step for land-cover monitoring, ecological science, environmental science, and precision agriculture [4,5]. Even though it has attracted considerable attention, it remains a challenging problem because of the limited number of training samples and the spatial variability of spectral signatures [6].
Both spectral and spatial information should be considered in HSI classification, whereas early HSI classification methods primarily focused on the study of a continuous spectrum in an effort to classify pixels using distinguishable spectral features [7,8]. Typical classifiers include support vector machines (SVMs), dictionary learning, and neural networks [9,10]. However, the classification performance of these methods is usually unsatisfactory for small sample sizes and high-dimensional datasets. It has been demonstrated that spatial information is complementary to spectral features and contributes to the improvement of classification performance [11]. Spatial contextual information can be incorporated through feature combination and decision fusion [4]. For instance, Chen et al. explored spectral-spatial information by flattening a neighboring region into a vector and feeding this vector into classifiers [12]. Fauvel et al. combined morphological information with the original hyperspectral data and concatenated these two attribute vectors into one feature vector [13]. In addition, classification methods based on Markov random fields (MRFs), SVMs, ensemble decision trees, and deep belief networks were recently proposed for decision fusion [13,14]. An MRF or conditional random field was used to enhance spatial smoothing and further refine the classification results [15,16]. However, poor generalization is usually observed when the classification is based on handcrafted features. Their representation power is limited and may not fully capture the abundance of spectral-spatial information.
Recently, deep convolutional neural networks (CNNs) have achieved tremendous success in a broad range of applications, such as speech recognition, gesture recognition, and natural language processing [17][18][19][20]. Their considerable feature extraction power also contributes to their success in HSI classification. For instance, the authors of [1,21] proposed a two-branch network to extract the spectral and spatial features separately, and then used the fused features for classification. However, since the spectral and spatial features are extracted independently, their mutual excitation is generally ignored. Three-dimensional CNNs have allowed the extraction of deep spectral-spatial features by using a 3D convolution kernel [22][23][24]. In [24], a 3D CNN and the Jeffries-Matusita distance were introduced to select effective bands and reduce the redundancy of spectral information. In [22], 3D convolution was combined with a traditional self-encoder and wavelet technology to maximize the extraction of spectral-spatial structure information. Although the robustness of 3D CNNs has been demonstrated in previous works [22,25], the significant increase in learnable parameters introduced by the 3D convolution kernels often precludes their application in cases with limited training samples [26].
In addition to feature extraction, the optimization of classification algorithms has attracted considerable attention. Inspired by the human visual attention process, a large number of attention-based models have achieved remarkable performance in semantic segmentation, pattern recognition, target detection, and other fields [27][28][29]. As shown in Figure 1, three categories of attention mechanisms are used in HSI: channel-wise, spectral-wise, and spatial-wise. For a C × H × W × B-sized HSI feature map, channel-wise attention determines the importance of the feature channels (with the size of C) and adjusts their weights in network propagation. Spectral-wise attention recalibrates the importance of different spectral bands (with the size of B). In contrast to these two mechanisms, spatial-wise attention generates probability maps for each pixel in the H × W region to constrain the pixels in neighboring regions. Various CNNs have been proposed to handle the correlation of channel, spectral, and spatial features from HSI. For instance, in [8,30], spectral-wise attention was introduced to select important bands and enhance the discriminative ability of spectral features, thus improving the classification performance of the trained models. In [21], the authors proposed a 3D-CNN-based double-branch network to extract spectral and spatial information separately, and utilized channel-wise and spatial-wise attention to focus on the most informative features. In [31], a recurrent neural network with an attention module was designed to learn inner spectral correlations within a continuous spectrum, and a CNN with an attention module was proposed to focus on saliency features and spatial dependency in neighboring regions.
Although CNN-based models have exhibited promising performance in HSI classification, certain issues should be addressed.
First, most CNN-based HSI classification methods suffer from potential training-test information leakage, leading to overly optimistic results. For example, in the traditional data partitioning method, the S × S neighborhood of each center pixel within a class is extracted as a patch by a sliding window strategy [32]. Since most of these patches overlap to some extent, information leakage occurs when the patches are assigned to the training or test set. Several strategies have been proposed to avoid information leakage [33,34]. However, owing to their pixel-wise classification strategy, they cannot achieve satisfactory performance without using spatial information. Zou et al. developed a training/test partitioning method using a patch-based algorithm on the input images [32], whereby the original image is divided into blocks that do not overlap, and subsequently multi-class blocks (i.e., blocks with more than one pixel type) are selected as the training set. However, this patch-wise method has the defect that some classes may be missing from the training, validation, or test sets. The construction of a sound training/test partition requires further investigation. Second, the number of training patches is not sufficiently large to train a deep learning framework when the dataset is divided without overlap by the traditional data partitioning method [32]. Although some 1D CNN frameworks can obtain a fair result by using a spectral vector without information leakage [35,36], they cannot achieve satisfactory performance without using spatial information. Compared with the traditional classification framework, fully convolutional networks (FCNs) classify each pixel in the entire patch. Under the premise of the same number of training patches as in the traditional framework, FCNs can utilize more annotated information without overlap and assign all labels to the patches. 
However, FCNs have not been extensively used for pixel classification in HSI, and therefore there is room for considerable improvement in the design of this framework.
Third, although some attention-based frameworks have achieved remarkable performance in HSI classification, most studies have been concerned with the internal architecture of the attention module. They tend to put attention weights in one or two dimensions, and ignore the fact that the HSI is a 3D cube. For instance, in [8], a single-attention module was designed for the spectral dimension. In [21], a double-attention module was proposed to reduce the interference between channel and spatial information. The spectral and spatial dimensions are weighted by the spectral and spatial attention module, respectively, in [31]. The combination of attention mechanisms in one or two dimensions may improve performance. However, it is necessary to integrate all dimensional attention mechanisms for better classification.
To address these issues, we introduce a novel routine for data partitioning and a triple-attention parallel network for HSI classification. The main contributions of this study are as follows:
• We propose an FCN-based parallel network as our baseline. It is composed of four parallel subnetworks with the same spatial resolution, and the high-level feature maps of any subnetwork are reused by anti-cross-layer connectivity to refine the low-level feature maps of the succeeding subnetwork.
• We apply a triple-attention mechanism, consisting of channel-wise, spectral-wise, and spatial-wise attention, between different subnetworks in a parallel network. The attention mechanism filters the feature maps of any subnetwork to obtain stronger spectral-spatial information and more important feature channels as input for the succeeding subnetwork.
• We introduce a novel partitioning method, which can serve as a gold standard for HSI classification. It not only allows designing a framework without information leakage, but also suits actual application scenarios.

Proposed Methods
Herein, we introduce a parallel fully convolutional network based on a triple-attention mechanism. The proposed framework has three key components:
(1) A parallel network is used to replace the mainstream serial CNN structure; it has been verified in [37,38] that this can develop deeper aggregation structures to enhance the feature extraction ability. We set four parallel subnetworks, which are distributed horizontally and represented in different colors in Figure 2.
(2) A variable spectral residual block (VSRB) is proposed to refine the feature maps and adjust the map dimensions at different convolution stages [39]. In each subnetwork, the feature maps from low to high level are refined by the VSRB.
(3) Inspired by the success of multiple-attention mechanisms in computer vision [28,40], a channel-spectral-spatial-attention module (CSSA) is applied to adjust the weights of different dimensions when feature maps are transmitted to the same convolution stage of an adjacent subnetwork.
Figure 2. Channel-spectral-spatial-attention parallel network. Each row and column represents the same subnetwork and the same convolution stage, respectively; VSRB represents the variable spectral residual block; CSSA represents the channel-spectral-spatial-attention module.

The Parallel Network and Anti-Cross-Layer Connectivity
Recent research has demonstrated that cross-layer connectivity and multi-scale context fusion achieve promising performance. Gao et al. (and several other researchers) used densely connected convolutional networks to encourage feature reuse [41]. This widely used strategy achieves satisfactory performance in serial networks; the framework is shown in Figure 3a. U-net series frameworks have achieved considerable success in medical imaging by using multi-scale resolution strategies and cross-layer connectivity [42], as shown in Figure 3b. In HSI classification, the strategy of cross-layer connectivity and multi-scale context fusion is also highly successful [8,21]. In addition, Ma et al. proposed a double-branch network in which one branch is used for spectral information exploration, and the other for spatial feature extraction [21]. An overview of this double-branch framework is shown in Figure 3c. In this paper, we propose a novel parallel CNN framework, which differs from existing approaches, as shown in Figure 3d. In most cases, cross-layer connectivity is designed to encourage feature reuse by incorporating the feature maps of low layers into higher layers, as shown in Figure 3a,b. Although cross-layer connectivity can often improve performance, semantic gaps and performance deterioration may result from the fusion of incompatible feature maps [43]. Building on recent research [37,38], we designed anti-cross-layer connectivity and a basic parallel network. Compared with traditional cross-layer connectivity, anti-cross-layer connectivity fuses refined features with rough maps in a subnetwork, and the fused maps are re-refined through the succeeding subnetwork. In addition, we enable the reuse of feature maps between different subnetworks. The connections between adjacent subnetworks are shown in Figure 3d. We first extract the multi-scale spectral information between different stages in SN1. 
Then, we use the fusion of the high-level feature SN1-3 and lower-level features (SN1-1, SN1-2), which is called anti-cross-layer connectivity, as the input of each stage in subnetwork SN2. In the parallel network, four subnetworks are utilized, and high-level feature maps of the preceding subnetworks are repeatedly reused to refine the segmentation map.

Variable Spectral Residual Block
Residual blocks have been successfully used in various deep learning applications [44]. A residual connection can be expressed as:
$$X_l = F(X_{l-1}) + X_{l-1}$$
where $X_l$ and $X_{l-1}$ are the l-th and (l − 1)-th feature maps, respectively, and the function F is a nonlinear transformation. Through a residual connection, the learning goal of the convolution kernel is converted to $F(X_{l-1}) = X_l - X_{l-1}$, and the residual mapping is easier to optimize than the original function F [45]. In addition, a residual block was also introduced in [45] to reduce computation and training time.
To better refine spectral-spatial features and achieve better performance, we designed the VSRB to match the dimension of the feature maps. In the proposed framework, different stages in the same subnetwork require VSRB refinement, and the VSRB architecture is shown in Figure 4. We employ a 1 × 1 × 1 convolution layer and a batch normalization layer as the first component of the VSRB, and the layers are used to unify the number of channels, downsample spectral bands, and combine information. Then, a residual block is applied to improve feature extraction. Finally, we set a band-correction layer to adjust the band dimension and maintain dimensional consistency in the same stage of different subnetworks. It should be noted that, as the spatial dimension information is limited, we only reduce the number of spectral bands in VSRB.

Triple-Attention Mechanism
For better HSI classification performance, we propose a novel triple-attention module (CSSA) that consists of channel-wise, spectral-wise, and spatial-wise attention. The module is introduced in detail below.

Channel-Wise Attention Module
Different feature map channels in each convolution stage can be regarded as different feature representations [46,47]. A large number of channels are applied to represent different feature maps, but this results in several meaningless feature channels. Therefore, we introduce a channel-wise attention module to increase the weights of useful feature channels and weaken meaningless channels.
The proposed channel-wise attention module is shown in Figure 1a. We assume that the dimension of the input feature maps is C × W × H × B, where C, W, H, and B are the number of channels, the width, the height, and the number of bands of the feature maps in each convolution stage, respectively. The input feature maps are first sent into a global average pooling (GAP) layer to aggregate spectral-spatial information into a C × 1 × 1 × 1 feature vector. Then, this vector passes through a fully connected (FC) layer and a ReLU activation layer to introduce more feature representation nodes. Subsequently, another FC layer is employed to adjust the dimension of the vector back to C × 1 × 1 × 1, and a sigmoid activation function maps the feature vector to a probability vector. Finally, this probability vector is treated as a channel weight and multiplied by the original feature maps. The channel weighting function can be expressed as:
$$F_{channel}(x) = \sigma_{sigmoid}\left(F_{fc2}\left(\sigma_{relu}\left(F_{fc1}\left(F_{GAP}(x)\right)\right)\right)\right) \otimes x$$
where x is the C × W × H × B input feature and $F_{GAP}$ is the global average pooling function. $F_{fc1}$ and $F_{fc2}$ are the first and second FC layers, respectively. $\sigma_{relu}$ is the ReLU function, $\sigma_{sigmoid}$ is the sigmoid activation layer, and ⊗ denotes element-wise multiplication with the weights broadcast over the remaining dimensions.
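The GAP, FC, ReLU, FC, sigmoid sequence described above can be sketched in plain NumPy as follows. This is an illustrative sketch, not the authors' Keras implementation: the hidden width `R`, the random weights, and the function name `channel_attention` are assumptions introduced for the example.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def channel_attention(x, w1, b1, w2, b2):
    """Weight the channels of x, shaped (C, W, H, B).

    w1 (R, C) and w2 (C, R) stand in for the two FC layers; the hidden
    width R is a hypothetical choice, not a value from the paper.
    """
    gap = x.mean(axis=(1, 2, 3))               # GAP -> (C,)
    hidden = np.maximum(w1 @ gap + b1, 0.0)    # first FC + ReLU -> (R,)
    weights = sigmoid(w2 @ hidden + b2)        # second FC + sigmoid -> (C,)
    # Broadcast the per-channel weight over width, height, and bands.
    return x * weights[:, None, None, None], weights


rng = np.random.default_rng(0)
C, W, H, B, R = 4, 5, 5, 6, 8
x = rng.normal(size=(C, W, H, B))
w1, b1 = rng.normal(size=(R, C)), np.zeros(R)
w2, b2 = rng.normal(size=(C, R)), np.zeros(C)
out, weights = channel_attention(x, w1, b1, w2, b2)
print(out.shape, weights.shape)
```

Each channel receives a single scalar weight in (0, 1), so the module can only rescale whole channels; it does not change the shape of the feature maps.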

Spectral-Wise Attention Module
Spectral bands can be represented as a continuous spectral curve that contains the value of each spectrum. For spectral classification, hundreds of spectral bands are directly used as inputs to different convolution stages. This inevitably involves some noise bands, resulting in poor classification performance. Accordingly, we propose a spectral-wise attention module to enable the network to recalibrate the importance of different spectral bands and strengthen useful spectral features.
The proposed spectral-wise attention module is shown in Figure 1b. The C × W × H × B input feature is sent into a convolution and ReLU layer with only one filter to fuse information in different channels, after which the dimension of the feature is 1 × W × H × B. We then apply a convolution with a kernel size of W × H × 1 and the padding strategy set to "valid" (no padding). This operation merges the spatial information and produces a 1 × 1 × 1 × B feature vector whose length equals the number of spectral bands. Subsequently, this vector is sent to a sigmoid function layer to obtain the probability vector. For convenience, the probability vector is upsampled to the size of the input feature. The spectral weighted feature maps are obtained by multiplying the input feature and the probability map. The spectral-wise attention module can be computed as:
$$F_{spectral}(x) = F_{resize}\left(\sigma_{sigmoid}\left(F_{conv}\left(\sigma_{relu}\left(F_{conv}(x)\right)\right)\right)\right) \otimes x$$
where x is the C × W × H × B input feature, $F_{conv}$ is the convolution function (the inner application fuses the channels and the outer application is the W × H × 1 convolution), $F_{resize}$ denotes upsampling, $\sigma_{relu}$ is the ReLU function, $\sigma_{sigmoid}$ is the sigmoid activation layer, and ⊗ denotes element-wise multiplication.
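A minimal NumPy sketch of the spectral-wise steps described above is given below. It assumes that the channel-fusing convolution is a pointwise (1 × 1 × 1) filter, which the text does not specify, and it realizes the upsampling step by broadcasting; the names `spectral_attention`, `w_channel`, and `k_spatial` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def spectral_attention(x, w_channel, k_spatial):
    """Weight the spectral bands of x, shaped (C, W, H, B).

    w_channel: (C,) single pointwise filter that fuses channels (assumed).
    k_spatial: (W, H) kernel of the W x H x 1 convolution without padding.
    """
    # One-filter convolution over channels followed by ReLU -> (W, H, B).
    fused = np.maximum(np.tensordot(w_channel, x, axes=([0], [0])), 0.0)
    # A W x H x 1 kernel without padding collapses the spatial axes -> (B,).
    logits = np.tensordot(k_spatial, fused, axes=([0, 1], [0, 1]))
    band_weights = sigmoid(logits)
    # Broadcasting plays the role of upsampling to the input size.
    return x * band_weights[None, None, None, :], band_weights


rng = np.random.default_rng(1)
C, W, H, B = 4, 5, 5, 6
x = rng.normal(size=(C, W, H, B))
out, band_weights = spectral_attention(
    x, rng.normal(size=C), rng.normal(size=(W, H))
)
print(out.shape, band_weights.shape)
```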

Spatial-Wise Attention Module
Environmental factors such as light, temperature, and humidity have a great influence on hyperspectral imaging. Pixels with similar spectra may belong to different classes, and pixels with different spectra may have the same label. Intra-class inconsistency and inter-class homogeneity greatly affect classification performance, and using spatial information to constrain neighboring region pixels is the key to addressing this [47]. Herein, a spatial-wise attention module is introduced to strengthen the associativity between adjacent pixels.
The proposed spatial-wise attention module is shown in Figure 1c. In contrast to the channel-wise and spectral-wise attention modules, the spatial-wise attention module is designed by using a double-branch strategy. The two branches take feature maps of different sizes as input. Branch 1 uses the original feature maps to generate weights for each pixel of the input image. The spatial dimension of the feature maps is downsampled by a factor of two, and the downsampled feature maps are sent to Branch 2 to obtain the probability maps. In both branches, we first use convolution with only one filter to aggregate the channel information, as in the design of the spectral-wise attention module. The feature dimensions of the two branches are 1 × W × H × B and 1 × (W/2) × (H/2) × B. Then, the sigmoid activation function is applied to generate probability maps. The probability result of Branch 1 directly weights the original feature maps. The probability maps of Branch 2 are first upsampled to 1 × W × H × B, and then multiplied by the weighted feature maps of Branch 1. Thus, the pixels in each 2 × 2 region share the same weights, thereby increasing the probability that adjacent pixels belong to the same class. The spatial-wise attention module can be expressed as:
$$F_{spatial}(x) = F_{resize}\left(\sigma_{sigmoid}\left(\sigma_{relu}\left(F_{conv}(x_{ds})\right)\right)\right) \otimes \sigma_{sigmoid}\left(\sigma_{relu}\left(F_{conv}(x)\right)\right) \otimes x$$
where x is the C × W × H × B input feature, and $x_{ds}$ represents the feature maps after downsampling. $F_{conv}$ is the convolution function, $F_{resize}$ denotes upsampling, $\sigma_{relu}$ is the ReLU function, $\sigma_{sigmoid}$ is the sigmoid activation layer, and ⊗ denotes element-wise multiplication.
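The double-branch design can be illustrated with the NumPy sketch below. Several details are assumptions made only for the example: the downsampling is implemented as 2 × 2 average pooling, the one-filter convolution is pointwise, a ReLU precedes the sigmoid, and upsampling is nearest-neighbor repetition. The function and argument names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def spatial_attention(x, w_full, w_half):
    """Weight the pixels of x, shaped (C, W, H, B); W and H must be even.

    w_full, w_half: (C,) pointwise filters of Branch 1 and Branch 2 (assumed).
    """
    C, W, H, B = x.shape
    # Branch 1: per-pixel probability map at full resolution -> (W, H, B).
    p1 = sigmoid(np.maximum(np.tensordot(w_full, x, axes=([0], [0])), 0.0))
    # Branch 2: 2 x 2 average pooling (assumed) -> (C, W/2, H/2, B).
    x_ds = x.reshape(C, W // 2, 2, H // 2, 2, B).mean(axis=(2, 4))
    p2 = sigmoid(np.maximum(np.tensordot(w_half, x_ds, axes=([0], [0])), 0.0))
    # Nearest-neighbor upsampling: each 2 x 2 region shares one weight.
    p2_up = np.repeat(np.repeat(p2, 2, axis=0), 2, axis=1)
    return x * p1[None] * p2_up[None], p1, p2_up


rng = np.random.default_rng(2)
C, W, H, B = 3, 4, 4, 5
x = rng.normal(size=(C, W, H, B))
out, p1, p2_up = spatial_attention(x, rng.normal(size=C), rng.normal(size=C))
print(out.shape, p1.shape, p2_up.shape)
```

The shared weight inside every 2 × 2 region is what encourages adjacent pixels to receive the same label.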

Aggregation of Attention Modules
To take full advantage of the information in each dimension, we design the triple-attention block to aggregate channel, spectral, and spatial information.
As shown in Figure 5, when the high-level feature maps of the preceding subnetwork are fed into the CSSA module, they first pass through the channel-wise attention module. Then, the generated channel-wise attention probability vector is multiplied by the low-level feature, and the result is sent to the spectral-wise and the double-branch spatial-wise attention modules. Thereafter, the channel-wise, spectral-wise, and spatial-wise attention maps are resampled to the same size and averaged. Finally, the low-level feature is multiplied by the average weight, and the result is added to the low-level feature as the output of the CSSA module. The strategy of aggregating three weights into one is equivalent to generating a weight of the same size as that of the low-level feature, and then weighting each value of the 3D feature maps. Compared with previous single- or double-attention mechanisms, the triple-attention mechanism comprehensively considers each dimension in the feature extraction process and adjusts the weight of the corresponding dimension.
Figure 5. The triple-attention module, involving channel-, spectral-, and spatial-attention (CSSA). GAP, FC, K, and S represent global average pooling, the fully connected layer, the kernel size, and the stride, respectively. The triple-attention module (CSSA) has two inputs: (1) the feature maps at the highest level of the preceding subnetwork, and (2) the corresponding low-level feature maps of the same stage. The total weight is generated by aggregating the three types of attention modules.
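The averaging and residual addition described above can be sketched as follows. The sketch assumes the three weights have already been computed, with shapes (C,), (B,), and (W, H, B) for the channel-wise, spectral-wise, and spatial-wise maps, and that bringing them to the same size amounts to broadcasting; these shapes and the name `aggregate_attention` are assumptions for illustration.

```python
import numpy as np


def aggregate_attention(low, w_channel, w_spectral, w_spatial):
    """Average three attention weights and apply them to low, (C, W, H, B).

    w_channel: (C,), w_spectral: (B,), w_spatial: (W, H, B) (assumed shapes).
    """
    shape = low.shape
    a_c = np.broadcast_to(w_channel[:, None, None, None], shape)
    a_b = np.broadcast_to(w_spectral[None, None, None, :], shape)
    a_s = np.broadcast_to(w_spatial[None, :, :, :], shape)
    avg = (a_c + a_b + a_s) / 3.0     # one weight per value of the 3D maps
    return low * avg + low            # weighted feature plus residual


rng = np.random.default_rng(3)
C, W, H, B = 3, 4, 4, 5
low = rng.normal(size=(C, W, H, B))
out = aggregate_attention(
    low, rng.uniform(size=C), rng.uniform(size=B), rng.uniform(size=(W, H, B))
)
print(out.shape)
```

Because the weighted feature is added back to the low-level feature, an attention weight of zero leaves the input unchanged rather than suppressing it entirely.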

Data Partition
Traditional patch-based partitioning methods have been verified to have potential test information leakage in [33]. The training/test set division greatly affects the fairness of model comparison, the evaluation of framework performance, and the feasibility of practical application. Accordingly, there is an urgent need for a data-partitioning approach that allows the fair comparison of various models. To this end, some researchers have explored data-partitioning methods that do not lead to data leakage [32]. However, in [32], two potential issues should be addressed. First, some classes are missing in the training or test set of several experiments, such as C7 and C9 in the Indian Pines dataset. Second, the partitioning results of the training set indicate that the labeled pixels are clustered in each training patch. In practical applications, it is difficult to accurately mark all pixels in a neighborhood.
To address these issues, we propose a novel data partitioning standard for future studies in HSI classification. The proposed method first partitions the original image into nonoverlapping training and test blocks, and then randomly selects pixels as training pixels from the training blocks. Thereby, there is no information leakage because of the lack of overlap between the training and test blocks. The details are as follows. Before data partitioning, we determine the ratio $\omega_t$ of training pixels in the original image, and the number $N_p$ of labeled training pixels in each pre-partitioned training block. Therefore, the number of training blocks in each class can be expressed as:
$$N_{blocks}(i) = \frac{\omega_t \, N_{pixels}(i)}{N_p}, \quad i = 1, 2, \ldots, c$$
where $N_{blocks}(1), N_{blocks}(2), \ldots, N_{blocks}(c)$ are the numbers of training blocks in each class ($N_{blocks}$ is set to be at least 1 in each class), and $N_{pixels}(1), N_{pixels}(2), \ldots, N_{pixels}(c)$ refer to the number of pixels of each class in the original image. The data partitioning process is shown in Figure 6. We first divide the original image into random partitioning blocks and the remaining image, as shown in Figure 6a,d. Subsequently, we randomly reserve $N_p$ labeled pixels in the random partitioning blocks and save the reserved blocks as training blocks, as shown in Figure 6b. The remaining labeled pixels in the random partitioning blocks form the leaked image, as shown in Figure 6c. Then, as shown in Figure 6e,f, we randomly divide the remaining image into validation blocks and the remaining test image. Finally, we partition the leaked image and the test image into leaked patches and test patches, respectively. In addition, we apply the same sliding window strategy as in [32] to the training and validation blocks to obtain more training and validation patches. Furthermore, traditional data augmentation methods (i.e., flip up/down, flip right/left, and rotate right/left by an angle of π/2 or π) were applied to the training and validation patches. 
In order to ensure the randomness in the expanded dataset, each patch will be augmented with only two of the three methods.
In the proposed partitioning method, there is no overlap between the training, validation, and test sets, and a comparison between the leaked and the test set can verify the seriousness of the potential information leakage. Moreover, this method is suitable for practical applications because it is easy to identify several pixel labels that are randomly distributed in a training block area. Furthermore, labeled pixels in a patch can also enable FCN models to extract spectral-spatial information for improved performance.
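The bookkeeping behind this partitioning scheme can be sketched with the Python standard library as follows. The sketch makes assumptions that the text leaves open: the block count is rounded to the nearest integer before the minimum of 1 is enforced, blocks are laid out on a regular grid, and the helper names are hypothetical. It does not reproduce the per-class block selection shown in Figure 6.

```python
import random


def training_blocks_per_class(pixels_per_class, omega_t, n_p):
    """Number of training blocks per class, at least 1 each.

    Rounding to the nearest integer is an assumption of this sketch.
    """
    return [max(1, round(omega_t * n / n_p)) for n in pixels_per_class]


def block_origins(height, width, block):
    """Top-left corners of nonoverlapping block x block tiles (regular grid)."""
    return [
        (r, c)
        for r in range(0, height - block + 1, block)
        for c in range(0, width - block + 1, block)
    ]


def split_block_pixels(labeled_coords, n_p, seed=0):
    """Reserve n_p labeled pixels for training; the rest become 'leaked'."""
    rng = random.Random(seed)
    reserved = rng.sample(labeled_coords, min(n_p, len(labeled_coords)))
    reserved_set = set(reserved)
    leaked = [p for p in labeled_coords if p not in reserved_set]
    return reserved, leaked


# Hypothetical class sizes, a 5% training ratio, and 10 pixels per block.
counts = training_blocks_per_class([2000, 50, 10000], omega_t=0.05, n_p=10)
origins = block_origins(145, 145, block=6)
coords = [(r, c) for r in range(12) for c in range(12)]
reserved, leaked = split_block_pixels(coords, n_p=10)
print(counts, len(origins), len(reserved), len(leaked))
```

Because the tiles never overlap, a patch cut from a training block cannot share pixels with a patch cut from a test block, which is the property the partitioning method relies on.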

Evaluation Metrics
To evaluate the performance of the proposed TAP-Net, three popular evaluation metrics, namely overall accuracy (OA), average accuracy (AA), and the Kappa coefficient, were employed in this study. The OA metric is the ratio of correct predictions to the total number of test pixels [48]. The AA metric is designed to evaluate the average performance across different classes [48]. The Kappa coefficient is calculated from a confusion matrix, and it represents a statistical measurement of agreement between the predictions and the ground truth [48]. For all three metrics, higher values represent better classification performance. Taking $M \in \mathbb{R}^{N \times N}$ as the confusion matrix, $M_{i,j}$ represents the number of i-th class pixels that are predicted to be j-th class [49]. These three metrics are defined as follows:
$$OA = \frac{\sum_{i=1}^{N} M_{i,i}}{\sum_{i=1}^{N}\sum_{j=1}^{N} M_{i,j}}$$
$$AA = \frac{1}{N}\sum_{i=1}^{N}\frac{M_{i,i}}{\sum_{j=1}^{N} M_{i,j}}$$
$$Kappa = \frac{OA - p_e}{1 - p_e}, \quad p_e = \frac{\sum_{i=1}^{N}\left(\sum_{j=1}^{N} M_{i,j}\right)\left(\sum_{j=1}^{N} M_{j,i}\right)}{\left(\sum_{i=1}^{N}\sum_{j=1}^{N} M_{i,j}\right)^{2}}$$
where $p_e$ is the agreement expected by chance.
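As an illustration, the three metrics can be computed from a confusion matrix with a few lines of NumPy. The sketch follows the convention stated above (rows are true classes, columns are predicted classes) and uses the standard definition of Cohen's kappa; the 2 × 2 matrix is a made-up example, not data from the experiments.

```python
import numpy as np


def classification_metrics(confusion):
    """Return (OA, AA, Kappa) for a square confusion matrix.

    confusion[i, j] counts pixels of true class i predicted as class j.
    Every class is assumed to have at least one test pixel.
    """
    m = np.asarray(confusion, dtype=float)
    total = m.sum()
    diag = np.diag(m)
    oa = diag.sum() / total
    aa = np.mean(diag / m.sum(axis=1))                        # mean per-class recall
    p_e = (m.sum(axis=1) * m.sum(axis=0)).sum() / total ** 2  # chance agreement
    kappa = (oa - p_e) / (1.0 - p_e)
    return oa, aa, kappa


# Hypothetical two-class confusion matrix.
oa, aa, kappa = classification_metrics([[50, 10], [5, 35]])
print(round(oa, 4), round(aa, 4), round(kappa, 4))
```

AA weights every class equally, so it is more sensitive than OA to errors on small classes, which matters for imbalanced scenes such as Indian Pines.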

Experimental Dataset
In the experiments, three well-known hyperspectral datasets were used for testing: the Salinas Valley, Pavia University, and Indian Pines datasets. Five-fold cross validation was carried out in the experiments. Moreover, each fold was repeated three times to reduce the effect of randomness. To obtain a fair comparison, we also maintained almost the same amount of data usage as in [32]. The details regarding the three datasets and the settings are as follows:
Salinas Valley: This dataset was obtained by the AVIRIS sensor (with a resolution of 3.7 m/pixel) in Salinas Valley, CA, USA. The spatial size of the dataset is 512 × 217 pixels, the number of spectral bands is 224, and the number of ground-truth classes is 16. The five cross-validation folds were randomly partitioned, and the training pixels can be seen in Figure 7a(1-5), whereas the ground truth is shown in Figure 7a(0).
Pavia University: This dataset was captured by the ROSIS sensor in Pavia, northern Italy. The sensor initially recorded 115 bands, and 12 noisy bands were removed. It has 610 × 340 pixels with a high resolution of 1.3 m/pixel and 9 land-cover classes. Figure 7b(0-5) shows a visualization of the dataset. Specifically, Figure 7b(0) shows the ground truth, and Figure 7b(1-5) shows the five random folds.
Indian Pines: This dataset was obtained by the AVIRIS sensor in Indian Pines, northwestern Indiana. The sensor captures 220 spectral bands with a resolution of 20 m/pixel. In this dataset, 20 water absorption bands were removed, and 200 spectral bands remained with a spatial size of 145 × 145 pixels. The ground truth and the distribution of the five folds are shown in Figure 7c(0-5).
To obtain a fair comparison with other methods, we used the same number of training pixels as in [32]. As shown in Tables 1-3, we divided the labeled pixels into training, validation, leaked, and test sets. Owing to the randomness of the data partitioning method, the values in Tables 1-3 46.41%, and 72.16% compared with test pixels. In fact, the number of leaked pixels in the traditional centered partitioning method is significantly greater, and therefore the resulting information leakage may yield more overoptimistic results.

Parameter Setting and Network Configuration
In data partitioning, considering the different spatial sizes and the limited pixels in each dataset, we set different block and patch sizes when dividing different datasets. In Salinas Valley, we selected a block size of 12 × 12 and a patch size of 10 × 10 to ensure the extraction of spatial information. This selection can also provide a sufficient number of training patches by using the traditional sliding window strategy. The number of labeled pixels in each training block ($N_p$) was set to 10 in the experiment. Similarly, the block and patch sizes were 12 × 12 and 10 × 10 for Pavia University and 6 × 6 and 4 × 4 for Indian Pines, and the parameter $N_p$ was set to 15 and 8, respectively. The ratio $\omega_t$ in the three datasets is a free variable that is used to control the proportions of the training, validation, and test sets.
In the framework design, we used nearly the same parameter settings for the three datasets. As shown in Table 4, with the Salinas Valley as an example, the triple-attention parallel network consists of four subnetworks, and each subnetwork is divided into four stages. In each subnetwork, two neighboring stages are connected by a VSRB for efficient feature extraction. The kernel size of the VSRB and the output size of the feature maps in each stage are shown in Table 4. In adjacent subnetworks, the feature maps in the preceding subnetwork are filtered by the triple-attention module and are fused with the maps in the corresponding stage in the succeeding subnetwork. For example, stages 1, 2, and 3 of subnetwork 1 are fused with stages 1, 2, and 3 of subnetwork 2, respectively. In the kernel size settings, the first stage of each subnetwork is an aggregation layer. A convolution kernel with a size of 3 × 3 × 3 is used to aggregate the channel-spectral-spatial information. In the remaining stages, a 1 × 1 × 3 convolution kernel is used to enhance the extraction of spectral information.
TAP-Net was implemented in Python 3.6 with Keras 2.2.4. Adam (beta_1 = 0.9, beta_2 = 0.999, epsilon = 1 × 10^−8) was employed as the optimizer. The learning rate was initially set to 0.01 and was reduced to 1/10 of its previous value after every 20 epochs. The batch size was set to 32. We used focal loss as the loss function, which is designed to reduce the impact of easy samples [50]. All the experiments were performed with the same configuration on a platform with an Intel i7-8700K CPU, 32 GB of RAM, and an NVIDIA GeForce GTX 1080 GPU.
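The learning-rate schedule and the loss described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, epochs are assumed to be counted from 0, and the focal-loss parameters gamma and alpha are placeholder assumptions, since the text does not report the values that were used.

```python
import math


def step_decay(epoch, initial_lr=0.01, drop=0.1, epochs_per_drop=20):
    """Learning rate for a 0-based epoch index.

    The rate starts at initial_lr and is multiplied by `drop`
    (1/10 here) after every `epochs_per_drop` epochs.
    """
    return initial_lr * drop ** (epoch // epochs_per_drop)


def focal_loss(probs, true_class, gamma=2.0, alpha=1.0):
    """Focal loss for one sample: -alpha * (1 - p_t)**gamma * log(p_t).

    probs      : predicted class probabilities for the sample
    true_class : index of the ground-truth class
    gamma      : focusing parameter (assumed value, not from the paper)
    alpha      : weighting factor (assumed value, not from the paper)
    """
    # Clip to avoid log(0) for a confidently wrong prediction.
    p_t = min(max(probs[true_class], 1e-7), 1.0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    for epoch in (0, 19, 20, 40):
        print(epoch, step_decay(epoch))
    # An easy sample (p_t = 0.9) contributes far less than a hard one (p_t = 0.3).
    print(focal_loss([0.9, 0.1], 0), focal_loss([0.3, 0.7], 0))
```

With gamma = 0 the expression reduces to ordinary cross-entropy; larger gamma values shrink the contribution of well-classified samples, which is the behavior the text attributes to focal loss.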

Classification Result
We compare TAP-Net with other state-of-the-art CNN-based methods that do not exhibit information leakage. Furthermore, to evaluate the contribution of each module to the final performance, we conducted experiments with and without the parallel network and the triple-attention module. In the comparison, we used the following networks:
VHIS was proposed in [33], where the information leakage problem in traditional CNN-based methods was explained. The authors introduced a 1D CNN-based network to extract spectral features and classify pixels without information leakage.
DA-VHIS was introduced in [34], where three data augmentation methods were proposed to generate more training samples. We report the best performance of these three data augmentation methods in the comparison.
Auto-CNN was proposed in [51], where a 1D Auto-CNN was applied to optimize the classifier without test information leakage. The best framework is automatically selected for HSI classification.
SS3FCN was introduced in [32], where a data partitioning strategy was designed to avoid information leakage. This strategy also allows the exploration of spectral-spatial information by the proposed 1D and 3D double-branch networks.
SerialNet is shown in Figure 8a. It is a serial network with the same number of stages as the proposed TAP-Net. Moreover, VSRB modules are used for feature extraction, and the corresponding parameter settings and network configuration are the same as in TAP-Net.
ParallelNet is shown in Figure 8b. ParallelNet, which consists of VSRB modules and the parallel module, is compared with SerialNet to validate the contribution of the parallel module. The hyperparameters and network configuration are the same as in TAP-Net. The main difference between ParallelNet and TAP-Net is that ParallelNet has no attention module.

Results on Salinas Valley
The average classification results and standard deviations on the Salinas Valley dataset are listed in Table 5. The proposed method achieves the best performance while using as many training samples as SS3FCN and fewer samples than VHIS, GAN-VHIS, and AutoCNN. Compared with traditional spectral classification methods (VHIS, GAN-VHIS, and AutoCNN), TAP-Net, by using spectral-spatial information, improves accuracy by more than 10%. Considering the accuracy of the frameworks under comparison, the proposed framework is more robust even with a small number of training samples. Figure 9 shows the classification maps of certain methods only, as the other methods are not described in sufficient detail to reproduce their classification maps.

Results on Pavia University
The OA, AA, and class-specific accuracy obtained by different methods on the Pavia University dataset are shown in Table 6. As in the case of Salinas Valley, SerialNet, ParallelNet, and the proposed TAP-Net achieve significantly better OA & AA than VHIS, GAN-VHIS, AutoCNN, and SS3FCN. In addition, the data partitioning method, the parallel framework, and the triple-attention module each contribute to the improvement. In the comparison of each sub-module, the accuracy of C6 increased from 23.59% to 56.58% by introducing the proposed data partitioning method. Finally, compared with the state-of-the-art spectral-spatial framework (SS3FCN), TAP-Net achieved an OA and AA that are higher by 11.75% and 10.85%, respectively. Figure 10 shows the classification maps obtained by some of the methods under comparison.

Results on Indian Pines
The overall classification results and the accuracy of each class are shown in Table 7. In the Indian Pines dataset, the number of pixels in each class is quite imbalanced, and this significantly affects the results of the experiment. For example, VHIS, GAN-VHIS, AutoCNN, and SS3FCN only obtain an accuracy of 0 & 23.8% & 35.63% 20.14% & 21% for the C7 class. The same occurs in the classification of C1 and C9. These classes do not contain enough samples for the other frameworks to learn discriminative features. In contrast, TAP-Net achieves an accuracy of 70.98%, 80.40%, and 70.03% in the classification of C1, C7, and C9, respectively. This demonstrates that TAP-Net can better extract features in these small classes and achieve higher classification accuracy with almost the same number of training pixels. In addition, the proposed framework achieves the best performance, with an OA & AA & Kappa of 81.35% & 78.85% & 0.787, which is significantly better than the performance of the other frameworks. The classification maps of some frameworks are shown in Figure 11.
Table 7. Classification results for the Indian Pines dataset, including per-class, overall (OA), and average (AA) accuracy (in %), and the Kappa scores.

Analysis and Discussion
The results in Section 4 demonstrate that the proposed method yields the best results in HSI classification, particularly for classes that are difficult to classify using other methods. Herein, we explore the effect of various factors on model performance. The leakage of test information, the block-patch size, the number of labeled pixels in each block, and the effect of different attention mechanisms are discussed for future reference.

Effect of Information Leakage
We now revisit the information leakage problem, which may reduce the credibility of results obtained with traditional data partitioning strategies and classification networks. As shown in Figure 12, we compare the average results on leaked and non-leaked test sets produced by the same trained model. "Real acc" is obtained from non-leaked test samples that do not overlap with the training blocks, whereas "Leaked acc" is obtained from samples with information leakage. In Figure 12a, the accuracy of all 16 classes, as well as the OA and AA, is lower for Real acc than for Leaked acc. The largest gap between Real acc and Leaked acc, more than 10%, occurs in the classification of C15. Moreover, Leaked acc is optimistic by approximately 3% in OA & AA. In Figure 12b, C3, C6, and C7 are the main sources of accuracy distortion; the largest performance gap between the two test sets reaches up to 25%. The OA & AA values also increase from 91.64% & 87.45% to 97.91% & 95.65%. The most serious accuracy distortion occurs in the Indian Pines dataset, as shown in Figure 12c, where almost all classes exhibit highly optimistic performance. The OA & AA in this dataset increase by 12.08% & 12.47% on the leaked test set relative to the real test set. In general, we hope that this accuracy comparison between the real and leaked test sets will help researchers recognize and avoid the information leakage problem in future studies.
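The comparison between "Real acc" and "Leaked acc" amounts to evaluating the same predictions on two disjoint subsets of the test pixels. The following sketch shows one way to compute OA, AA, and Kappa for both subsets. The function names and the boolean `leaked` flag per pixel are assumptions made for illustration, AA is taken as the mean of the per-class accuracies, and both subsets are assumed to be non-empty.

```python
from collections import Counter


def classification_scores(y_true, y_pred):
    """Return (OA, AA, Kappa) for two equal-length label sequences."""
    n = len(y_true)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    # Overall accuracy: fraction of correctly classified pixels.
    oa = sum(correct.values()) / n
    # Average accuracy: mean of per-class accuracies (classes present in y_true).
    aa = sum(correct[c] / true_counts[c] for c in true_counts) / len(true_counts)
    # Cohen's kappa: agreement corrected for chance agreement.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    kappa = 1.0 if expected == 1.0 else (oa - expected) / (1.0 - expected)
    return oa, aa, kappa


def real_vs_leaked(y_true, y_pred, leaked):
    """Score non-leaked ("real") and leaked test pixels separately."""
    scores = {}
    for name, flag in (("real", False), ("leaked", True)):
        pairs = [(t, p) for t, p, is_leaked in zip(y_true, y_pred, leaked)
                 if is_leaked == flag]
        scores[name] = classification_scores([t for t, _ in pairs],
                                             [p for _, p in pairs])
    return scores


if __name__ == "__main__":
    # Toy labels for illustration only; these are not results from the paper.
    y_true = [0, 0, 1, 1, 0, 1]
    y_pred = [0, 1, 1, 0, 0, 1]
    leaked = [False, False, False, False, True, True]
    print(real_vs_leaked(y_true, y_pred, leaked))
```

Reporting both sets of scores side by side makes any optimistic bias of the leaked subset directly visible.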

Effect of Block-Patch Size
To achieve the best classification results, we now discuss the effect of different block-patch sizes. In this study, blocks are used to select the training and validation data. The size of these blocks determines the spatial information contained in the training and validation samples. Moreover, the number of training and validation samples is controlled by the patch size in the sliding-window strategy. We empirically set the difference between the block and patch sizes to 2 in the comparison experiment. The results on the three datasets are shown in Table 8. For example, on the Salinas Valley dataset, the proposed TAP-Net achieves the highest accuracy, with an OA/AA of 90.31%/93.18%, when the block-patch size is 10-8. Performance decreases gradually as the block-patch size increases or decreases. In fact, the spatial information in a training patch decreases if a smaller block size is used. Conversely, the correlation between pixels decreases when the same number of labeled pixels is randomly distributed in larger blocks. Similar results are obtained on the other datasets; on the Pavia University and the Indian Pines (6-4) datasets, the best performance is achieved, with an OA & AA of 91.64% & 87.45% and 81.35% & 78.85%, respectively. It should be noted that when the block-patch size is set to 10-8 for Indian Pines, all the labeled pixels appear in the training set, and thus no test results are obtained.
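The relation between block size, patch size, and the number of patches produced by a sliding window can be made concrete with a short sketch. The stride of 1 is an assumption (the stride is not stated in this section), and the function names are hypothetical.

```python
def patches_per_block(block_size, patch_size, stride=1):
    """Number of square patches a sliding window extracts from one square block."""
    if patch_size > block_size:
        return 0
    per_axis = (block_size - patch_size) // stride + 1
    return per_axis ** 2


def patch_origins(block_size, patch_size, stride=1):
    """Top-left (row, col) offsets of every patch inside the block."""
    last = block_size - patch_size
    return [(r, c)
            for r in range(0, last + 1, stride)
            for c in range(0, last + 1, stride)]


if __name__ == "__main__":
    for block, patch in ((12, 10), (10, 8), (6, 4)):
        print(f"{block}-{patch}: {patches_per_block(block, patch)} patches per block")
```

Under the stride-1 assumption, a block-patch difference of 2 yields (2 + 1)^2 = 9 patches per block regardless of the block size, so changing the block-patch size mainly changes how much spatial context each patch carries rather than how many patches a block provides.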

Effect of the Number of Labeled Pixels in Each Block
The number of labeled pixels in each block (Np in the data partitioning part) determines the distribution of the training pixels. In this study, the numbers of training, validation, and test pixels each have a fixed upper limit. For example, in the training set, the total number of training blocks decreases if more labeled pixels appear in each training block. As the number of sampling blocks in a class decreases, the locations of the partitioning blocks within the class become unbalanced. Conversely, as the percentage of labeled pixels in a training block decreases, more blocks must be partitioned to ensure that the same number of pixels is used for training. However, the continuity of spatial information within a block is then reduced because fewer pixels are labeled. The balance between the number of blocks and the number of pixels per block is therefore important in classification. Figure 13 shows the effect of different Np values on the different datasets. The Salinas Valley, Pavia University, and Indian Pines datasets achieve the best performance when Np is set to 10, 15, and 8, respectively.

Advantages and Limitations
The problem of information leakage leads to overoptimistic performance, which might be unreliable. In this paper, a novel data partitioning strategy without information leakage and a well-designed deep learning architecture are proposed. Although the proposed modules are meaningful for HSI classification, there are still a few limitations that need to be addressed in future research.

Impact of the Attention Module
To verify the effectiveness of the triple-attention module, we added a channel-spectral attention module (ParallelNet-CS), a spectral-spatial attention module (ParallelNet-SS), and the triple-attention module (TAP-Net) to ParallelNet for comparison. For simplicity, we take Pavia University as an example, and the results of these attention-based ParallelNet variants are shown in Table 9. The proposed triple-attention module provides the best performance, with an OA & AA of 91.64% & 87.45%, which is significantly better than that of the other attention modules. Moreover, the smaller standard deviation of the OA, AA, and Kappa scores of TAP-Net indicates better stability under the same parameter settings. Similar results, shown in the supplementary material, were observed on the Salinas Valley and Indian Pines datasets.
Statistical testing can indicate whether the distribution of one set of results differs from that of another. In this study, we performed a two-tailed Wilcoxon test over the per-class AA to verify whether the differences between the investigated modules are statistically significant. The statistical differences between SerialNet, ParallelNet, ParallelNet-SS, ParallelNet-CS, and TAP-Net are shown in Table 10, which demonstrates that the proposed triple-attention module yields a significant improvement (p-value < 0.05) over SerialNet, ParallelNet, ParallelNet-CS, and ParallelNet-SS. Similarly, the channel-spectral and spectral-spatial attention modules also delivered a large improvement compared with SerialNet and ParallelNet without attention. However, there is no statistical difference between ParallelNet-SS and ParallelNet-CS, although the performance of ParallelNet-CS is slightly better than that of ParallelNet-SS.
Data partitioning has a great influence on classification performance, as the distribution of the partitioned training and test sets might be balanced or unbalanced (i.e., some classes are missing from the training or test set).
In contrast to VHIS [33], our proposed method is able to provide more balanced data splits. As shown in Table 11, the proposed TAP-Net achieves better performance with both data split strategies. This demonstrates the importance of the data split strategy and the effectiveness of the proposed TAP-Net.
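The two-tailed Wilcoxon test over per-class accuracies mentioned in this section can be sketched in pure Python as follows. This is an illustrative exact signed-rank test for small paired samples, not the authors' code. It assumes that zero differences are dropped and that tied absolute differences receive average ranks, and it enumerates all sign patterns, which is practical only for a small number of classes.

```python
from itertools import product


def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (statistic, p_value) with statistic = min(W+, W-).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences; ties share the average of their ranks.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    statistic = min(w_plus, total - w_plus)

    # Under the null hypothesis every sign pattern is equally likely;
    # count the patterns that are at least as extreme as the observed one.
    hits = 0
    for signs in product((0, 1), repeat=n):
        wp = sum(r for r, s in zip(ranks, signs) if s)
        if min(wp, total - wp) <= statistic:
            hits += 1
    return statistic, hits / 2 ** n


if __name__ == "__main__":
    # Made-up per-class accuracies for two models; not values from Table 10.
    model_a = [90.0, 85.0, 70.0, 95.0, 88.0, 60.0]
    model_b = [80.0, 84.0, 65.0, 92.0, 80.0, 40.0]
    print(wilcoxon_signed_rank(model_a, model_b))
```

With only a handful of classes, the smallest attainable two-sided p-value is 2 / 2^n, which is worth keeping in mind when interpreting a threshold such as 0.05.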
In [34], the authors introduced three training- and test-time data augmentation techniques. VHIS with data augmentation consistently achieved better performance than VHIS without data augmentation across the three benchmark datasets, with an improvement of 0.58-14.54% in OA/AA. However, these pixel-wise augmentation methods are not applicable to 3D networks. Inspired by [34], we applied a simple data augmentation strategy consisting of flipping and rotation. As shown in Table 11, the performance of TAP-Net with data augmentation is significantly better than that without data augmentation. Taking the Salinas Valley dataset as an example, the OA, AA, and Kappa are 90.31%, 93.18%, and 0.881, respectively, with data augmentation, whereas the corresponding values are 85.57%, 89.04%, and 0.850 without data augmentation. However, developing more sophisticated data augmentation methods and evaluating the differences between them is outside the scope of this paper, and we will work on this topic in future research.
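A flip-and-rotate augmentation of the kind mentioned above can be sketched as follows. The exact set of transforms used in the experiments is not specified, so this example assumes the eight combinations of 90-degree rotations and a horizontal flip. Patches are represented as nested lists indexed as [row][column], where each entry may be a label or a per-pixel spectrum. All names are hypothetical.

```python
def rotate90(grid):
    """Rotate the spatial grid 90 degrees clockwise; pixel entries are untouched."""
    return [list(row) for row in zip(*grid[::-1])]


def flip_horizontal(grid):
    """Mirror the spatial grid left to right."""
    return [row[::-1] for row in grid]


def augment(grid):
    """Return 8 variants: each of the 4 rotations, with and without a flip.

    The copies are shallow: per-pixel spectra are shared between variants.
    """
    variants = []
    current = [list(row) for row in grid]
    for _ in range(4):
        variants.append(current)
        variants.append(flip_horizontal(current))
        current = rotate90(current)
    return variants


if __name__ == "__main__":
    # A 2 x 2 patch with 2 spectral bands per pixel, plus its label map.
    patch = [[[1, 1], [2, 2]],
             [[3, 3], [4, 4]]]
    labels = [[0, 1],
              [1, 1]]
    # Calling augment on both keeps every patch aligned with its label map,
    # because the variants are produced in the same fixed order.
    for aug_patch, aug_labels in zip(augment(patch), augment(labels)):
        print(aug_labels, aug_patch)
```

Because only the spatial axes are permuted, the spectral signature of every pixel is preserved, which keeps such transforms compatible with networks that take 3D patches as input.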

Conclusions
Hyperspectral image classification, which aims to assign a unique label to each pixel of an HSI, is a critical step in HSI analysis. Although CNN-based models have exhibited promising performance, most CNN-based HSI classification methods have potential training-test information leakage, leading to overoptimistic results. In this study, we proposed a triple-attention-based parallel network and a novel data partitioning strategy for pixel-wise classification. First, we introduced a parallel network that utilizes parallel subnetworks with the same spatial resolution and repeatedly reuses high-level feature maps of preceding subnetworks to refine the segmentation map. Subsequently, to further improve the performance of the classifier, we proposed the triple-attention module to strengthen useful information and weaken meaningless information. Furthermore, we introduced a novel data partitioning method to serve as a standard for future research in this field. It provides balanced training and test sets without information leakage, and is suitable for practical HSI annotations. Ablation studies regarding the attention mechanism, the parallel network, and the data split strategy demonstrate the effectiveness of these modules across three benchmark datasets. The CSSA and ParallelNet modules used in TAP-Net can be used as separate modules in other algorithms. Considering the effectiveness of data augmentation, more sophisticated ways to augment HSI data are highly desirable, and we will focus on this in future work.