Article

Enhanced TabNet: Attentive Interpretable Tabular Learning for Hyperspectral Image Classification

1 Department of Electrical and Computer Engineering, Mississippi State University, Starkville, MS 39762, USA
2 Cotiviti Inc., South Jordan, UT 84095, USA
* Author to whom correspondence should be addressed.
Remote Sens. 2022, 14(3), 716; https://doi.org/10.3390/rs14030716
Submission received: 5 December 2021 / Revised: 27 January 2022 / Accepted: 29 January 2022 / Published: 3 February 2022

Abstract

Tree-based methods and deep neural networks (DNNs) have drawn much attention in image classification. The interpretable canonical deep tabular data learning architecture (TabNet), which combines the concepts of tree-based techniques and DNNs, can be used for hyperspectral image classification. Sequential attention is used in such an architecture to choose appropriate salient features at each decision step, which enables interpretability and efficient learning with increased learning capacity. In this paper, TabNet with spatial attention (TabNets) is proposed to include spatial information, in which a 2D convolutional neural network (CNN) is incorporated inside an attentive transformer for spatial soft feature selection. In addition, spatial information is exploited by feature extraction in a preprocessing stage, where an adaptive texture smoothing method is used to construct a structure profile (SP), and the extracted SP is fed into TabNet (sTabNet) to further enhance performance. Moreover, the performance of TabNet-class approaches can be improved by introducing unsupervised pretraining. The overall accuracy of the unsupervised pretrained version of the proposed TabNets, i.e., uTabNets, is improved by 11.29% to 12.61%, 3.6% to 7.67%, and 5.97% to 8.01% in comparison with other classification techniques, at the cost of increases in computational complexity by factors of 1.96 to 2.52, 2.03 to 3.45, and 2.67 to 5.52, respectively. Experimental results obtained on different hyperspectral datasets demonstrate the superiority of the proposed approaches in comparison with other state-of-the-art techniques, including DNNs and decision tree variants.

1. Introduction

Hyperspectral imagery (HSI) contains abundant spatial and spectral information in a 3D data cube with hundreds of narrow spectral bands. Due to its high spectral resolution, it has been used in many applications, such as pollution monitoring, urban planning, and land use and land cover analysis [1,2,3,4]. However, the increase in spatial and spectral information poses a challenge for HSI analysis. Thus, HSI analysis tasks, such as classification, dimensionality reduction [1,5], and feature extraction [6,7], have received much attention in the remote sensing community for decades [8]. Moreover, such approaches can be applicable to vision technology applications in other engineering domains [9,10,11], multispectral remote sensing, and synthetic aperture radar (SAR) imagery [12,13].
In recent decades, spectral-based classification approaches such as the support vector machine (SVM) and composite kernel SVM (SVM-CK) have been widely used in remote sensing [14,15,16]. In addition, different spatial-spectral features have been introduced for HSI classification [17,18]. Sparse representation (SR) was successfully applied to HSI classification in [19], inspired by the successful application of sparse representation in face recognition [20]. Consequently, many sparse and collaborative representation-based classifiers have been introduced, such as the joint sparse representation classifier (JSRC) [21], the joint version of spatial-aware collaborative-competition preserving graph embedding with Tikhonov regularization (JSaCCPGT) [22], the nonlocal weighted JSRC (NLW-JSRC) [23], and the correntropy-based robust JSRC (RJSRC) [24]. Furthermore, multiple morphological operations were utilized in [25] for constructing spatial-spectral features of HSI, and a spatial-spectral classifier was proposed in [26] to address the issue of mixed pixel characterization. Multiple kernel learning has also been designed to improve the SVM classifier [27].
Moreover, tree-based techniques, such as the random forest method, were introduced in [28]. More recently, enhanced performance of the random forest classifier for HSI classification was presented in [29,30]. Similarly, the performance of extreme gradient boosting (XGBoost) for HSI was investigated in [31,32]. Tree-based approaches have the advantages of efficiently representing decision manifolds with approximate hyperplane boundaries, being interpretable by tracking the decision nodes, and being fast to train. A deep neural network (DNN) based on multiscale spectral-spatial fusion was proposed for HSI classification in [33,34]. However, classification performance decreases with a deeper network because the input of such an architecture is one-dimensional and lacks neighborhood information in the spatial dimension. Moreover, a deep network based on stacked convolutional layers or multilayer perceptrons (MLPs) fails to find an optimal solution for decision manifolds in the spectral domain due to the lack of an appropriate inductive bias [35]. In addition, convolutional neural networks (CNNs) have drawn much attention in image classification [36], and a patch-to-patch CNN was presented in [37] to obtain better performance than existing techniques. However, CNNs have the shortcoming of not considering spectral information effectively.
When a DNN is used for large datasets, classification performance can be improved because it enables gradient descent-based end-to-end learning. Tree learning cannot use backpropagation on its inputs for guidance from error signals [38], thus limiting its performance for large datasets. TabNet, a new canonical deep neural architecture for tabular data, was proposed in [39,40]. It combines the valuable benefits of tree-based methods with DNN-based methods to obtain high performance and interpretability; in this way, the high performance of DNNs can be paired with the interpretability typical of tree-based methods. Inspired by this work, we propose to use TabNet for HSI classification in this paper, as the spectral signatures of pixels in HSI can be organized as a tabular dataset. One of the aims of this paper is to overcome the deficiencies of existing neural networks and decision trees in HSI classification. In this regard, we explore TabNet and modify its original architecture for HSI. The original TabNet takes raw data without any feature processing and is trained with a gradient descent-based method. Moreover, at each decision step, it uses sequential attention. It enables local interpretability, which determines the combination and importance of input features, and global interpretability, which measures the contribution of each input feature to the trained model. However, sequential attention-based TabNet has some drawbacks as well. Although TabNet can provide good performance in analyzing the spectral signatures of HSI, it lacks proper use of local contextual information in the spatial domain. For this reason, we have modified the original architecture of TabNet by incorporating spatial information in the attentive transformer, yielding TabNet with spatial attention (TabNets). Specifically, a 2D convolutional neural network (CNN) is used in the attentive transformer to spatially process the masks that contribute to the soft selection of abstract features. TabNets can overcome the deficiency of CNNs by considering spatial information with sequential attention.
Recently, different integrated networks, such as stacked autoencoders (SAE) and convolutional autoencoders (CAE), were presented in [30,41] for feature extraction. However, such methods lack a powerful capability for joint feature extraction in the spatial and spectral domains.
In this work, we observed enhanced performance of unsupervised pretraining on TabNet (uTabNet) for HSI classification, and pretraining was extended to TabNets, resulting in uTabNets. The unsupervised pretrained version of TabNets, i.e., uTabNets, can consider sequential attention in addition to spatial processing of masks by using 2D CNN in the attentive transformer.
Moreover, the existing TabNet does not include any preprocessing stage, which weakens its learning ability. Indeed, including spatial information in a spectral classifier has been shown to increase classification accuracy. Many deep learning classifiers, such as recurrent neural networks (RNN) [42] and generative adversarial networks (GAN) [43], use a CNN for deep feature extraction with several convolutional and pooling layers [44,45]. However, most deep learning methods need massive amounts of training data to learn parameters accurately. To deal with such issues, various classification frameworks, such as active learning [46] and ensemble learning [47], have been introduced. In addition, spatial optimization using a structure profile (SP) was introduced in [48] for feature extraction purposes. In this paper, we incorporate the SP into TabNet, resulting in TabNet with structure profile (sTabNet). Similarly, SP is used in the extended versions of TabNet, including uTabNet with SP (suTabNet), TabNets with SP (sTabNets), and uTabNets with SP (suTabNets).
The main contributions of this work can be summarized as follows:
  • It introduces TabNet for HSI classification and improves classification performance by applying unsupervised pretraining in uTabNet;
  • It develops TabNets and uTabNets after including spatial information in the attentive transformer;
  • It includes SP as a feature extraction step in sTabNet, and extends it to the other SP versions of TabNet, i.e., suTabNet, sTabNets, and suTabNets, to further improve classification performance.
The remainder of this article is organized as follows. Section 2 presents related work. Section 3 discusses the proposed TabNet variants for hyperspectral image classification. Section 4 presents experimental results along with a discussion. Section 5 concludes the article.

2. Related Work

Features should be picked wisely for meaningful prediction in machine learning. Global feature selection methods select appropriate features based on the entire training dataset. Forward selection and LASSO regularization are widely used global feature selection techniques [49]. Forward selection iteratively adds the most appropriate feature at each step, and LASSO regularization can assign zero weights to irrelevant features in a linear model. As stated in [50], instance-wise feature selection can be used to select individual features for each input, with an explainer model maximizing the mutual information between the response variable and the selected features. Moreover, the actor-critic framework can be used to mimic a baseline by optimizing the feature selection [51]; in this framework, a reward is generated by the predicting network for the selecting network. In contrast, TabNet performs soft feature selection by controlling sparsity, carries out feature selection and output mapping in a single architecture, and can provide better feature representations to enhance performance.
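To make the contrast concrete, the following is a small, self-contained illustration (not from the paper) of global feature selection with LASSO using scikit-learn; the toy data, regularization strength, and variable names are our own assumptions. Bands whose coefficients are driven exactly to zero are discarded for every sample alike, whereas TabNet selects features per instance.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy "spectral" data: 200 pixels with 50 bands; the response depends on bands 3 and 7 only.
rng = np.random.default_rng(0)
X = rng.random((200, 50))
y = 2.0 * X[:, 3] - X[:, 7] + 0.01 * rng.standard_normal(200)

# LASSO assigns exactly-zero weights to irrelevant bands (global feature selection).
coef = Lasso(alpha=0.01).fit(X, y).coef_
selected_bands = np.flatnonzero(coef)
print(selected_bands)  # the same band subset is applied to every pixel
```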

2.1. Tree Based Learning

Tree-based methods are well suited for tabular data learning, as they can provide statistical information gain by picking global features [52]. Ensembling can be used to enhance the performance of tree-based models; for example, random forests (RF) grow many trees on random subsets of the data with randomly selected features [28,30]. Furthermore, CatBoost [53], XGBoost [31,32], and LightGBM [54] are recent ensemble decision tree approaches that can provide better classification performance. Deep learning, when equipped with a suitable feature selection capability, can provide better performance than such tree-based techniques.

2.2. Attentive Interpretable Tabular Learning (TabNet)

TabNet is based on tree-like functionality, as it forms a linear combination of features by determining the coefficients that control the contribution of each feature to the decision process. It uses sparse instance-wise feature selection that can be learned from the training dataset, and it constructs a sequential multi-step architecture such that a portion of the decision is determined at each decision step using the selected features. Furthermore, the features are processed nonlinearly. In advanced tasks, such as HSI classification or anomaly detection, intrinsic spectral features need to be considered in detail to avoid the problems of non-identical spectra from the same materials or similar spectra from different materials [55]. Conventional DNNs, such as the multilayer perceptron (MLP) or stacked convolutional layers, lack a proper mechanism for soft feature selection. TabNet is therefore attractive in comparison with conventional DNN-based approaches because it has a powerful soft feature selection capability and controls sparsity with sequential attention.

3. Proposed Method

The different variants of enhanced TabNet classifiers proposed in this work are summarized in Table 1.

3.1. TabNet for Hyperspectral Image Classification

Suppose that a hyperspectral dataset with $d$ spectral bands contains $M$ labeled samples for $C$ classes, represented by $X = \{x_1, x_2, \ldots, x_M\} \in \mathbb{R}^{M \times d}$ with the corresponding label vector $Y = \{y_1, y_2, \ldots, y_C\} \in \mathbb{R}^{M \times C}$. As shown in Figure 1, spectral features are used as inputs to TabNet. Suppose the training data $X$ is passed to the initial decision step with batch size $B$. Then, the feature selection process includes the following steps:
(1)
The “split” module separates the output of the initial feature transformer to obtain the features $a[i-1]$ in Step 1 when $i = 1$;
(2)
If we disregard the spatial information in the attentive transformer of TabNets shown in Figure 4 below, it becomes the attentive transformer for TabNet. It uses a trainable function $h_i$, consisting of a fully connected (FC) layer and a batch normalization (BN) layer, to generate high-dimensional features;
(3)
In each step, interpretable information is provided by masks for selecting features, and global interpretability can be attained by aggregating the masks from different decision steps. This process can enhance the discriminative ability in the spectral domain by implementing local and global interpretability for HSI feature selection.
The attentive transformer then generates masks $M[i] \in \mathbb{R}^{B \times d}$ as a soft selection of salient features with the use of the processed features $a[i-1]$ from the previous step as:
$$M[i] = \mathrm{entmax}\big(P[i-1] \cdot h_i(a[i-1])\big) \qquad (1)$$
Entmax normalization [56] inherits the desirable sparsity of sparsemax and provides a smoother, differentiable curvature, whereas sparsemax is piecewise linear and would be written as $\mathrm{sparsemax}(P[i-1] \cdot h_i(a[i-1]))$. Here, $P[i]$ is the prior scale term that denotes how much a particular feature has been used previously:
$$P[i-1] = \prod_{j=1}^{i-1} \big(\gamma - M[j]\big) \qquad (2)$$
where $\gamma$ is a relaxation parameter such that a feature is used at only one decision step when $\gamma = 1$, and features can be used in multiple decision steps as $\gamma$ increases. For the input attention $z = P[i-1] \cdot h_i(a[i-1])$, its sparsemax output can be estimated as:
$$\mathrm{sparsemax}(z) = \arg\min_{p \in \Delta^D} \|p - z\|^2 \qquad (3)$$
where $\Delta^D$ represents the probability simplex and $\mathrm{sparsemax}(z)$ assigns zero probability to choices with low scores.
However, entmax normalization provides a continuous probability distribution, estimating better distributions than sparsemax normalization, and can be stated as:
$$\mathrm{entmax}(z) = \arg\max_{p \in \Delta^D} \; p^{T} z + F_{\upsilon}^{T}(p) \qquad (4)$$
where $F_{\upsilon}^{T}(p)$ is a continuous function defined as $F_{\upsilon}^{T}(p) = \frac{1}{\upsilon(\upsilon - 1)} \sum_{n} \big(p_n - p_n^{\upsilon}\big)$ for $\upsilon \neq 1$ and $F_{\upsilon}^{T}(p) = -\sum_{n} p_n \log p_n$ for $\upsilon = 1$;
(4)
The sparsity regularization term can be used in the form of entropy [57] for controlling the sparsity of the selected features:
$$L_{sparse} = \sum_{i=1}^{N_{steps}} \sum_{b=1}^{B} \sum_{j=1}^{d} \frac{-M_{b,j}[i]}{N_{steps} \cdot B} \log\big(M_{b,j}[i] + \epsilon\big) \qquad (5)$$
where $\epsilon$ takes a small value for numerical stability. The sparsity regularization coefficient $\lambda_{sparse}$ is also added to the overall loss as $\lambda_{sparse} \times L_{sparse}$, which can provide a favorable bias for convergence to high accuracy on datasets with redundant features (a minimal code sketch of this mask generation and regularizer follows the list below);
(5)
A sequential multi-step decision process with $N_{steps}$ steps is used in TabNet's encoding. The processed information from the $(i-1)$-th step is passed to the $i$-th step to decide which features to use. The outputs are obtained by aggregating the processed feature representations in the overall decision function, as shown by the feature attributes in Figure 1.
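The following is a minimal PyTorch sketch of this mask-generation step and the sparsity regularizer, written for illustration rather than taken from the authors' code. Sparsemax (Equation (3)) is used in place of entmax for brevity, and the names `AttentiveTransformer`, `n_a`, `n_features`, and `gamma` are our own.

```python
import torch
import torch.nn as nn

def sparsemax(z: torch.Tensor) -> torch.Tensor:
    """Row-wise Euclidean projection onto the probability simplex (Equation (3))."""
    z_sorted, _ = torch.sort(z, dim=-1, descending=True)
    k = torch.arange(1, z.size(-1) + 1, device=z.device, dtype=z.dtype)
    cumsum = z_sorted.cumsum(dim=-1)
    support = 1.0 + k * z_sorted > cumsum          # candidate support set
    k_z = support.sum(dim=-1, keepdim=True)        # number of nonzero outputs per row
    tau = (cumsum.gather(-1, k_z - 1) - 1.0) / k_z
    return torch.clamp(z - tau, min=0.0)

class AttentiveTransformer(nn.Module):
    """Mask generator M[i] = sparsemax(P[i-1] * h_i(a[i-1])) with prior update (Eqs. (1)-(2))."""
    def __init__(self, n_a: int, n_features: int, gamma: float = 1.5):
        super().__init__()
        self.fc = nn.Linear(n_a, n_features, bias=False)   # h_i: FC ...
        self.bn = nn.BatchNorm1d(n_features)                # ... followed by BN
        self.gamma = gamma

    def forward(self, a_prev: torch.Tensor, prior: torch.Tensor):
        mask = sparsemax(prior * self.bn(self.fc(a_prev)))  # soft feature selection
        prior = prior * (self.gamma - mask)                 # discount features already used
        return mask, prior

def sparsity_loss(masks, eps: float = 1e-5) -> torch.Tensor:
    """Entropy-style regularizer of Equation (5), averaged over decision steps and batch."""
    return sum((-m * torch.log(m + eps)).sum(dim=-1).mean() for m in masks) / len(masks)

# Shape check: batch of 8 pixels, 200 spectral features, attention width 64.
att = AttentiveTransformer(n_a=64, n_features=200)
m, p = att(torch.randn(8, 64), torch.ones(8, 200))
print(m.shape, sparsity_loss([m]).item())
```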
With the masks $M[i]$ obtained from the attentive transformer, the following steps are used for feature processing.
(1)
The feature transformer in Figure 2 is used to process the filtered features, which contribute to the decision step output and to the information for subsequent steps:
$$[d[i],\, a[i]] = f_i\big(M[i] \odot X\big) \qquad (6)$$
where $f_i(\cdot)$ denotes the feature transformer at the $i$-th step, the $[\cdot,\cdot]$ operator denotes the splitting of $d[i] \in \mathbb{R}^{B \times N_d}$ and $a[i] \in \mathbb{R}^{B \times N_a}$, $N_d$ is the width of the prediction layer for the decision, and $N_a$ is the width of the attention layer for the masks;
(2)
For efficient learning with high capacity, the feature transformer comprises layers that are shared across decision steps, so that the same features can be input to different decision steps, and decision step-dependent layers, in which the features of the current decision step depend upon the output from the previous decision step;
(3)
In Figure 2, it can be observed that the feature transformer consists of the concatenation of two shared layers and two decision step-dependent layers, in which each fully connected (FC) layer is followed by batch normalization (BN) and a gated linear unit (GLU) [58]. Normalization with $\sqrt{0.5}$ is also used to ensure stabilized learning throughout the network [59];
(4)
All BN operations, except the one applied to the input features, are implemented as ghost BN [60] by selecting only part of the samples rather than the entire batch at one time to reduce the computational cost. This improves performance by using a virtual (small) batch size $B_v$ and momentum $m_B$ instead of the entire batch. Moreover, decision tree-like aggregation is implemented by constructing the overall decision embedding as:
$$d_{out} = \sum_{i=1}^{N_{steps}} \mathrm{LeakyReLU}\big(d[i]\big) \qquad (7)$$
where $N_{steps}$ represents the number of decision steps;
(5)
The linear mapping $W_{final}\, d_{out}$ is applied for output mapping, and softmax is employed during training for discrete outputs (a minimal sketch of the feature transformer and this aggregation follows the list below).
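For illustration, a minimal PyTorch sketch of the feature transformer blocks and the decision aggregation of Equation (7) is given below. It is a simplified rendering rather than the authors' implementation: the class and argument names are ours, ghost BN is replaced by ordinary BN for brevity, and only one transformer instance is built (the comments indicate which blocks would be shared across decision steps).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUBlock(nn.Module):
    """One FC -> BN -> gated linear unit block, the repeating unit of the feature transformer."""
    def __init__(self, n_in: int, n_out: int):
        super().__init__()
        self.fc = nn.Linear(n_in, 2 * n_out, bias=False)  # GLU halves the width back to n_out
        self.bn = nn.BatchNorm1d(2 * n_out)

    def forward(self, x):
        return F.glu(self.bn(self.fc(x)), dim=-1)

class FeatureTransformer(nn.Module):
    """Two shared and two step-dependent GLU blocks with sqrt(0.5)-scaled residual connections."""
    def __init__(self, n_features: int, n_d: int = 256, n_a: int = 256):
        super().__init__()
        width = n_d + n_a
        self.blocks = nn.ModuleList([
            GLUBlock(n_features, width),   # shared across decision steps
            GLUBlock(width, width),        # shared across decision steps
            GLUBlock(width, width),        # decision step-dependent
            GLUBlock(width, width),        # decision step-dependent
        ])

    def forward(self, x):
        scale = 0.5 ** 0.5
        x = self.blocks[0](x)
        for blk in self.blocks[1:]:
            x = (x + blk(x)) * scale       # normalization with sqrt(0.5) for stable learning
        return x

# One decision step on a batch of 4 flattened inputs (e.g., 6250-dim TabNets patches):
ft = FeatureTransformer(n_features=6250)
out = ft(torch.randn(4, 6250))
d, a = out[:, :256], out[:, 256:]          # "split" into decision and attention parts
d_out_step = F.leaky_relu(d)               # one term of d_out in Equation (7)
logits = nn.Linear(256, 16)(d_out_step)    # W_final mapping to 16 classes
print(logits.shape)
```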

3.2. TabNet with Unsupervised Pretraining

To include unsupervised pretraining in TabNet (uTabNet), a decoder architecture is incorporated [39,40]. As shown in Figure 3, the decoder is composed of a feature transformer and FC layers at each decision step to reconstruct features by combining the outputs. Missing feature columns can be predicted using the other feature columns. Suppose $S \in \{0,1\}^{B \times d}$ is a binary mask and $r$ is the pretraining ratio of features to randomly discard for reconstruction, such that $r$ represents the ratio of masking inside the binary mask $S$. The prior term in the encoder is initialized as $P[0] = (1 - S)$ so that the model focuses on the known features, and the output of the last FC layer of the decoder is multiplied by $S$ so that only the unknown features are reconstructed. For this purpose, the reconstruction residual $L_{rec}$, used in an unsupervised manner without label information, is formed as:
$$L_{rec} = \sum_{i=1}^{B} \sum_{j=1}^{d} \left\| \frac{\big(\hat{X}_{i,j} - X_{i,j}\big)\, S_{i,j}}{\sqrt{\sum_{i=1}^{B} \big(X_{i,j} - \tfrac{1}{B} \sum_{i=1}^{B} X_{i,j}\big)^{2}}} \right\|_{2}^{2} \qquad (8)$$
where $\hat{X}_{i,j}$ represents the reconstructed output and $X_{i,j}$ denotes the original input.
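As a concrete reference, the following is a small PyTorch sketch (ours, not the authors') of how the reconstruction residual in Equation (8) can be computed; the masking scheme in the toy check and the variable names are assumptions for illustration.

```python
import torch

def reconstruction_loss(x_hat: torch.Tensor, x: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    """Self-supervised reconstruction residual of Equation (8).

    x_hat : reconstructed features, shape (B, d)
    x     : original features,      shape (B, d)
    s     : binary mask (1 = feature was hidden and must be reconstructed)
    """
    # Per-feature normalization by the feature's spread over the batch.
    denom = torch.sqrt(((x - x.mean(dim=0, keepdim=True)) ** 2).sum(dim=0)) + 1e-12
    return ((((x_hat - x) * s) / denom) ** 2).sum()

# Toy check: 8 pixels, 200 bands, roughly half of the entries masked for pretraining.
x = torch.randn(8, 200)
s = (torch.rand(8, 200) < 0.5).float()
print(reconstruction_loss(torch.randn_like(x), x, s).item())
```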

3.3. TabNet with Spatial Attention (TabNets)

The generated masks $M[i]$ in Equation (1) are used in Equation (2) to update the prior $P[i]$ in the attentive transformer for soft feature selection. Spatial information is incorporated by including a 2D CNN inside the attentive transformer, resulting in TabNet with spatial attention (TabNets), as shown in Figure 4. The output feature maps of each layer in TabNets are shown in Table 2.
In a 2D CNN, the kernel is convolved with the input data by computing the sum of the element-wise products of the kernel and the input. To cover the whole spatial area, the kernel is strided over the input data, and nonlinearity is introduced with an activation function applied to the convolved features. The activation value $A_{k,l}^{u,v}$ at spatial position $(u, v)$ for the $k$-th layer and the $l$-th feature map can be expressed as:
$$A_{k,l}^{u,v} = \psi\left(e_{k,l} + \sum_{\delta=1}^{o_{m-1}} \sum_{\theta=-\tau}^{\tau} \sum_{\beta=-\Phi}^{\Phi} f_{k,l,\delta}^{\beta,\theta} \times A_{k-1,\delta}^{u+\beta,\, v+\theta}\right) \qquad (9)$$
where $\psi$ represents the activation function and $e_{k,l}$ is the bias parameter. $o_{m-1}$ denotes the number of feature maps in the $(m-1)$-th layer, i.e., the depth of the kernel $f_{k,l}$ at the $k$-th layer for the $l$-th feature map. $2\tau + 1$ represents the width of the kernel and $2\Phi + 1$ denotes its height, with weight parameters $f_{k,l}$.
First of all, the 3D patch input of size $T \times P \times P$, for $T$ channels reduced by principal component analysis (PCA) and patch size $P \times P$, is converted to a 1D input vector. For instance, in the Indian Pines data, the 3D input of size 10 × 25 × 25 becomes a 6250 × 1 vector. The feature size from each layer in the encoder is shown in the second part of Table 2:
(1)
The first BN generates a 6250 × 1 vector;
(2)
It is converted by the first feature transformer layer before Step 1 into a feature vector of size $N_d + N_a = 512$;
(3)
The Split layer divides it into two parts and provides a feature of size $N_a = 256$ for the attentive transformer;
(4)
The Attentive transformer layer generates output masks for the 6250 × 1 feature;
(5)
The Mask layer in Step 1 generates the multiplicative output $M[i] \odot X$, which is passed to the feature transformer layer with the 6250 × 1 feature;
(6)
The feature transformer generates the feature of size $N_d + N_a = 512$, which is separated into two parts: $N_d = 256$ for the LeakyReLU and $N_a = 256$ for the attentive transformer in Step 1;
(7)
The output of each decision step is then concatenated in the TabNets encoder and converted to a feature map with 16 classes by the FC layer.
For the spatial attention inside the attentive transformer, the feature maps of the different layers are shown in the first part of Table 2; a code sketch of this branch follows the list below.
(1)
The output of entmax from Equation (1) is reshaped to 10 × 25 × 25 as input to the first 2D convolution layer. With a kernel size of 3 × 3 and stride = 3, the first 2D convolution layer provides a 16 × 8 × 8 output;
(2)
The second convolution layer generates an output of size 32 × 6 × 6 with a kernel size of 3 × 3 and stride = 1;
(3)
The third convolutional layer generates an output shape of 64 × 4 × 4 with a kernel size of 3 × 3 and stride = 1;
(4)
The flatten layer provides an output of size 1024 × 1;
(5)
Finally, the FC layer generates an output of size 6250 × 1 that is provided as input to the prior scales for updating the abstract features generated by the FC and BN layers inside the attentive transformer.
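A minimal PyTorch sketch of this spatial branch is shown below; it reproduces the layer shapes of Table 2 for a 10 × 25 × 25 Indian Pines patch, but the ReLU activations, the class name, and the hard-coded flatten size (specific to this patch size) are our assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """2D CNN branch inside the TabNets attentive transformer (shapes as in Table 2)."""
    def __init__(self, channels: int = 10, patch: int = 25, n_features: int = 6250):
        super().__init__()
        self.channels, self.patch = channels, patch
        self.net = nn.Sequential(
            nn.Conv2d(channels, 16, kernel_size=3, stride=3), nn.ReLU(),  # -> 16 x 8 x 8
            nn.Conv2d(16, 32, kernel_size=3, stride=1), nn.ReLU(),        # -> 32 x 6 x 6
            nn.Conv2d(32, 64, kernel_size=3, stride=1), nn.ReLU(),        # -> 64 x 4 x 4
            nn.Flatten(),                                                  # -> 1024
            nn.Linear(64 * 4 * 4, n_features),                             # -> 6250
        )

    def forward(self, mask_flat: torch.Tensor) -> torch.Tensor:
        # Reshape the entmax output back to a (T, P, P) cube before spatial processing.
        x = mask_flat.view(-1, self.channels, self.patch, self.patch)
        return self.net(x)

# Shape check for a batch of 4 flattened masks of length 10 * 25 * 25 = 6250.
sa = SpatialAttention()
print(sa(torch.randn(4, 6250)).shape)   # torch.Size([4, 6250])
```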
In addition, TabNets with unsupervised pretraining (uTabNets) can be obtained by using steps of unsupervised pretraining and Equation (8) on TabNets.

3.4. Structure Profile on TabNet (sTabNet)

By using spatial feature extraction with a structure profile (SP) [48] in the preprocessing stage, the performance of TabNet can be further enhanced; the resulting classifier is TabNet with structure profile (sTabNet).
Spatial feature extraction with structure profile:
First of all, the original input image is divided into $M$ subsets. The structure profile $S$ can be extracted from the input image $X$ using an adaptive texture smoothing model as:
$$\arg\min_{S} \; \|S - X\|_{2,w}^{2} + \lambda \|S\|_{TV} \qquad (10)$$
where $\lambda$ is a free parameter and $w$ is the weight that controls the similarity of adjacent pixels. For smoothing purposes, a local polynomial $p = \sum_{l=1}^{m} c_l p_l$ of degree $L$ can be used, where the basis is drawn from the set $\mathcal{L}$ defined below and $m$ is the number of elements in $\mathcal{L}$. For $N$ pixels in $\Omega(x)$, assume $\Omega(x) = \{x_1, x_2, \ldots, x_N\}$ is a set of points around $x$ in $X$. The structure profile is obtained by setting $S(x) := p(x)$ for each $x \in \Omega$ with the optimization function:
$$\arg\min_{p \in \mathcal{L}} \left\{ \sum_{i=1}^{N} \|p(x_i) - X(x_i)\|_{2}^{2}\, w(x, x_i) + \lambda \|p(x_i)\|_{TV} \right\} \qquad (11)$$
where $\mathcal{L} = \{x^{\alpha} : x \in \mathbb{R}^{2}, \alpha \in \mathbb{Z}^{+}, 1 \leq |\alpha| \leq L\}$ is the polynomial basis of degree $L$, and $w$ decides the contribution of the pixels $X(x_i)$ towards the construction of the polynomial $p(x_i)$, such that
$$w(x_i, x) = \exp\left(-\sum_{y \in Y(x_i)} \frac{\|X(x_i + y) - X(x + y)\|_{2}^{2}\; G_{\sigma}(\|y\|)}{h_{0}^{2}}\right) \qquad (12)$$
where $Y(\cdot)$ is the small region used for comparing patches around $x_i$ and $x$, the scale parameter $h_0$ is set to 1, and $G_{\sigma}$ is the Gaussian function with standard deviation $\sigma$. Equation (10) can now be expressed as:
$$\arg\min_{p \in \mathcal{L}} \left\{ \sum_{i=1}^{N} \|p(x_i) - X(x_i)\|_{2}^{2}\, w(x, x_i) + \lambda \|p(x_i)\|_{1} \right\} \qquad (13)$$
Using the Bregman iteration algorithm [61], Equation (13) can be solved with the following updates.
Update $p^{k+1}(x_i)$:
$$p^{k+1}(x_i) = \arg\min_{p \in \mathcal{L}} \sum_{i=1}^{N} \|p(x_i) - X(x_i)\|_{2}^{2}\, w(x, x_i) + \lambda \|d^{k}(x_i) - p(x_i) - b^{k}(x_i)\|_{2}^{2}\, w(x, x_i) \qquad (14)$$
Update $d^{k+1}(x_i)$:
$$d^{k+1}(x_i) = \arg\min_{d} \; |d(x_i)|_{1} + \lambda \|d(x_i) - p^{k+1}(x_i) - b^{k}(x_i)\|_{2}^{2} \qquad (15)$$
The soft thresholding method can be used:
$$d^{k+1}(x_i) = \mathrm{soft}\big(p^{k+1}(x_i) + b^{k}(x_i),\; 1/\lambda\big) \qquad (16)$$
Update $b^{k+1}(x_i)$:
$$b^{k+1}(x_i) = b^{k}(x_i) + p^{k+1}(x_i) - d^{k+1}(x_i) \qquad (17)$$
These steps of updating $p^{k+1}(x_i)$, $d^{k+1}(x_i)$, and $b^{k+1}(x_i)$ are repeated until convergence is attained; a simplified code sketch of this iteration is given below.
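For orientation, the following is a schematic NumPy sketch (ours) of the soft-thresholding operator of Equation (16) and the Bregman bookkeeping of Equations (14), (16), and (17); the weighted local-polynomial fit of Equation (14) is deliberately replaced by a simple quadratic proximal step so that only the update structure is illustrated, not the full SP extraction.

```python
import numpy as np

def soft(v: np.ndarray, t: float) -> np.ndarray:
    """Soft-thresholding operator of Equation (16): sign(v) * max(|v| - t, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def bregman_smooth(x: np.ndarray, lam: float = 0.1, n_iter: int = 50) -> np.ndarray:
    """Schematic split-Bregman loop mirroring the p/d/b updates of Equations (14)-(17)."""
    p = x.copy()
    d = np.zeros_like(x)
    b = np.zeros_like(x)
    for _ in range(n_iter):
        # p-update (cf. Eq. (14)): least-squares compromise between the data and d - b
        p = (x + lam * (d - b)) / (1.0 + lam)
        # d-update (Eq. (16)): soft thresholding of p + b
        d = soft(p + b, 1.0 / lam)
        # Bregman variable update (Eq. (17))
        b = b + p - d
    return p

# Smoothing a noisy 1D "spectral" signal as a stand-in for the image-domain problem.
signal = np.sin(np.linspace(0, 3, 128)) + 0.1 * np.random.randn(128)
print(bregman_smooth(signal).shape)
```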
After obtaining convergence, the aforementioned TabNet classifier is implemented on the extracted SPs to obtain the classification results for sTabNet.

3.5. Structure Profile on Unsupervised Pretrained TabNet (suTabNet)

After applying SP feature extraction before uTabNet, the TabNet with unsupervised pretraining and SP feature extraction (suTabNet) is obtained. Similarly, SP feature extraction can be applied to TabNets and uTabNets to obtain their SP-extracted versions, sTabNets and suTabNets, respectively, and to the other comparative methods for a fair comparison.

4. Experiments

4.1. Datasets

Three different datasets were used to validate the proposed methods.
The first dataset used for the experiments is the Indian Pines dataset collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor. It consists of 16 different classes with a spatial size of 145 × 145 pixels and 220 spectral bands (200 after noisy band removal). The water-absorption bands 104–108, 150–163, and 220 were removed. The spectral wavelength ranges from 0.4 to 2.5 μm. Ten percent of the samples from each class were used for training and the remaining were used for testing. The number of training and testing samples for each class is listed in Table 3.
The second dataset is the University of Pavia dataset, which was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor in Italy. It has a spatial size of 610 × 340 pixels and a total of 103 spectral bands after noisy band removal, covering the range 0.43 to 0.86 μm. Nine different classes exist in this dataset; 200 samples were taken from each class for training and the remaining were used for testing. Table 4 shows the number of training and testing samples for each class.
The third dataset is the Salinas dataset, which was collected with the AVIRIS sensor over Salinas Valley, California. It comprises a spatial size of 512 × 217 pixels with 224 bands (204 bands after band removal). The water-absorption bands 108–112, 154–167, and 224 were removed. It has a spatial resolution of 3.7 m per pixel with 16 different classes. For training, 200 samples from each class were taken and the remaining were used for testing. Table 5 shows the number of training and testing samples in the different classes.

4.2. Experimental Setup

For all methods in comparison, such as RF, MLP, LightGBM, CatBoost, XGBoost, and CAE, parameters were set according to [28,29,30,31,32,35,41,53,54]. For our proposed methods, the Adam optimizer was used to estimate the optimal parameters. In all three datasets, 10% of the training samples were allocated for validation and the remaining 90% were used to learn the optimal weights of the network while tuning its hyperparameters. The performance of TabNet, uTabNet, TabNets, uTabNets, and their SP-extracted versions sTabNet, suTabNet, sTabNets, and suTabNets was investigated over a predefined set of parameters: $N_d$ and $N_a$ were selected from $\{8, 16, 24, 32, \ldots, 1024\}$, $\gamma \in \{1, 1.5, 2\}$, $\lambda_{sparse} \in \{0, 0.0001, \ldots, 0.1\}$, $B \in \{16, 32, \ldots, 16384\}$, $m_B \in \{0.2, \ldots, 1\}$, $N_{steps} \in \{1, 2, \ldots, 10\}$, and $B_v \in \{16, 32, \ldots, 1024\}$. In all three datasets, $\lambda_{sparse} = 0.01$, $\gamma = 1.5$, $B = 64$, $m_B = 0.6$, $N_{steps} = 5$, and $B_v = 128$ were selected. The proposed TabNets and uTabNets can provide enhanced results in a smaller number of epochs, such as 200 epochs for the Indian Pines data and 500 epochs for the other two datasets. Each experiment was repeated 10 times and the average value is reported to reduce the effect of random variation. The optimal parameters of the proposed methods are listed in Table 6 for all three datasets.
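For readers who wish to reproduce the baseline TabNet configuration, the snippet below shows how these hyperparameters might be passed to the open-source pytorch-tabnet package. This is a hedged sketch: it is not the authors' code, the random arrays are placeholders for the flattened spectral (or SP) features, and the optimizer and epoch settings simply mirror the values stated above.

```python
import numpy as np
import torch
from pytorch_tabnet.tab_model import TabNetClassifier

# Placeholder data standing in for flattened 6250-dim patches and 16 Indian Pines classes.
X_train = np.random.rand(512, 6250).astype(np.float32)
y_train = np.random.randint(0, 16, 512)
X_valid = np.random.rand(64, 6250).astype(np.float32)
y_valid = np.random.randint(0, 16, 64)

clf = TabNetClassifier(
    n_d=256, n_a=256,            # widths of the decision and attention branches
    n_steps=5,                   # number of sequential decision steps
    gamma=1.5,                   # relaxation parameter controlling feature reuse
    lambda_sparse=0.01,          # weight of the sparsity regularizer (Equation (5))
    momentum=0.6,                # (ghost) batch-normalization momentum m_B
    optimizer_fn=torch.optim.Adam,
)
clf.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    max_epochs=200,              # 200 epochs for Indian Pines, 500 for the other datasets
    batch_size=64,
    virtual_batch_size=128,      # ghost BN virtual batch size B_v
)
```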
In addition, varying the window size in the range $\{19 \times 19, 21 \times 21, 23 \times 23, 25 \times 25, 27 \times 27\}$ was investigated to incorporate more spatial information. However, choosing too large a window size may add redundancy due to interclass variation among neighboring pixels. As shown in Table 7, 25 × 25 was found to be the most suitable for all datasets. For the Indian Pines and Salinas data, a 10 × 25 × 25 input was used, and 7 × 25 × 25 was used for the University of Pavia data.
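As a reference for how the T × P × P inputs described above might be produced, the following is our own illustrative sketch (the function name and the reflect padding at the image border are assumptions, not details given by the authors); it yields the flattened vectors that TabNets consumes.

```python
import numpy as np
from sklearn.decomposition import PCA

def extract_patches(cube: np.ndarray, n_components: int = 10, patch: int = 25) -> np.ndarray:
    """PCA to T channels followed by P x P patch extraction around every pixel.

    cube: HSI array of shape (H, W, bands). Returns (H*W, T*P*P) flattened patches,
    e.g., 6250-dimensional vectors when T = 10 and P = 25.
    """
    h, w, b = cube.shape
    reduced = PCA(n_components=n_components).fit_transform(cube.reshape(-1, b))
    reduced = reduced.reshape(h, w, n_components)
    r = patch // 2
    padded = np.pad(reduced, ((r, r), (r, r), (0, 0)), mode="reflect")
    patches = np.empty((h * w, n_components * patch * patch), dtype=np.float32)
    for i in range(h):
        for j in range(w):
            win = padded[i:i + patch, j:j + patch, :]            # P x P x T window
            patches[i * w + j] = win.transpose(2, 0, 1).ravel()  # flattened as T x P x P
    return patches

# Toy cube; for Indian Pines (145 x 145 x 200) this would yield (21025, 6250).
print(extract_patches(np.random.rand(20, 20, 50)).shape)
```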

4.3. Result of Classification

Classification accuracies in terms of overall accuracy (OA), average accuracy (AA), Kappa coefficient, and per-class accuracy are listed in Table 8, Table 9, Table 10, Table 11, Table 12 and Table 13. It can be observed that TabNet shows better classification accuracy than the other methods of RF [28,29,30], MLP [35], LightGBM [54], CatBoost [53], and XGBoost [31,32]. In addition, TabNet with spatial attention (TabNets) and its unsupervised pretrained version (uTabNets) outperform TabNet and its unsupervised version uTabNet in all three datasets. Additionally, uTabNets outperforms the convolutional autoencoder (CAE) [30,41] in all three datasets. Moreover, sTabNet outperforms TabNet and the SP-extracted versions of the other methods, such as sRF, sMLP, sLightGBM, sCatBoost, and sXGBoost. Additionally, SP on TabNets (sTabNets) and its unsupervised pretrained version (suTabNets) outperform TabNet, uTabNet, TabNets, and uTabNets, along with all other SP-extracted versions, in all three datasets.
In Figure 5, Figure 6 and Figure 7, the classification maps of the three datasets are consistent with the results in Table 8, Table 9, Table 10, Table 11, Table 12 and Table 13. Figure 5 shows the classification maps for Indian Pines, with the ground-truth image of the original scene in Figure 5a. In these classification maps of the labeled pixels, sTabNet outperforms TabNet and the SP versions of the other techniques. Furthermore, suTabNet outperforms uTabNet and sTabNet. The proposed TabNets shows less noise in the areas of Soybean-notill and Woods, and uTabNets shows less noise in the region of Woods.
Moreover, their SP-extracted versions sTabNets and suTabNets show less noise in the areas of Soybean-mintill and Woods, respectively. In Figure 6, the classification maps for the University of Pavia are shown. It can be observed that the maps from the proposed TabNets and uTabNets are smoother in the regions of Bare soil and Meadows, respectively. Similarly, their SP-extracted versions sTabNets and suTabNets produce smoother areas of Bare soil and Meadows, respectively. In Figure 7, the classification maps for the different methods on the Salinas dataset are shown. The maps from the proposed TabNets and uTabNets are less noisy in the regions of Corn-senesced-green-weeds and Grapes-untrained. In addition, the maps from their SP-extracted versions sTabNets and suTabNets contain less noise in the areas of Grapes-untrained and Vinyard-untrained.
Figure 8 shows the classification performance of the different methods for varying numbers of training samples in all datasets. For Indian Pines, the training samples per class are varied as {10%, 20%, 30%, 40%, and 50%}. The training samples per class are varied as {100, 200, 300, 400, and 500} in both the University of Pavia and Salinas datasets. It can be observed that the proposed TabNets, uTabNets, sTabNets, and suTabNets outperform all other methods, such as RF, MLP, LightGBM, CatBoost, XGBoost, TabNet, uTabNet, CAE, and their SP versions, for all numbers of training samples in all three datasets.
To evaluate the statistical significance of the OA improvements, McNemar's test [62] is reported in Table 14 for different pairs of methods. Two methods are statistically different if $|z|$, the absolute value of the McNemar test statistic, is larger than 1.96 or 2.58, which represent statistical differences at the 95% and 99% confidence levels, respectively. The comparison among TabNet, uTabNet, TabNets, uTabNets, sTabNet, suTabNet, sTabNets, suTabNets, and the other classifiers indicates their superiority over their counterparts.
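The statistic can be computed directly from the per-pixel agreement of two classifiers; the short sketch below (our code, using the continuity-uncorrected form consistent with [62]) illustrates the computation.

```python
import numpy as np

def mcnemar_z(correct_a: np.ndarray, correct_b: np.ndarray) -> float:
    """McNemar's z statistic for two classifiers evaluated on the same test pixels.

    correct_a, correct_b: boolean arrays marking which samples each classifier got right.
    |z| > 1.96 (or 2.58) indicates a significant difference at the 95% (99%) level.
    """
    f12 = np.sum(correct_a & ~correct_b)   # A correct, B wrong
    f21 = np.sum(~correct_a & correct_b)   # A wrong, B correct
    return float((f12 - f21) / np.sqrt(f12 + f21))

# Usage: mcnemar_z(pred_a == y_test, pred_b == y_test)
```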
To estimate the computational complexity of the proposed algorithms, the execution times of the different algorithms on the three hyperspectral datasets are listed in Table 15. All the experiments were run using an NVIDIA Tesla K80 GPU and MATLAB on an Intel(R) Core(TM) i7-4770 central processing unit with 16 GB of memory.
It can be observed that TabNet has higher computational complexity in comparison to other tree-based methods, which may be due to the sequential attention involved in tabular learning. In addition, the unsupervised pretraining version of TabNet (uTabNet) has higher complexity than TabNet because of the pretraining operation.
Additionally, the proposed TabNets and its unsupervised pretraining version uTabNets show slightly higher complexity than TabNet and uTabNet because of the convolution layers in the attentive transformer for spatial processing of the masks. Moreover, the SP-extracted versions sTabNet, suTabNet, sTabNets, and suTabNets are slightly costlier than their counterparts due to the SP extraction.

5. Conclusions

In this work, we propose the TabNets network, which uses spatial attention to enhance the performance of the original TabNet for HSI classification by including a 2D CNN in the attentive transformer. Moreover, unsupervised pretraining on TabNets (uTabNets) was introduced, which can outperform TabNets. SP-extracted versions of TabNet, uTabNet, TabNets, and uTabNets were also developed to further utilize spatial information. The experimental results obtained on different hyperspectral datasets illustrate the superiority of the proposed TabNets and uTabNets and their SP versions in terms of classification accuracy over other techniques, such as RF, MLP, LightGBM, CatBoost, XGBoost, and their SP versions. However, the proposed networks show slightly higher complexity for network optimization. In future work, more spatial and spectral information will be incorporated into TabNet to enhance the classification performance with reduced computational cost. Moreover, the performance of the enhanced TabNet on hyperspectral anomaly detection will be investigated. This has potential applications for solving similar classification and feature extraction problems for high-resolution thermal or remote sensing images.

Author Contributions

Conceptualization, C.S., Q.D. and Y.X.; methodology, C.S. and Q.D.; writing—original draft, C.S.; writing—review and editing, C.S. and Q.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Acknowledgments

The authors would like to thank the authors of all references used in the paper, the editors, and the anonymous reviewers for their detailed comments and suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Shah, C.; Du, Q. Collaborative and Low-Rank Graph for Discriminant Analysis of Hyperspectral Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 5248–5259. [Google Scholar] [CrossRef]
  2. Shah, C.; Du, Q. Spatial-Aware Probabilistic Collaborative Representation for Hyperspectral Image Classification. In Proceedings of the Image and Signal Processing for Remote Sensing XXVI (Proc. Of SPIE), Edinburgh, UK, 21–25 September 2020. art no 115330Q. [Google Scholar] [CrossRef]
  3. Li, W.; Du, Q. Joint Within-Class Collaborative Representation for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 2200–2208. [Google Scholar] [CrossRef]
  4. Shah, C.; Du, Q. Modified Structure-Aware Collaborative Representation for Hyperspectral Image Classification. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021. [Google Scholar] [CrossRef]
  5. Pan, L.; Li, H.-C.; Deng, Y.-J.; Zhang, F.; Chen, X.-D.; Du, Q. Hyperspectral Dimensionality Reduction by Tensor Sparse and Low-Rank Graph-Based Discriminant Analysis. Remote Sens. 2017, 9, 452. [Google Scholar] [CrossRef] [Green Version]
  6. Li, W.; Wang, Z.; Li, L.; Du, Q. Feature extraction for hyperspectral images using local contain profile. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 5035–5046. [Google Scholar] [CrossRef]
  7. Hong, D.; Wu, X.; Ghamisi, P.; Chanussot, J.; Yokoya, N.; Zhu, X.X. Invariant attribute profiles: A spatial-frequency joint feature extractor for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2020, 58, 3791–3808. [Google Scholar] [CrossRef] [Green Version]
  8. Chang, C.-I. Hyperspectral Data Exploitation: Theory and Applications; Wiley: Hoboken, NJ, USA, 2007. [Google Scholar] [CrossRef]
  9. Chen, M.; Tang, Y.; Zou, X.; Huang, Z.; Zhou, H.; Chen, S. 3D global mapping of large-scale unstructured orchard integrating eye-in-hand stereo vision and SLAM. Comput. Electron. Agric. 2021, 187, 106237. [Google Scholar] [CrossRef]
  10. Wu, F.; Duan, J.; Chen, S.; Ye, Y.; Ai, P.; Yang, Z. Multi-Target Recognition of Bananas and Automatic Positioning for the Inflorescence Axis Cutting Point. Front. Plant Sci. 2021, 12, 705021. [Google Scholar] [CrossRef]
  11. Cao, X.; Yan, H.; Huang, Z. A Multi-Objective Particle Swarm Optimization for Trajectory Planning of Fruit Picking Manipulator. Agronomy 2021, 11, 2286. [Google Scholar] [CrossRef]
  12. Du, P.; Samat, A.; Waske, B.; Liu, S.; Li, Z. Random Forest and rotation forest for fully polarized SAR image classification using polarimetric and spatial features. ISPRS J. Photogramm. Remote Sens. 2015, 105, 38–53. [Google Scholar] [CrossRef]
  13. Samat, A.; Persello, C.; Liu, S.; Li, E.; Miao, Z.; Abuduwaili, J. Classification of VHR multispectral images using extratrees and maximally stable extremal region-guided morphological profile. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 3179–3195. [Google Scholar] [CrossRef]
  14. Melgani, F.; Bruzzone, L. Classification of Hyperspectral Remote Sensing Images with Support Vector Machines. IEEE Trans. Geosci. Remote Sens. 2004, 42, 1778–1790. [Google Scholar] [CrossRef] [Green Version]
  15. Camps-Valls, G.; Gomez-Chova, L.; Munoz-Mari, J.; Vila-Frances, J.; Calpe-Maravilla, J. Composite Kernels for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2006, 3, 93–97. [Google Scholar] [CrossRef]
  16. Li, J.; Bioucas-Dias, J.M.; Plaza, A. Spectral–Spatial Hyperspectral Image Segmentation Using Subspace Multinomial Logistic Regression and Markov Random Fields. IEEE Trans. Geosci. Remote Sens. 2012, 50, 809–823. [Google Scholar] [CrossRef]
  17. Hughes, G. On the Mean Accuracy of Statistical Pattern Recognizers. IEEE Trans. Inf. Theory 1968, 14, 55–63. [Google Scholar] [CrossRef] [Green Version]
  18. Fauvel, M.; Tarabalka, Y.; Benediktsson, J.A.; Chanussot, J.; Tilton, J.C. Advances in Spectral-Spatial Classification of Hyperspectral Images. Proc. IEEE 2013, 101, 652–675. [Google Scholar] [CrossRef] [Green Version]
  19. Cui, M.; Prasad, S. Class-Dependent Sparse Representation Classifier for Robust Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2015, 53, 2683–2695. [Google Scholar] [CrossRef]
  20. Wright, J.; Yang, A.Y.; Ganesh, A.; Sastry, S.S.; Yi, M. Robust Face Recognition via Sparse Representation. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 210–227. [Google Scholar] [CrossRef] [Green Version]
  21. Chen, Y.; Nasrabadi, N.M.; Tran, T.D. Hyperspectral Image Classification Using Dictionary-Based Sparse Representation. IEEE Trans. Geosci. Remote Sens. 2011, 49, 3973–3985. [Google Scholar] [CrossRef]
  22. Shah, C.; Du, Q. Spatial-Aware Collaboration-Competition Preserving Graph Embedding for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
  23. Zhang, H.; Li, J.; Huang, Y.; Zhang, L. A Nonlocal Weighted Joint Sparse Representation Classification Method for Hyperspectral Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 2056–2065. [Google Scholar] [CrossRef]
  24. Peng, J.; Du, Q. Robust Joint Sparse Representation Based on Maximum CORRENTROPY Criterion for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 7152–7164. [Google Scholar] [CrossRef]
  25. Benediktsson, J.A.; Palmason, J.A.; Sveinsson, J.R. Classification of Hyperspectral Data from Urban Areas Based on Extended Morphological Profiles. IEEE Trans. Geosci. Remote Sens. 2005, 43, 480–491. [Google Scholar] [CrossRef]
  26. Khodadadzadeh, M.; Li, J.; Plaza, A.; Ghassemian, H.; Bioucas-Dias, J.M.; Li, X. Spectral–Spatial Classification of Hyperspectral Data Using Local and Global Probabilities for Mixed Pixel Characterization. IEEE Trans. Geosci. Remote Sens. 2014, 52, 6298–6314. [Google Scholar] [CrossRef]
  27. Fang, L.; Li, S.; Duan, W.; Ren, J.; Benediktsson, J.A. Classification of Hyperspectral Images by Exploiting Spectral–Spatial Information of Superpixel via Multiple Kernels. IEEE Trans. Geosci. Remote Sens. 2015, 53, 6663–6674. [Google Scholar] [CrossRef] [Green Version]
  28. Ho, T.K. The Random Subspace Method for Constructing Decision Forests. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 832–844. [Google Scholar]
  29. Xia, J.; Ghamisi, P.; Yokoya, N.; Iwasaki, A. Random Forest Ensembles and Extended Multiextinction Profiles for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2018, 56, 202–216. [Google Scholar] [CrossRef] [Green Version]
  30. Rasti, B.; Hong, D.; Hang, R.; Ghamisi, P.; Kang, X.; Chanussot, J.; Benediktsson, J.A. Feature Extraction for Hyperspectral Imagery: The Evolution from Shallow to Deep: Overview and Toolbox. IEEE Geosci. Remote Sens. Mag. 2020, 8, 60–88. [Google Scholar] [CrossRef]
  31. Chen, T.; Guestrin, C. XGBoost. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016. [Google Scholar]
  32. Samat, A.; Li, E.; Wang, W.; Liu, S.; Lin, C.; Abuduwaili, J. Meta-XGBoost for Hyperspectral Image Classification Using Extended MSER-Guided Morphological Profiles. Remote Sens. 2020, 12, 1973. [Google Scholar] [CrossRef]
  33. Li, Z.; Huang, L.; Zhang, D.; Liu, C.; Wang, Y.; Shi, X. A Deep Network Based on Multiscale Spectral-Spatial Fusion for Hyperspectral Classification. Proc. Int. Knowl. Sci. Eng. Manag. 2018, 11062, 283–290. [Google Scholar]
  34. Li, Z.; Huang, L.; He, J. A Multiscale Deep Middle-Level Feature Fusion Network for Hyperspectral Classification. Remote Sens. 2019, 11, 695. [Google Scholar] [CrossRef] [Green Version]
  35. Heaton, J. Ian Goodfellow, Yoshua Bengio, and Aaron Courville: Deep Learning. Genet. Program. Evolvable Mach. 2017, 19, 305–307. [Google Scholar] [CrossRef] [Green Version]
  36. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional Neural Networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  37. Zhang, M.; Li, W.; Du, Q.; Gao, L.; Zhang, B. Feature extraction for classification of Hyperspectral and LIDAR data using patch-to-patch CNN. IEEE Trans. Cybern. 2020, 50, 100–111. [Google Scholar] [CrossRef]
  38. Hestness, J.; Narang, S.; Ardalani, N.; Diamos, G.; Jun, H.; Kianinejad, H.; Patwary, M.M.A.; Yang, Y.; Zhou, Y. Deep Learning Scaling Is Predictable, Empirically. Available online: https://arxiv.org/abs/1712.00409 (accessed on 29 October 2021).
  39. Arik, S.O.; Pfister, T. TabNet: Attentive Interpretable Tabular Learning. arXiv 2020, arXiv:1908.07442. Available online: https://arxiv.org/abs/1908.07442v4 (accessed on 6 November 2021).
  40. Arik, S.O.; Pfister, T. TabNet: Attentive Interpretable Tabular Learning. AAAI 2021, 35, 6679–6687. Available online: https://ojs.aaai.org/index.php/AAAI/article/view/16826 (accessed on 29 October 2021).
  41. Kemker, R.; Kanan, C. Self-Taught Feature Learning for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 2693–2705. [Google Scholar] [CrossRef]
  42. Hang, R.; Liu, Q.; Hong, D.; Ghamisi, P. Cascaded Recurrent Neural Networks for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5384–5394. [Google Scholar] [CrossRef] [Green Version]
  43. Zhu, L.; Chen, Y.; Ghamisi, P.; Benediktsson, J.A. Generative Adversarial Networks for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2018, 56, 5046–5063. [Google Scholar] [CrossRef]
  44. Chen, Y.; Jiang, H.; Li, C.; Jia, X.; Ghamisi, P. Deep Feature Extraction and Classification of Hyperspectral Images Based on Convolutional Neural Networks. IEEE Trans. Geosci. Remote Sens. 2016, 54, 6232–6251. [Google Scholar] [CrossRef] [Green Version]
  45. Cheng, G.; Li, Z.; Han, J.; Yao, X.; Guo, L. Exploring Hierarchical Convolutional Features for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2018, 56, 6712–6722. [Google Scholar] [CrossRef]
  46. Haut, J.M.; Paoletti, M.E.; Plaza, J.; Li, J.; Plaza, A. Active Learning with Convolutional Neural Networks for Hyperspectral Image Classification Using a New Bayesian Approach. IEEE Trans. Geosci. Remote Sens. 2018, 56, 6440–6461. [Google Scholar] [CrossRef]
  47. Chen, Y.; Wang, Y.; Gu, Y.; He, X.; Ghamisi, P.; Jia, X. Deep Learning Ensemble for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 1882–1897. [Google Scholar] [CrossRef]
  48. Duan, P.; Ghamisi, P.; Kang, X.; Rasti, B.; Li, S.; Gloaguen, R. Fusion of Dual Spatial Information for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2021, 59, 7726–7738. [Google Scholar] [CrossRef]
  49. Guyon, I.; Elisseeff, A. An Introduction to Variable and Feature Selection. J. Mach. Learn. Res. 2003, 3, 1157–1182. [Google Scholar]
  50. Chen, J.; Song, L.; Wainwright, M.J.; Jordan, M.I. Learning to Explain: An Information-Theoretic Perspective on Model Interpretation. In Proceedings of the International Conference on Machine Learning (ICML), 2018. Available online: https://arxiv.org/abs/1802.07814 (accessed on 2 November 2021).
  51. Yoon, J.; Jordon, J.; Schaar, M. Invase: Instance-wise variable selection using neural networks: Semantic scholar. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019; Available online: https://openreview.net/forum?id=BJg_roAcK7 (accessed on 2 November 2021).
  52. Grabczewski, K.; Jankowski, N. Feature Selection with Decision Tree Criterion. In Proceedings of the Fifth International Conference on Hybrid Intelligent Systems, Rio de Janeiro, Brazil, 6–9 November 2005. [Google Scholar]
  53. Catboost. Catboost/Benchmarks: Comparison Tools. Available online: https://github.com/catboost/benchmarks (accessed on 4 November 2021).
  54. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  55. Wang, X.; Tan, K.; Du, Q.; Chen, Y.; Du, P. Caps-Triplegan: Gan-Assisted CapsNet for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 7232–7245. [Google Scholar] [CrossRef]
  56. Peters, B.; Niculae, V.; Martins, A.F. Sparse Sequence-to-Sequence Models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019. [Google Scholar]
  57. Grandvalet, Y.; Bengio, Y. Entropy Regularization. Semi-Supervised Learn. 2006, 151–168. [Google Scholar] [CrossRef] [Green Version]
  58. Dauphin, Y.N.; Fan, A.; Auli, M.; Grangier, D. Language Modeling with Gated Convolutional Networks. 2016. Available online: https://arxiv.org/abs/1612.08083 (accessed on 28 October 2021).
  59. Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; Dauphin, Y.N. Convolutional Sequence to Sequence Learning. 2017. Available online: https://arxiv.org/abs/1705.03122v1 (accessed on 1 November 2021).
  60. Hoffer, E.; Hubara, I.; Soudry, D. Train Longer, Generalize Better: Closing the Generalization Gap in Large Batch Training of Neural Networks. 2017. Available online: http://arxiv-export-lb.library.cornell.edu/abs/1705.08741?context=cs (accessed on 27 October 2021).
  61. Goldstein, T.; Osher, S. The Split Bregman Method for L1-Regularized Problems. SIAM J. Imaging Sci. 2009, 2, 323–343. [Google Scholar] [CrossRef]
  62. Foody, G.M. Thematic Map Comparison. Photogramm. Eng. Remote Sens. 2004, 70, 627–633. [Google Scholar] [CrossRef]
Figure 1. Encoder for TabNets.
Figure 2. Feature transformer for TabNets.
Figure 3. TabNets decoder.
Figure 4. Attentive transformer for TabNets.
Figure 5. Classification maps for Indian Pines data obtained using different methods including (a) ground truth image, (b) RF (77.52%), (c) MLP (77.48%), (d) LightGBM (76.54%), (e) CatBoost (75.32%), (f) XGBoost (73.79%), (g) TabNet (82.32%), (h) uTabNet (84.44%), (i) CAE (85.07%), (j) TabNets (94.93%), (k) uTabNets (96.36%), (l) sRF (88.98%), (m) sMLP (84.17%), (n) sLightGBM (89.68%), (o) sCatBoost (89.93%), (p) sXGBoost (80.93%), (q) sTabNet (94.41%), (r) suTabNet (95.85%), (s) sCAE (95.95%), (t) sTabNets (96.40%), and (u) suTabNets (97.51%).
Figure 6. Classification maps for University of Pavia data obtained for different methods including (a) ground truth image, (b) RF (78.17%), (c) MLP (80.12%), (d) LightGBM (85.21%), (e) CatBoost (82.19%), (f) XGBoost (81.06%), (g) TabNet (90.19%), (h) uTabNet (92.58%), (i) CAE (94.26%), (j) TabNets (96.58%), (k) uTabNets (97.86%), (l) sRF (89.54%), (m) sMLP (92.28%), (n) sLightGBM (92.94%), (o) sCatBoost (94.26%), (p) sXGBoost (94.26%), (q) sTabNet (97.62%), (r) suTabNet (98.95%), (s) sCAE (98.58%), (t) sTabNets (98.38%), and (u) suTabNets (99.29%).
Figure 7. Classification maps for Salinas data obtained for different methods including (a) ground truth image, (b) RF (79.76%), (c) MLP (83.73%), (d) LightGBM (87.87%), (e) CatBoost (84.12%), (f) XGBoost (86.99%), (g) TabNet (90.45%), (h) uTabNet (91.31%), (i) CAE (92.39%), (j) TabNets (97.32%), (k) uTabNets (98.36%), (l) sRF (89.10%), (m) sMLP (93.57%), (n) sLightGBM (95.05%), (o) sCatBoost (94.70%), (p) sXGBoost (93.93%), (q) sTabNet (96.20%), (r) suTabNet (98.85%), (s) sCAE (98.95%), (t) sTabNets (98.34%), and (u) suTabNets (99.33%).
Figure 8. Overall classification accuracy (with standard deviations) of considered methods and SP-extracted versions with different numbers of training samples per class: (a,b) Indian Pines, (c,d) University of Pavia, (e,f) Salinas dataset.
Table 1. Acronyms and their meaning for variants of proposed TabNet classifiers.
Notation | Meaning
TabNet | Attentive interpretable tabular learning
uTabNet | Unsupervised pretraining on attentive interpretable tabular learning
TabNets | Attentive interpretable tabular learning with spatial attention
uTabNets | Unsupervised pretraining on attentive interpretable tabular learning with spatial attention
sTabNet | Structure profile on attentive interpretable tabular learning
suTabNet | Structure profile on unsupervised pretrained attentive interpretable tabular learning
sTabNets | Structure profile on attentive interpretable tabular learning with spatial attention
suTabNets | Structure profile on unsupervised pretrained attentive interpretable tabular learning with spatial attention
Table 2. Layer-wise summary of spatial attention in the proposed TabNets for window size 25 × 25. The last layer of the TabNets encoder is considered based upon the Indian Pines data.
(Attentive Transformer)
Layer | Shape of Output | Feature Map
FC | 6250 | 6250
BN | 6250 | 6250
Prior Scales | 6250 | 6250
Entmax | (10, 25, 25) | 10
2D Convolution | (16, 8, 8) | 16
2D Convolution | (32, 6, 6) | 32
2D Convolution | (64, 4, 4) | 64
Flatten | 1024 | 1024
FC | 6250 | 6250

(TabNets Encoder)
Layer | Feature Map
BN | 6250
Feature Transformer | 512
Split | 256
Attentive Transformer | 6250
Mask | 6250
Feature Transformer | 512
Split | 256
LeakyReLU | 256
FC | 16
Table 3. Training and testing samples with class labels in the Indian Pines dataset.
No | Class | Training | Testing
1 | Alfalfa | 5 | 41
2 | Corn-notill | 143 | 1285
3 | Corn-mintill | 83 | 747
4 | Corn | 24 | 213
5 | Grass-pasture | 48 | 435
6 | Grass-trees | 73 | 657
7 | Grass-pasture-mowed | 3 | 25
8 | Hay-windrowed | 48 | 430
9 | Oats | 2 | 18
10 | Soybean-notill | 97 | 875
11 | Soybean-mintill | 246 | 2209
12 | Soybean-clean | 59 | 534
13 | Wheat | 21 | 184
14 | Woods | 127 | 1138
15 | Buildings-grass-trees-drives | 39 | 347
16 | Stone-steel-towers | 9 | 84
Total | | 1027 | 9222
Table 4. Training and testing samples with class labels in the University of Pavia dataset.
No | Class | Training | Testing
1 | Asphalt | 200 | 6431
2 | Meadows | 200 | 18,449
3 | Gravel | 200 | 1899
4 | Trees | 200 | 2864
5 | Painted metal sheets | 200 | 1145
6 | Bare soil | 200 | 4829
7 | Bitumen | 200 | 1130
8 | Self-blocking bricks | 200 | 3482
9 | Shadows | 200 | 747
Total | | 1800 | 40,976
Table 5. Training and testing samples with class labels in the Salinas dataset.
No | Class | Training | Testing
1 | Broccoli-green-weeds-1 | 200 | 1809
2 | Broccoli-green-weeds-2 | 200 | 3526
3 | Fallow | 200 | 1776
4 | Fallow-rough-plow | 200 | 1194
5 | Fallow-smooth | 200 | 2478
6 | Stubble | 200 | 3759
7 | Celery | 200 | 3379
8 | Grapes-untrained | 200 | 11,071
9 | Soil-vineyard-develop | 200 | 6003
10 | Corn-senesced-green-weeds | 200 | 3078
11 | Lettuce-romaine-4wk | 200 | 868
12 | Lettuce-romaine-5wk | 200 | 1727
13 | Lettuce-romaine-6wk | 200 | 716
14 | Lettuce-romaine-7wk | 200 | 870
15 | Vineyard-untrained | 200 | 7068
16 | Vineyard-vertical-trellis | 200 | 1607
Total | | 3200 | 50,929
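The per-class splits in Tables 3–5 (roughly 10% of each class for Indian Pines, and a fixed 200 samples per class for University of Pavia and Salinas) can be reproduced with a simple stratified sampler. The sketch below is illustrative only; the function and variable names are placeholders rather than the paper's code, and the random seed is arbitrary.

```python
# Illustrative per-class train/test split over the labeled pixels of a
# ground-truth map (label 0 = unlabeled background).
import numpy as np

def split_per_class(labels, n_train=200, fraction=None, seed=0):
    """Return (train_idx, test_idx), sampling per class from labeled pixels."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(labels[labels > 0]):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # either a fraction per class (Indian Pines) or a fixed count (Pavia, Salinas)
        k = max(1, int(round(fraction * idx.size))) if fraction else min(n_train, idx.size)
        train_idx.append(idx[:k])
        test_idx.append(idx[k:])
    return np.concatenate(train_idx), np.concatenate(test_idx)

# e.g., Indian Pines:  train, test = split_per_class(gt.ravel(), fraction=0.10)
#       Pavia/Salinas: train, test = split_per_class(gt.ravel(), n_train=200)
```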
Table 6. Parameter tuning in different algorithms.
Methods | Indian (N_d = N_a, λ, r) | Pavia (N_d = N_a, λ, r) | Salinas (N_d = N_a, λ, r)
TabNet | 256, -, - | 64, -, - | 512, -, -
uTabNet | 32, -, 0.7 | 64, -, 0.6 | 64, -, 0.8
TabNets | 256, -, - | 256, -, - | 256, -, -
uTabNets | 256, -, 0.6 | 256, -, 0.7 | 256, -, 0.7
sTabNet | 32, 1, - | 64, 0.5, - | 64, 1, -
suTabNet | 32, 1, 0.7 | 64, 0.5, 0.6 | 64, 1, 0.8
sTabNets | 256, 1, - | 256, 0.5, - | 256, 1, -
suTabNets | 256, 1, 0.7 | 256, 0.5, 0.6 | 256, 1, 0.8
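As an illustration of how the widths and pretraining ratio in Table 6 translate into a configuration, the following sketch uses the open-source pytorch-tabnet interface with the uTabNet settings for Salinas (N_d = N_a = 64, r = 0.8). This is an assumed stand-in: the enhanced variants described here replace the standard encoder with the spatial-attention version of Table 2, and the structure-profile parameter λ belongs to the pre-processing stage rather than the classifier. The data arrays are synthetic placeholders.

```python
# Minimal sketch, assuming the pytorch-tabnet package, of unsupervised
# pretraining followed by supervised fine-tuning with the widths and
# pretraining ratio listed in Table 6 (uTabNet, Salinas column).
import numpy as np
from pytorch_tabnet.pretraining import TabNetPretrainer
from pytorch_tabnet.tab_model import TabNetClassifier

X_unlabeled = np.random.rand(1000, 204).astype(np.float32)   # placeholder spectra
X_train = np.random.rand(200, 204).astype(np.float32)
y_train = np.random.randint(0, 16, size=200)

# Unsupervised pretraining with reconstruction ratio r from Table 6
pretrainer = TabNetPretrainer(n_d=64, n_a=64)
pretrainer.fit(X_train=X_unlabeled, pretraining_ratio=0.8,
               max_epochs=10, batch_size=128, virtual_batch_size=64)

# Supervised fine-tuning warm-started from the pretrained encoder
clf = TabNetClassifier(n_d=64, n_a=64)
clf.fit(X_train, y_train, from_unsupervised=pretrainer,
        max_epochs=10, batch_size=128, virtual_batch_size=64)
```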
Table 7. Varying window size in TabNets (OA in percentage).
Window | Indian | Pavia | Salinas
19 × 19 | 92.75 | 95.44 | 96.87
21 × 21 | 93.23 | 96.04 | 96.97
23 × 23 | 94.50 | 96.14 | 97.07
25 × 25 | 94.93 | 96.58 | 97.32
27 × 27 | 94.53 | 96.29 | 97.15
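The window sizes in Table 7 determine how many neighboring pixels are flattened into each tabular sample (e.g., 25 × 25 × 10 = 6250 features for the configuration in Table 2). A minimal sketch of such patch extraction is given below; it assumes a PCA-reduced cube, uses reflection padding at the image borders, and all array names are placeholders.

```python
# Sketch: flatten a w x w spatial window around each labeled pixel into one
# tabular feature vector (w*w*B columns for a B-band, PCA-reduced cube).
import numpy as np

def extract_windows(cube, gt, window=25):
    """cube: (H, W, B) PCA-reduced image; gt: (H, W) labels (0 = unlabeled)."""
    pad = window // 2
    padded = np.pad(cube, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    rows, cols = np.nonzero(gt)
    X = np.stack([padded[r:r + window, c:c + window, :].ravel()
                  for r, c in zip(rows, cols)])
    y = gt[rows, cols]
    return X, y

# e.g., X, y = extract_windows(pca_cube, gt, window=25)   # X has w*w*B columns
```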
Table 8. Classification accuracies on the Indian Pines dataset (10 percent training samples per class).
Class | RF | MLP | LightGBM | CatBoost | XGBoost | TabNet | uTabNet | CAE | TabNets | uTabNets
1 | 33.33 | 100 | 58.62 | 64.28 | 88.88 | 80.19 | 83.72 | 77.77 | 96.42 | 100
2 | 62.83 | 77.02 | 73.28 | 67.42 | 65.57 | 73.97 | 79.36 | 76.74 | 96.10 | 94.53
3 | 76.10 | 71.42 | 71.62 | 75.06 | 74.16 | 79.56 | 83.18 | 75.06 | 88.28 | 96.97
4 | 30.31 | 54.54 | 86.04 | 62.32 | 53.27 | 61.48 | 62.92 | 86.50 | 95.72 | 99.42
5 | 94.84 | 85.71 | 96.81 | 85.25 | 83.95 | 92.53 | 91.82 | 94.63 | 91.15 | 97.45
6 | 76.36 | 90.00 | 96.06 | 83.22 | 84.60 | 90.94 | 97.31 | 93.75 | 96.95 | 99.84
7 | 50.00 | 79.88 | 50.00 | 63.55 | 33.33 | 78.88 | 96.00 | 89.47 | 90.07 | 100
8 | 85.68 | 92.30 | 90.80 | 87.82 | 85.77 | 92.43 | 97.86 | 91.68 | 99.40 | 100
9 | 41.07 | 60.12 | 100 | 46.45 | 10.52 | 66.87 | 87.50 | 100 | 94.11 | 100
10 | 86.83 | 71.42 | 88.96 | 73.28 | 70.40 | 82.56 | 80.74 | 79.95 | 95.35 | 98.72
11 | 74.58 | 72.59 | 61.38 | 69.71 | 68.12 | 85.22 | 83.17 | 83.26 | 96.36 | 93.91
12 | 72.56 | 80.76 | 69.69 | 60.18 | 58.84 | 70.01 | 79.14 | 78.51 | 94.60 | 94.22
13 | 50.00 | 83.33 | 93.51 | 85.15 | 86.24 | 84.41 | 90.72 | 96.33 | 99.71 | 98.93
14 | 92.19 | 89.55 | 93.09 | 88.16 | 89.05 | 90.45 | 91.87 | 94.46 | 97.76 | 96.42
15 | 74.94 | 71.42 | 90.68 | 71.47 | 69.15 | 70.80 | 64.55 | 91.21 | 94.50 | 99.70
16 | 99.27 | 60.00 | 75.25 | 97.29 | 92.95 | 97.34 | 97.18 | 100 | 92.50 | 94.28
OA | 77.20 | 78.84 | 76.54 | 75.32 | 73.79 | 82.32 | 84.14 | 85.07 | 94.93 | 96.36
AA | 68.84 | 77.50 | 80.99 | 73.78 | 69.67 | 81.10 | 85.44 | 88.08 | 94.94 | 97.77
Kappa | 0.73 | 0.75 | 0.72 | 0.71 | 0.6970 | 0.79 | 0.82 | 0.83 | 0.94 | 0.96
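The per-class accuracies, OA, AA, and kappa reported in Tables 8–13 can be obtained from the confusion matrix of the test predictions; a minimal sketch using scikit-learn's standard metrics follows.

```python
# Sketch: per-class accuracy, overall accuracy (OA), average accuracy (AA),
# and Cohen's kappa from true and predicted labels.
import numpy as np
from sklearn.metrics import confusion_matrix, cohen_kappa_score

def summarize(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    per_class = np.diag(cm) / cm.sum(axis=1)      # accuracy of each class
    oa = np.diag(cm).sum() / cm.sum()             # overall accuracy
    aa = per_class.mean()                         # average (per-class) accuracy
    kappa = cohen_kappa_score(y_true, y_pred)
    return per_class, oa, aa, kappa
```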
Table 9. SP Classification accuracies on Indian Pines dataset (10 percent training samples per class).
Class | sRF | sMLP | sLightGBM | sCatBoost | sXGBoost | sTabNet | suTabNet | sCAE | sTabNets | suTabNets
1 | 90.62 | 60.00 | 74.28 | 85.29 | 29.16 | 89.21 | 100 | 100 | 100 | 100
2 | 88.32 | 71.85 | 88.32 | 87.48 | 72.21 | 94.29 | 95.25 | 93.61 | 90.09 | 97.85
3 | 86.86 | 78.13 | 82.15 | 85.09 | 71.69 | 92.92 | 93.04 | 95.54 | 96.39 | 94.14
4 | 75.13 | 86.02 | 82.70 | 80.14 | 76.26 | 89.46 | 93.87 | 82.74 | 99.47 | 100
5 | 97.95 | 80.72 | 93.11 | 97.98 | 89.87 | 97.53 | 96.54 | 95.33 | 99.00 | 98.33
6 | 95.17 | 93.44 | 95.30 | 93.65 | 96.58 | 98.72 | 99.85 | 98.48 | 93.19 | 95.67
7 | 95.00 | 50.00 | 92.85 | 100 | 100 | 96.00 | 100 | 100 | 100 | 100
8 | 94.24 | 93.27 | 94.09 | 94.03 | 89.96 | 99.19 | 98.85 | 100 | 100 | 99.76
9 | 45.71 | 40.00 | 66.66 | 100 | 75.00 | 73.51 | 81.82 | 100 | 100 | 92.85
10 | 87.18 | 75.67 | 88.86 | 88.08 | 82.93 | 93.19 | 92.89 | 99.36 | 98.06 | 98.93
11 | 87.57 | 87.39 | 88.89 | 89.36 | 76.75 | 95.26 | 96.51 | 95.86 | 96.57 | 96.34
12 | 73.70 | 67.91 | 77.04 | 78.55 | 62.11 | 87.75 | 87.31 | 89.31 | 96.19 | 97.60
13 | 92.30 | 60.00 | 95.13 | 95.78 | 92.28 | 97.60 | 97.86 | 100 | 98.93 | 100
14 | 95.76 | 97.71 | 96.89 | 96.61 | 94.42 | 99.55 | 99.65 | 98.77 | 97.73 | 99.64
15 | 87.78 | 73.63 | 94.09 | 89.80 | 84.82 | 97.38 | 97.41 | 93.48 | 100 | 99.68
16 | 98.61 | 97.67 | 98.57 | 97.29 | 82.54 | 97.45 | 97.47 | 96.20 | 97.40 | 90.24
OA | 88.98 | 84.17 | 89.68 | 89.93 | 80.93 | 94.41 | 95.85 | 95.95 | 96.40 | 97.51
AA | 86.99 | 75.83 | 88.06 | 91.20 | 79.77 | 93.69 | 95.52 | 96.17 | 97.69 | 97.56
Kappa | 0.87 | 0.81 | 0.88 | 0.88 | 0.78 | 0.94 | 0.95 | 0.95 | 0.95 | 0.97
Table 10. Classification accuracies on University of Pavia dataset (200 training samples per class).
Class | RF | MLP | LightGBM | CatBoost | XGBoost | TabNet | uTabNet | CAE | TabNets | uTabNets
1 | 96.65 | 96.88 | 95.94 | 95.88 | 95.98 | 96.58 | 98.53 | 98.98 | 92.71 | 96.27
2 | 92.80 | 92.62 | 94.84 | 93.53 | 93.35 | 96.95 | 98.14 | 98.45 | 98.85 | 98.88
3 | 59.05 | 56.78 | 69.38 | 63.56 | 62.23 | 72.76 | 79.50 | 89.67 | 92.16 | 92.67
4 | 70.10 | 67.17 | 76.16 | 70.62 | 70.78 | 79.59 | 88.48 | 97.44 | 99.37 | 98.75
5 | 96.27 | 98.31 | 94.83 | 97.00 | 93.25 | 97.18 | 99.21 | 99.47 | 98.99 | 100
6 | 48.48 | 55.47 | 66.76 | 60.41 | 56.74 | 81.67 | 87.83 | 81.17 | 97.44 | 97.61
7 | 57.15 | 59.82 | 57.36 | 55.91 | 55.32 | 70.21 | 67.34 | 82.30 | 97.01 | 91.69
8 | 81.14 | 85.07 | 83.10 | 81.72 | 81.39 | 86.06 | 81.31 | 88.57 | 90.39 | 91.26
9 | 99.86 | 99.73 | 99.60 | 99.33 | 99.60 | 98.74 | 100 | 98.03 | 96.03 | 99.41
OA | 78.17 | 80.12 | 85.21 | 82.19 | 81.06 | 90.19 | 92.58 | 94.26 | 96.58 | 97.86
AA | 77.95 | 79.09 | 82.00 | 79.77 | 78.74 | 86.64 | 88.93 | 92.68 | 95.88 | 96.28
Kappa | 0.71 | 0.73 | 0.80 | 0.76 | 0.75 | 0.86 | 0.89 | 0.92 | 0.95 | 0.96
Table 11. SP Classification accuracies on University of Pavia dataset (200 training samples per class).
Class | sRF | sMLP | sLightGBM | sCatBoost | sXGBoost | sTabNet | suTabNet | sCAE | sTabNets | suTabNets
1 | 97.63 | 97.18 | 97.20 | 97.81 | 98.36 | 98.19 | 99.55 | 98.80 | 97.01 | 99.44
2 | 98.52 | 98.57 | 98.71 | 98.64 | 98.72 | 99.59 | 99.70 | 99.71 | 99.93 | 99.93
3 | 82.52 | 86.31 | 86.31 | 84.75 | 85.39 | 96.07 | 97.61 | 99.78 | 95.23 | 99.71
4 | 79.72 | 80.71 | 76.75 | 82.62 | 84.22 | 92.18 | 97.19 | 94.31 | 98.48 | 98.00
5 | 99.60 | 99.95 | 94.44 | 98.82 | 97.73 | 99.65 | 99.69 | 99.82 | 99.92 | 97.98
6 | 71.43 | 87.82 | 91.96 | 94.22 | 90.91 | 98.17 | 99.38 | 99.27 | 99.68 | 98.74
7 | 68.53 | 58.87 | 70.62 | 70.45 | 74.71 | 93.32 | 97.62 | 97.83 | 93.13 | 98.29
8 | 86.59 | 88.34 | 87.58 | 89.92 | 90.01 | 92.18 | 96.22 | 97.97 | 94.35 | 98.00
9 | 99.60 | 100 | 94.49 | 99.59 | 100 | 99.20 | 100 | 96.12 | 98.87 | 99.59
OA | 89.54 | 92.28 | 92.94 | 94.20 | 94.26 | 97.62 | 98.95 | 98.58 | 98.38 | 99.29
AA | 87.12 | 88.63 | 88.68 | 90.76 | 91.11 | 96.50 | 98.55 | 98.19 | 97.40 | 98.85
Kappa | 0.86 | 0.89 | 0.90 | 0.92 | 0.92 | 0.96 | 0.98 | 0.98 | 0.97 | 0.99
Table 12. Classification accuracies on Salinas dataset (200 training samples per class).
Class | RF | MLP | LightGBM | CatBoost | XGBoost | TabNet | uTabNet | CAE | TabNets | uTabNets
1 | 99.58 | 99.57 | 99.52 | 97.62 | 99.88 | 98.46 | 99.83 | 99.77 | 96.86 | 99.08
2 | 99.53 | 98.56 | 99.29 | 97.99 | 98.70 | 99.64 | 99.68 | 100 | 99.74 | 99.44
3 | 47.75 | 95.13 | 91.76 | 83.35 | 90.46 | 97.98 | 98.00 | 96.97 | 95.66 | 100
4 | 96.16 | 97.53 | 95.81 | 91.38 | 95.73 | 97.27 | 98.63 | 99.58 | 97.13 | 99.12
5 | 95.23 | 84.93 | 99.04 | 97.30 | 99.16 | 98.45 | 99.06 | 99.27 | 92.01 | 99.44
6 | 98.48 | 99.97 | 99.78 | 99.94 | 99.71 | 99.78 | 99.97 | 99.97 | 99.71 | 99.79
7 | 98.68 | 96.02 | 99.13 | 99.03 | 99.23 | 99.07 | 99.86 | 99.49 | 99.74 | 99.57
8 | 75.81 | 74.34 | 79.61 | 76.36 | 77.63 | 84.22 | 82.95 | 87.41 | 94.42 | 98.01
9 | 97.03 | 98.20 | 98.41 | 98.21 | 98.15 | 99.46 | 99.41 | 99.98 | 99.45 | 99.95
10 | 93.12 | 82.48 | 85.60 | 82.03 | 84.28 | 90.28 | 97.11 | 97.70 | 98.51 | 100
11 | 79.47 | 70.53 | 88.74 | 70.15 | 87.71 | 94.87 | 97.56 | 87.04 | 91.64 | 99.84
12 | 47.45 | 89.39 | 95.90 | 91.34 | 95.87 | 97.18 | 98.04 | 98.74 | 98.18 | 99.12
13 | 47.47 | 91.60 | 92.35 | 82.00 | 93.62 | 97.18 | 99.16 | 100 | 93.12 | 97.88
14 | 79.89 | 94.48 | 86.04 | 74.77 | 81.90 | 90.20 | 95.32 | 98.40 | 97.28 | 96.91
15 | 53.33 | 54.90 | 63.47 | 59.25 | 61.87 | 69.21 | 70.01 | 71.86 | 96.62 | 94.98
16 | 87.46 | 95.19 | 95.68 | 77.27 | 94.10 | 97.17 | 99.06 | 99.62 | 98.86 | 99.82
OA | 79.76 | 83.73 | 87.87 | 84.12 | 86.99 | 90.35 | 91.31 | 92.39 | 97.32 | 98.36
AA | 89.02 | 88.93 | 91.89 | 86.12 | 91.13 | 94.40 | 95.85 | 95.99 | 96.81 | 98.93
Kappa | 0.77 | 0.81 | 0.86 | 0.81 | 0.85 | 0.88 | 0.90 | 0.91 | 0.97 | 0.98
Table 13. SP Classification accuracies on Salinas dataset (200 training samples per class).
Class | sRF | sMLP | sLightGBM | sCatBoost | sXGBoost | sTabNet | suTabNet | sCAE | sTabNets | suTabNets
1 | 100 | 100 | 100 | 99.28 | 69.43 | 99.97 | 99.98 | 99.62 | 99.65 | 100
2 | 99.94 | 99.97 | 99.97 | 99.72 | 99.94 | 99.90 | 99.92 | 100 | 99.97 | 99.94
3 | 93.71 | 97.92 | 94.38 | 94.15 | 96.02 | 98.42 | 99.75 | 99.94 | 99.91 | 100
4 | 96.18 | 97.57 | 95.92 | 96.35 | 95.80 | 97.34 | 97.15 | 100 | 99.46 | 98.72
5 | 98.63 | 96.05 | 99.35 | 98.15 | 97.54 | 98.57 | 99.60 | 99.52 | 99.52 | 99.32
6 | 99.53 | 100 | 99.97 | 96.82 | 99.41 | 99.66 | 100 | 100 | 99.91 | 99.73
7 | 99.80 | 99.65 | 99.58 | 99.31 | 99.96 | 99.45 | 99.87 | 100 | 99.70 | 99.97
8 | 96.39 | 91.68 | 92.54 | 95.12 | 89.84 | 97.59 | 99.02 | 98.21 | 97.21 | 99.00
9 | 98.91 | 99.34 | 98.69 | 99.05 | 98.72 | 100 | 99.96 | 100 | 99.23 | 100
10 | 93.32 | 90.46 | 89.71 | 93.18 | 95.51 | 92.84 | 97.39 | 99.87 | 99.81 | 99.83
11 | 84.62 | 81.26 | 98.80 | 84.00 | 98.95 | 94.01 | 97.12 | 98.72 | 99.89 | 99.90
12 | 77.90 | 96.42 | 99.00 | 99.76 | 99.64 | 99.40 | 99.47 | 98.78 | 99.67 | 100
13 | 82.68 | 98.80 | 97.50 | 98.58 | 99.09 | 94.09 | 99.17 | 100 | 98.05 | 99.65
14 | 93.41 | 97.41 | 98.83 | 73.44 | 98.23 | 95.87 | 94.52 | 99.35 | 99.19 | 100
15 | 63.38 | 79.93 | 86.23 | 87.20 | 89.95 | 84.63 | 96.68 | 96.53 | 95.34 | 97.41
16 | 95.74 | 95.08 | 96.44 | 92.23 | 98.64 | 95.64 | 99.78 | 100 | 99.70 | 100
OA | 89.10 | 93.57 | 95.05 | 94.70 | 93.93 | 96.20 | 98.85 | 98.95 | 98.34 | 99.33
AA | 92.13 | 95.16 | 96.68 | 94.15 | 95.34 | 97.07 | 98.71 | 99.41 | 99.14 | 99.59
Kappa | 0.88 | 0.92 | 0.94 | 0.94 | 0.93 | 0.95 | 0.98 | 0.98 | 0.98 | 0.99
Table 14. Significance from the standard McNemar’s test for the difference between algorithms.
(Entries are Z value/significant?)
Comparison | Indian | Pavia | Salinas
TabNet versus RF | 8.57/yes | 47.54/yes | 44.62/yes
TabNet versus MLP | 8.83/yes | 40.35/yes | 35.88/yes
TabNet versus LightGBM | 10.34/yes | 22.11/yes | 13.40/yes
TabNet versus CatBoost | 12.51/yes | 30.11/yes | 25.09/yes
TabNet versus XGBoost | 8.70/yes | 35.80/yes | 16.91/yes
uTabNet versus TabNet | 3.21/yes | 8.72/yes | 5.33/yes
TabNets versus TabNet | 26.16/yes | 43.24/yes | 44.85/yes
uTabNets versus CAE | 20.21/yes | 29.78/yes | 30.45/yes
uTabNets versus TabNets | 3.77/yes | 12.77/yes | 6.15/yes
sTabNet versus sRF | 14.17/yes | 47.14/yes | 46.09/yes
sTabNet versus sMLP | 26.74/yes | 35.29/yes | 21.35/yes
sTabNet versus sLightGBM | 12.67/yes | 33.90/yes | 12.20/yes
sTabNet versus sCatBoost | 12.13/yes | 22.45/yes | 14.43/yes
sTabNet versus sXGBoost | 27.69/yes | 25.62/yes | 21.26/yes
sTabNet versus TabNet | 25.76/yes | 44.01/yes | 41.96/yes
suTabNet versus sTabNet | 3.67/yes | 14.99/yes | 23.84/yes
sTabNets versus TabNets | 3.81/yes | 15.55/yes | 5.98/yes
suTabNets versus uTabNets | 3.58/yes | 13.87/yes | 5.89/yes
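The Z values in Table 14 correspond to the standard McNemar statistic, Z = (f12 − f21)/√(f12 + f21), where f12 counts test samples classified correctly by the first method but wrongly by the second and f21 the reverse; |Z| > 1.96 indicates a significant difference at the 5% level. A minimal sketch of this computation, with placeholder prediction arrays, is given below.

```python
# Sketch: standard (non-corrected) McNemar Z statistic between two classifiers
# evaluated on the same test labels.
import numpy as np

def mcnemar_z(y_true, pred_1, pred_2):
    correct_1 = (pred_1 == y_true)
    correct_2 = (pred_2 == y_true)
    f12 = np.sum(correct_1 & ~correct_2)   # right by method 1, wrong by method 2
    f21 = np.sum(~correct_1 & correct_2)   # wrong by method 1, right by method 2
    return (f12 - f21) / np.sqrt(f12 + f21)

# e.g., z = mcnemar_z(y_test, tabnet_pred, rf_pred); significant = abs(z) > 1.96
```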
Table 15. Execution time (in seconds) for different experimental datasets.
Method | Indian | Pavia | Salinas
RF | 5.31 | 4.99 | 5.64
MLP | 12.04 | 16.54 | 0.67
LightGBM | 380.21 | 345.01 | 1080.63
CatBoost | 38.83 | 25.94 | 8.10
XGBoost | 40.49 | 15.07 | 25.29
TabNet | 710.05 | 639.10 | 4637.26
uTabNet | 873.82 | 747.34 | 837.24
CAE | 915.28 | 1265.15 | 1315.54
TabNets | 938.03 | 1620.17 | 1890.56
uTabNets | 1796.25 | 2580.23 | 3520.17
sTabNet | 963.05 | 1027.09 | 1206.17
suTabNet | 1017.13 | 1180.33 | 1412.33
sTabNets | 973.56 | 1780.03 | 1984.54
suTabNets | 1880.17 | 2663.37 | 3720.57
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
