Enhanced TabNet: Attentive Interpretable Tabular Learning for Hyperspectral Image Classification

Abstract: Tree-based methods and deep neural networks (DNNs) have drawn much attention in image classification. The interpretable canonical deep tabular data learning architecture (TabNet), which combines the concepts of tree-based techniques and DNNs, can be used for hyperspectral image classification. Such an architecture uses sequential attention to choose appropriate salient features at each decision step, which enables interpretability and efficient learning with increased learning capacity. In this paper, TabNet with spatial attention (TabNets) is proposed to include spatial information, in which a 2D convolutional neural network (CNN) is incorporated inside an attentive transformer for spatial soft feature selection. In addition, spatial information is exploited by feature extraction in a preprocessing stage, where an adaptive texture smoothing method is used to construct a structure profile (SP), and the extracted SP is fed into TabNet (sTabNet) to further enhance performance. Moreover, the performance of TabNet-class approaches can be improved by introducing unsupervised pretraining. Overall accuracy of the unsupervised pretrained version of the proposed TabNets, i.e., uTabNets, improves by 11.29% to 12.61%, 3.6% to 7.67%, and 5.97% to 8.01% over other classification techniques, at the cost of increases in computational complexity by factors of 1.96 to 2.52, 2.03 to 3.45, and 2.67 to 5.52, respectively. Experimental results obtained on different hyperspectral datasets demonstrate the superiority of the proposed approaches over other state-of-the-art techniques, including DNNs and decision tree variants.


Introduction
Hyperspectral imagery (HSI) consists of abundant spatial and spectral information in a 3D data cube with hundreds of narrow spectral bands. Due to high spectral resolution, it has been applied in many applications, such as pollution monitoring, urban planning, analysis for land use, and land cover [1][2][3][4]. However, an increase in spatial and spectral information poses a challenge in HSI analysis. Thus, analysis of HSI, such as classification, dimensionality reduction [1,5], and feature extraction [6,7], has obtained much attention among the remote sensing community for decades [8]. Moreover, such approaches can be applicable towards vision technology applications in other engineering domains [9][10][11], multispectral remote sensing, and synthetic aperture radar (SAR) imagery [12,13].
In recent decades, spectral-based classification approaches such as the support vector machine (SVM) and composite-kernel SVM (SVM-CK) have been widely used in remote sensing [14][15][16]. In addition, different spatial-spectral features have been introduced for HSI classification [17,18]. Sparse representation (SR) was successfully applied to HSI classification in [19], inspired by the successful application of sparse representation in face recognition [20], and many sparse and collaborative representation-based approaches followed.

In this work, we observed enhanced performance of unsupervised pretraining on TabNet (uTabNet) for HSI classification, and pretraining was extended to TabNets, resulting in uTabNets. The unsupervised pretrained version of TabNets, i.e., uTabNets, considers sequential attention in addition to spatial processing of the masks by a 2D CNN in the attentive transformer.
Moreover, the existing TabNet does not include any preprocessing stage, which limits its learning ability. Including spatial information in a spectral classifier has consistently led to increased classification accuracy. Many deep learning classifiers, such as recurrent neural networks (RNNs) [42] and generative adversarial networks (GANs) [43], use CNNs for deep feature extraction with several convolutional and pooling layers [44,45]. However, most deep learning methods need massive amounts of training data to learn their parameters accurately. To deal with such issues, various classification frameworks, such as active learning [46] and ensemble learning [47], have been introduced. In addition, spatial optimization using the structure profile (SP) was introduced in [48] for feature extraction purposes. In this paper, we incorporate the SP into TabNet, yielding TabNet with structure profile (sTabNet). Similarly, the SP is used in extended versions of TabNet, including uTabNet with SP (suTabNet), TabNets with SP (sTabNets), and uTabNets with SP (suTabNets).
The main contribution of this work can be summarized as follows: 1. It introduces TabNet for HSI classification and improves classification performance by applying unsupervised pretraining in uTabNet; 2. It develops TabNets and uTabNets after including spatial information in the attentive transformer; 3. It includes SP in sTabNet as a feature extraction to further improve the classification performance of SP versions of TabNet, i.e., suTabNet, sTabNets, and suTabNets.
The remainder of this article is organized as follows. Section 2 presents related work. Section 3 discusses the proposed TabNet versions for hyperspectral image classification. Section 4 shows experimental results along with a discussion. Section 5 concludes the article.

Related Work
Features should be selected wisely for meaningful prediction in machine learning. Global feature selection methods select appropriate features based on the entire training dataset. Forward selection and LASSO regularization are widely used global feature selection techniques [49]. Forward selection iteratively adds the most appropriate feature at each step, and LASSO regularization can assign zero weights to irrelevant features in a linear model. As stated in [50], instance-wise feature selection selects individual features for each input, using an explainer model to maximize the mutual information between the response variable and the selected features. Moreover, an actor-critic framework can be used to mimic a baseline by optimizing the feature selection [51]; in this framework, a reward is generated by the predicting network for the selecting network. In contrast, TabNet performs soft feature selection with controllable sparsity, combining feature selection and output mapping in a single architecture, and can provide better feature representations to enhance performance.
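As a concrete illustration of global feature selection, the greedy forward-selection loop described above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the least-squares scoring criterion and the toy data are assumptions, not the evaluation protocol of [49].

```python
import numpy as np

def forward_selection(X, y, n_select):
    """Greedy forward selection: at each iteration, add the feature whose
    inclusion most reduces the residual error of a least-squares fit."""
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(n_select):
        best_f, best_err = None, np.inf
        for f in remaining:
            cols = X[:, selected + [f]]
            # residual error of the least-squares fit with the candidate set
            coef, *_ = np.linalg.lstsq(cols, y, rcond=None)
            err = np.sum((cols @ coef - y) ** 2)
            if err < best_err:
                best_f, best_err = f, err
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

# toy example: y depends only on features 0 and 2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 2]
print(forward_selection(X, y, 2))
```

On this toy data the loop recovers exactly the two informative features, which is the behavior the global methods above exploit on real tabular data.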

Tree Based Learning
Tree-based methods are well suited to tabular data learning, as they can provide statistical information gain by picking global features [52]. Ensembling can enhance the performance of tree-based models; for example, random forests (RF) grow many trees on random subsets of the data with randomly selected features [28,30]. Furthermore, CatBoost [53], XGBoost [31,32], and LightGBM [54] are recent ensemble decision tree approaches that can provide better classification performance. Deep learning equipped with a feature selection capability can outperform such tree-based techniques.

Attentive Interpretable Tabular Learning (TabNet)
TabNet provides tree-like functionality: it forms linear combinations of features by determining the coefficients of each feature's contribution in the decision process. It uses sparse instance-wise feature selection learned from the training dataset and constructs a sequential multi-step architecture in which each decision step contributes a portion of the final decision based on the selected features. Furthermore, features are processed nonlinearly. In advanced tasks such as HSI classification or anomaly detection, intrinsic spectral features need to be considered in detail to avoid the problems of non-identical spectra from the same material and similar spectra from different materials [55]. Conventional DNNs, such as the multi-layer perceptron (MLP) or stacked convolutional layers, lack a mechanism for soft feature selection. TabNet is therefore attractive compared with conventional DNN-based approaches because of its powerful soft feature selection capability and its control of sparsity through sequential attention.

Proposed Method
The different variants of enhanced TabNet classifiers proposed in this work are summarized in Table 1.

Notation: Meaning
TabNet: Attentive interpretable tabular learning
uTabNet: Unsupervised pretraining on attentive interpretable tabular learning
TabNets: Attentive interpretable tabular learning with spatial attention
uTabNets: Unsupervised pretraining on attentive interpretable tabular learning with spatial attention
sTabNet: Structure profile on attentive interpretable tabular learning
suTabNet: Structure profile on unsupervised pretrained attentive interpretable tabular learning
sTabNets: Structure profile on attentive interpretable tabular learning with spatial attention
suTabNets: Structure profile on unsupervised pretrained attentive interpretable tabular learning with spatial attention

TabNet for Hyperspectral Image Classification
Suppose that a hyperspectral dataset with d spectral bands contains M labeled samples from C classes, where each sample is represented by a spectral vector $\mathbf{x} \in \mathbb{R}^d$ with a corresponding label vector $\mathbf{y}$. As shown in Figure 1, spectral features are used as inputs to TabNet. Suppose the training data X are passed to the initial decision step with batch size B. The feature selection process then includes the following steps: (1) the "split" module separates the output of the initial feature transformer to obtain the features $\mathbf{a}[i-1]$ used in Step 1 when i = 1; (2) if we disregard the spatial information in the attentive transformer of TabNets shown in Figure 4 below, it becomes the attentive transformer of TabNet, which uses a trainable function $h_i$, consisting of a fully connected (FC) layer and a batch normalization (BN) layer, to generate high-dimensional features; (3) in each step, interpretable information is provided by the masks used for selecting features, and global interpretability can be attained by aggregating the masks from different decision steps. This process can enhance discriminative ability in the spectral domain by implementing local and global interpretability for HSI feature selection.

The attentive transformer then generates the masks $\mathbf{M}[i]$ as a soft selection of salient features using the processed features $\mathbf{a}[i-1]$ from the previous step:

$$\mathbf{M}[i] = \mathrm{entmax}\big(\mathbf{P}[i-1] \cdot h_i(\mathbf{a}[i-1])\big), \quad (1)$$

where entmax normalization [56] inherits the desirable sparsity of sparsemax while providing a smoother, differentiable curvature (sparsemax is piecewise linear), and $\mathbf{P}[i-1]$ is the prior scale term that denotes how much a particular feature has been used previously:

$$\mathbf{P}[i] = \prod_{j=1}^{i} \big(\gamma - \mathbf{M}[j]\big),$$

where $\gamma$ is a relaxation parameter such that a feature is used at only one decision step when $\gamma = 1$ and can be used in multiple decision steps as $\gamma$ increases.
For an input attention vector $\mathbf{z}$, the sparsemax output is estimated as

$$\mathrm{sparsemax}(\mathbf{z}) = \arg\min_{\mathbf{p} \in \Delta^D} \|\mathbf{p} - \mathbf{z}\|^2,$$

where $\Delta^D$ represents the probability simplex and $\mathrm{sparsemax}(\mathbf{z})$ assigns zero probability to choices with low scores. Entmax normalization instead provides a continuous probability distribution, estimating better distributions than sparsemax normalization. (4) A sparsity regularization term in the form of entropy [57] is used for controlling the sparsity of the selected features:

$$L_{\mathrm{sparse}} = \sum_{i=1}^{N_{\mathrm{steps}}} \sum_{b=1}^{B} \sum_{j=1}^{D} \frac{-\mathbf{M}_{b,j}[i]\,\log\big(\mathbf{M}_{b,j}[i] + \epsilon\big)}{N_{\mathrm{steps}} \cdot B},$$

where $\epsilon$ takes a small value for numerical stability. The sparsity regularization is added to the overall loss as $\lambda_{\mathrm{sparse}} \times L_{\mathrm{sparse}}$, which can provide a favorable bias for convergence to high accuracy on datasets with redundant features. (5) A sequential multi-step decision process with $N_{\mathrm{steps}}$ steps is used in TabNet's encoding: the processed information from the $(i-1)$-th step is passed to the $i$-th step to decide which features to use, and the outputs are obtained by aggregating the processed feature representations in the overall decision function, as shown by the feature attributes in Figure 1.
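The mask generation, prior-scale update, and entropy regularization described above can be sketched in NumPy. This is an illustrative sketch: fixed logits stand in for the trainable function $h_i$, and sparsemax is used in place of entmax for brevity.

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex.
    Low-scoring entries receive exactly zero probability."""
    z = np.asarray(z, dtype=float)
    zs = np.sort(z)[::-1]                 # sort scores in descending order
    css = np.cumsum(zs)
    ks = np.arange(1, z.size + 1)
    k = ks[1 + ks * zs > css][-1]         # size of the support
    tau = (css[k - 1] - 1) / k            # threshold
    return np.maximum(z - tau, 0.0)

def sparsity_loss(masks, eps=1e-15):
    """Entropy regularization L_sparse averaged over decision steps and
    batch; masks has shape (N_steps, B, D)."""
    n_steps, batch, _ = masks.shape
    return float(np.sum(-masks * np.log(masks + eps)) / (n_steps * batch))

# Two decision steps over D = 3 features, with fixed logits standing in
# for the trainable function h_i (an assumption for illustration).
logits = np.array([2.0, 1.0, 0.1])
gamma, P = 1.3, np.ones(3)                # prior scale initialized to ones
masks = []
for _ in range(2):
    M = sparsemax(P * logits)             # soft feature selection mask
    P = P * (gamma - M)                   # used features are down-weighted
    masks.append(M)

print(masks[0])                           # step 1 selects only feature 0
print(np.argmax(masks[1]))                # step 2 shifts attention to feature 1
print(sparsity_loss(np.stack(masks)[:, None, :]))
```

The trace shows the mechanism: because the prior scale shrinks toward $\gamma - 1$ for a used feature, the second step's attention moves to a feature not yet consumed.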

With the masks $\mathbf{M}[i]$ obtained from the attentive transformer, the following steps are used for feature processing.
(1) The feature transformer in Figure 2 processes the filtered features, producing both the decision step output and the information passed to subsequent steps:

$$[\mathbf{d}[i], \mathbf{a}[i]] = f_i\big(\mathbf{M}[i] \odot \mathbf{f}\big),$$

where $\mathbf{f}$ denotes the input features and the output is split into $\mathbf{d}[i]$ and $\mathbf{a}[i]$; (2) for efficient learning with high capacity, the feature transformer comprises layers shared across decision steps, so that the same features can be input to different decision steps, and decision step-dependent layers, in which the features of the current decision step depend upon the output of the previous one; (3) as shown in Figure 2, the feature transformer consists of the concatenation of two shared layers and two decision step-dependent layers, in which each fully connected (FC) layer is followed by batch normalization (BN) and a gated linear unit (GLU) [58]; normalization with $\sqrt{0.5}$ is also applied to the residual connections to ensure stabilized learning throughout the network [59]; (4) all BN operations, except the one applied to the input features, are implemented as ghost BN [60], which processes only part of the samples rather than the entire batch at one time to reduce computational cost; this improves performance by using a virtual (small) batch size $B_v$ and momentum $m_B$ instead of the entire batch.
Moreover, decision tree-like aggregation is implemented by constructing the overall decision embedding as

$$\mathbf{d}_{\mathrm{out}} = \sum_{i=1}^{N_{\mathrm{steps}}} \mathrm{ReLU}\big(\mathbf{d}[i]\big),$$

where $N_{\mathrm{steps}}$ represents the number of decision steps.
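The GLU sub-block with its scaled residual and the ReLU aggregation of per-step decision outputs can be sketched in NumPy. This is a minimal sketch: batch normalization is omitted and the weights are random placeholders, not trained parameters.

```python
import numpy as np

def glu(x):
    """Gated linear unit: split the features in half and gate one half
    by the sigmoid of the other, halving the width."""
    a, b = np.split(x, 2, axis=-1)
    return a / (1.0 + np.exp(-b))

def glu_block(x, W, bias):
    """FC -> GLU with the residual connection normalized by sqrt(0.5)
    for stabilized learning (ghost BN omitted for brevity)."""
    h = glu(x @ W + bias)                  # W maps width d -> 2d; GLU maps back to d
    return (h + x) * np.sqrt(0.5)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                # batch of 4, feature width 8
W = rng.normal(size=(8, 16)) * 0.1
out = glu_block(x, W, np.zeros(16))
print(out.shape)                           # width is preserved: (4, 8)

# Tree-like aggregation of the per-step decision outputs:
# d_out = sum_i ReLU(d[i]); non-positive step outputs contribute nothing.
d_steps = [np.array([0.4, -0.2]), np.array([-0.1, 0.7]), np.array([0.3, 0.1])]
d_out = np.sum([np.maximum(d, 0.0) for d in d_steps], axis=0)
print(d_out)                               # [0.7 0.8]
```

The $\sqrt{0.5}$ factor keeps the variance of the sum of two same-variance branches roughly equal to that of either branch, which is why it stabilizes deep stacks of residual GLU layers.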

TabNet with Unsupervised Pretraining
To include unsupervised pretraining in TabNet (uTabNet), a decoder architecture is incorporated [39,40]. As shown in Figure 3, the decoder is composed of a feature transformer and FC layers at each decision step, reconstructing the features by combining the outputs. Missing feature columns can be predicted from the other feature columns.
Here $\mathbf{S} \in \{0, 1\}^{B \times D}$ is a binary mask, and r is the pretraining ratio of features randomly discarded for reconstruction, i.e., r is the ratio of ones inside the binary mask $\mathbf{S}$. The prior scale term in the encoder is initialized as $\mathbf{P}[0] = (1 - \mathbf{S})$ so that the model focuses on the known features, and the last FC layer of the decoder is multiplied by $\mathbf{S}$ to output only the unknown features. The reconstruction loss ($L_{\mathrm{rec}}$), used in an unsupervised manner without label information, is formed as:

$$L_{\mathrm{rec}} = \sum_{b=1}^{B} \sum_{j=1}^{D} \left| \frac{\big(\hat{\mathbf{X}}_{b,j} - \mathbf{X}_{b,j}\big) \cdot \mathbf{S}_{b,j}}{\sqrt{\sum_{b=1}^{B} \big(\mathbf{X}_{b,j} - \tfrac{1}{B}\sum_{b=1}^{B} \mathbf{X}_{b,j}\big)^2}} \right|^2, \quad (8)$$

where $\hat{\mathbf{X}}_{b,j}$ represents the reconstructed output and the per-feature normalization accounts for features with different ranges.
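The masking and reconstruction objective can be sketched as below. This is a sketch under the assumption that the residual is normalized by each feature's standard deviation; the helper names are hypothetical.

```python
import numpy as np

def pretrain_masks(shape, r, rng):
    """Draw a binary mask S with masking ratio r: S[b,j] = 1 marks a
    feature entry to be discarded and reconstructed by the decoder."""
    return (rng.random(shape) < r).astype(float)

def reconstruction_loss(X_hat, X, S):
    """Reconstruction residual over masked entries only, with errors
    scaled by each feature's standard deviation so that features with
    different ranges contribute comparably."""
    std = X.std(axis=0) + 1e-12          # guard against constant features
    return np.sum(((X_hat - X) * S / std) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))
S = pretrain_masks(X.shape, r=0.2, rng=rng)
print(reconstruction_loss(X, X, S))      # perfect reconstruction -> 0.0
```

Because the loss is gated by S, the model is graded only on the entries it was asked to impute; the known entries carry no gradient.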

TabNet with Spatial Attention (TabNets)
The attentive transformer generates the masks for soft feature selection. Spatial information is incorporated by including a 2D CNN inside the attentive transformer, resulting in TabNet with spatial attention (TabNets), as shown in Figure 4. The output feature maps of each layer in TabNets are listed in Table 2. In a CNN, 2D kernels convolve the input data by computing the sum of products between the kernel and the input; the kernel is strided over the input to cover the entire spatial area, and nonlinearity is introduced by applying an activation function to the convolved features. The value after activation at spatial position (x, y) in the l-th feature map of the k-th layer can be expressed as

$$v_{k,l}^{x,y} = \psi\Big( \sum_{m} \sum_{p} \sum_{q} w_{k,l,m}^{p,q}\, v_{k-1,m}^{x+p,\, y+q} + e_{k,l} \Big),$$

where $\psi$ represents the activation function, $w_{k,l,m}^{p,q}$ is the kernel weight connected to the m-th feature map of the previous layer, and $e_{k,l}$ is the bias parameter. First, the 3D patch input of size $T \times P \times P$, with T reduced channels from principal component analysis (PCA) and patch size $P \times P$, is converted to a 1D input vector. For instance, in the Indian Pines data, the 3D input of size 10 × 25 × 25 becomes a 6250 × 1 vector. The feature size from each layer in the encoder is shown in the second part of Table 2: (1) the first BN generates a 6250 × 1 vector; (2) the first feature transformer layer before Step 1 converts it into a feature vector of size 512. For the spatial attention inside the attentive transformer, the feature maps of the different layers are shown in the first part of Table 2.
(1) The output of the entmax in Equation (1) is reshaped to 10 × 25 × 25 as input to the first 2D convolution layer; with a kernel size of 3 × 3 and stride 3, this layer produces a 16 × 8 × 8 output; (2) the second convolution layer generates an output of size 32 × 6 × 6 with a kernel size of 3 × 3 and stride 1; (3) the third convolution layer generates an output of size 64 × 4 × 4 with a kernel size of 3 × 3 and stride 1; (4) the flatten layer provides an output of size 1024 × 1; (5) finally, the FC layer generates an output of size 6250 × 1 that is provided as input to the prior scales for updating the abstract features generated by the FC and BN layers inside the attentive transformer.
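The layer shapes above follow from standard "valid" (unpadded) convolution arithmetic and can be checked directly; this is only a sanity check of the Table 2 shapes, not the network itself.

```python
def conv2d_out(size, kernel, stride):
    """Spatial output size of a 'valid' 2D convolution (no padding)."""
    return (size - kernel) // stride + 1

# Trace the spatial-attention branch for a 10 x 25 x 25 Indian Pines patch.
assert 10 * 25 * 25 == 6250     # flattened input vector length
s = conv2d_out(25, 3, 3)        # conv1: 25 -> 8, 16 channels => 16 x 8 x 8
s = conv2d_out(s, 3, 1)         # conv2:  8 -> 6, 32 channels => 32 x 6 x 6
s = conv2d_out(s, 3, 1)         # conv3:  6 -> 4, 64 channels => 64 x 4 x 4
flat = 64 * s * s               # flatten layer
print(s, flat)                  # 4 1024
```

The final FC layer then maps the 1024-dimensional vector back to 6250 so it can gate the prior scales element-wise.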
In addition, TabNets with unsupervised pretraining (uTabNets) can be obtained by using steps of unsupervised pretraining and Equation (8) on TabNets.

Structure Profile on TabNet (sTabNet)
By applying spatial feature extraction with the structure profile (SP) [48] in the preprocessing stage, the performance of TabNet can be enhanced, yielding TabNet with structure profile (sTabNet).

Spatial feature extraction with structure profile:
First, the original input image is divided into M subsets. The structure profile S is extracted from the input image X using an adaptive texture smoothing model in which a free parameter $\lambda$ balances data fidelity against smoothness and a weight w controls the similarity of adjacent pixels. For smoothing, a local polynomial approximation is implemented for each pixel $x \in \Omega$: the optimization fits a polynomial of degree at most L, with w deciding the contribution of the neighboring pixels $X(x_i)$ toward the construction of the polynomial. The weight is computed by comparing small regions $Y(\cdot)$ around $x_i$ and $x$, with the scale parameter $h_0$ set to 1 and $G_\sigma$ a Gaussian function with standard deviation $\sigma$. The resulting objective can be solved with the Bregman iteration algorithm [61], in which the auxiliary variable is updated by the soft-thresholding (shrinkage) operator

$$\mathrm{shrink}(x, \tau) = \mathrm{sign}(x)\,\max\big(|x| - \tau,\, 0\big).$$

These update steps are iterated until convergence; the aforementioned TabNet classifier is then applied to the extracted SPs to obtain the classification results for sTabNet.
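The soft-thresholding step used inside the Bregman iteration is a standard shrinkage operator and can be sketched as below; the surrounding SP optimization of [48] is not reproduced here.

```python
import numpy as np

def shrink(x, tau):
    """Soft-thresholding (shrinkage) operator: values within [-tau, tau]
    are set to zero; all others are moved toward zero by tau."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

print(shrink(np.array([-2.0, -0.3, 0.1, 1.5]), 0.5))
```

Shrinkage is what makes each Bregman sub-problem closed-form: it is the exact minimizer of an L1 penalty plus a quadratic coupling term.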

Structure Profile on Unsupervised Pretrained TabNet (suTabNet)
After applying SP in feature extraction before uTabNet, the performance of the TabNet with unsupervised pretraining with SP feature extraction (suTabNet) can be obtained. Similarly, SP feature extraction can be applied to TabNets and uTabNets to obtain their SP-extracted versions sTabNets and suTabNets, respectively, and on other comparative methods for a fair comparison.

Datasets
Three different datasets were used to validate the proposed methods. The first is the Indian Pines dataset, collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor. It consists of 16 classes with a spatial size of 145 × 145 pixels and 220 spectral bands (200 after noise removal); the water-absorption bands 104-108, 150-163, and 220 were removed. The spectral wavelength ranges from 0.4 to 2.5 μm. Ten percent of the samples from each class were used for training and the remainder for testing. The number of training and testing samples for each class is listed in Table 3.
The second dataset is the University of Pavia dataset, acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor in Italy. It has a spatial size of 610 × 340 pixels and a total of 103 spectral bands after noisy-band removal, covering the spectral range from 0.43 to 0.86 μm. Nine classes exist in this dataset; 200 samples were taken from each class for training and the remaining were used for testing. Table 4 shows the number of training and testing samples for each class.
The third dataset is the Salinas dataset, collected with the AVIRIS sensor over Salinas Valley, California. It comprises a spatial size of 512 × 217 pixels with 224 bands (204 after band removal); the water-absorption bands 108-112, 154-167, and 224 were removed. It has a spatial resolution of 3.7 m per pixel and 16 classes. For training, 200 samples were taken from each class and the remaining were used for testing. Table 5 shows the number of training and testing samples in the different classes.
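The per-class sampling scheme used for all three datasets can be sketched as follows; this is an illustrative sketch, and the function name and toy labels are hypothetical.

```python
import numpy as np

def per_class_split(labels, n_train, rng):
    """Split sample indices per class: n_train may be a fraction
    (e.g., 0.1 for Indian Pines) or a fixed count per class
    (e.g., 200 for University of Pavia and Salinas)."""
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        k = int(round(n_train * len(idx))) if n_train < 1 else int(n_train)
        train_idx.extend(idx[:k])
        test_idx.extend(idx[k:])
    return np.array(train_idx), np.array(test_idx)

rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], [50, 30, 20])   # toy ground truth with 3 classes
tr, te = per_class_split(labels, 0.1, rng)
print(len(tr), len(te))                       # 10 90
```

Splitting per class rather than globally keeps rare classes represented in the training set, which matters for the smaller Indian Pines classes.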

Experimental Setup
For all methods in comparison, such as RF, MLP, LightGBM, CatBoost, XGBoost, and CAE, parameters were set according to [28-32,35,41,53,54]. For our proposed methods, the Adam optimizer was used to estimate the optimal parameters. In all three datasets, 10% of the training samples were allocated for validation and the remaining 90% for learning the optimal weights while tuning the hyperparameters of the network. The performance of TabNet, uTabNet, TabNets, uTabNets, and their SP-extracted versions sTabNet, suTabNet, sTabNets, and suTabNets was investigated over a predefined set of parameters. In particular, the spatial window size was varied to incorporate more spatial information; however, choosing too large a window may add redundancy due to interclass variation among neighboring pixels. As shown in Table 7, a window of 25 × 25 was found most suitable for all datasets: an input of 10 × 25 × 25 was used for the Indian Pines and Salinas data, and 7 × 25 × 25 for the University of Pavia data.
In Figures 5-7, the classification maps of the three datasets are consistent with the results in Tables 8-13. Figure 5 shows the classification maps for Indian Pines, with the original image and ground truth in Figure 5a,b. In these maps, labeled pixels are shown; sTabNet outperforms TabNet and the SP versions of the other techniques, and suTabNet outperforms uTabNet and sTabNet. The proposed TabNets shows less noise in the areas of Soybean-notill and Woods, and uTabNets shows less noise in the region of Woods. Moreover, their SP-extracted versions sTabNets and suTabNets show less noise in the areas of Soybean-mintill and Woods, respectively. Figure 6 shows the classification maps for the University of Pavia; the maps from the proposed TabNets and uTabNets are smoother in the regions of Bare soil and Meadows, respectively, and their SP-extracted versions sTabNets and suTabNets likewise produce smoother areas of Bare soil and Meadows. Figure 7 shows the classification maps of the different methods on the Salinas dataset; the maps from the proposed TabNets and uTabNets are less noisy in the regions of Corn-senesced-green-weeds and Grapes-untrained, and the maps from their SP-extracted versions sTabNets and suTabNets contain less noise in the areas of Grapes-untrained and Vinyard-untrained. For the proposed methods, McNemar's test statistic |z| is larger than 1.96 or 2.58, which represents a statistically significant difference at the 95% or 99% confidence level, respectively. The comparison among TabNet, uTabNet, TabNets, uTabNets, sTabNet, suTabNet, sTabNets, suTabNets, and the other classifiers indicates the superiority of the proposed methods over their counterparts.
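The McNemar significance test mentioned above can be computed from the disagreement counts of two classifiers; this is a sketch, and the counts below are hypothetical, not taken from the experiments.

```python
import numpy as np

def mcnemar_z(f12, f21):
    """Standardized McNemar statistic, where f12 counts samples correctly
    classified by method 1 but not by method 2, and f21 the reverse."""
    return (f12 - f21) / np.sqrt(f12 + f21)

z = mcnemar_z(f12=58, f21=25)    # hypothetical disagreement counts
print(abs(z) > 2.58)             # significant at the 99% confidence level?
```

Only the disagreements enter the statistic: samples both methods classify identically carry no evidence about which method is better.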
To estimate the computational complexity of the proposed algorithms, the execution times of the different algorithms on the three hyperspectral datasets are listed in Table 15. All experiments were run using an NVIDIA Tesla K80 GPU and MATLAB on an Intel(R) Core(TM) i7-4770 CPU with 16 GB of memory. TabNet has higher computational complexity than the other tree-based methods, which may be due to the sequential attention involved in tabular learning. The unsupervised pretraining version of TabNet (uTabNet) has higher complexity than TabNet because of the pretraining operation. Additionally, the proposed TabNets and its unsupervised pretraining version uTabNets show slightly higher complexity than TabNet and uTabNet because of the convolution layers in the attentive transformer for spatial processing of the masks. Moreover, the SP-extracted versions sTabNet, suTabNet, sTabNets, and suTabNets are slightly costlier than their counterparts due to the SP extraction.

Conclusions
In this work, we proposed a TabNets network that uses spatial attention to enhance the performance of the original TabNet for HSI classification by including a 2D CNN in the attentive transformer. Moreover, unsupervised pretraining on TabNets (uTabNets) was introduced, which can outperform TabNets. SP-extracted versions of TabNet, uTabNet, TabNets, and uTabNets were also developed to further utilize spatial information. The experimental results obtained on different hyperspectral datasets illustrate the superiority of the proposed TabNets and uTabNets and their SP versions in terms of classification accuracy over other techniques, such as RF, MLP, LightGBM, CatBoost, and XGBoost, and their SP versions. However, the proposed networks have slightly higher complexity for network optimization. In future work, more spatial and spectral information will be incorporated into TabNet to enhance classification performance at reduced computational cost, and the performance of the enhanced TabNet on hyperspectral anomaly detection will be investigated. The approach also has potential for similar classification and feature extraction problems in high-resolution thermal or remote sensing images.