Enhanced TabNet: Attentive Interpretable Tabular Learning for Hyperspectral Image Classification

Shah, Chiranjibi; Du, Qian; Xu, Yan

doi:10.3390/rs14030716

Open AccessArticle

Enhanced TabNet: Attentive Interpretable Tabular Learning for Hyperspectral Image Classification

by

Chiranjibi Shah

^1,*

,

Qian Du

¹ and

Yan Xu

²

¹

Department of Electrical and Computer Engineering, Mississippi State University, Starkville, MS 39762, USA

²

Cotiviti Inc., South Jordan, UT 84095, USA

^*

Author to whom correspondence should be addressed.

Remote Sens. 2022, 14(3), 716; https://doi.org/10.3390/rs14030716

Submission received: 5 December 2021 / Revised: 27 January 2022 / Accepted: 29 January 2022 / Published: 3 February 2022

Download

Browse Figures

Versions Notes

Abstract

:

Tree-based methods and deep neural networks (DNNs) have drawn much attention in the classification of images. Interpretable canonical deep tabular data learning architecture (TabNet) that combines the concept of tree-based techniques and DNNs can be used for hyperspectral image classification. Sequential attention is used in such architecture for choosing appropriate salient features at each decision step, which enables interpretability and efficient learning to increase learning capacity. In this paper, TabNet with spatial attention (TabNets) is proposed to include spatial information, in which a 2D convolution neural network (CNN) is incorporated inside an attentive transformer for spatial soft feature selection. In addition, spatial information is exploited by feature extraction in a pre-processing stage, where an adaptive texture smoothing method is used to construct a structure profile (SP), and the extracted SP is fed into TabNet (sTabNet) to further enhance performance. Moreover, the performance of TabNet-class approaches can be improved by introducing unsupervised pretraining. Overall accuracy for the unsupervised pretrained version of the proposed TabNets, i.e., uTabNets, can be improved from 11.29% to 12.61%, 3.6% to 7.67%, and 5.97% to 8.01% in comparison to other classification techniques, at the cost of increases in computational complexity by factors of 1.96 to 2.52, 2.03 to 3.45, and 2.67 to 5.52, respectively. Experimental results obtained on different hyperspectral datasets demonstrated the superiority of the proposed approaches in comparison with other state-of-the-art techniques including DNNs and decision tree variants.

Keywords:

hyperspectral imagery; classification; spatial attention; structure profile; tabular learning

1. Introduction

Hyperspectral imagery (HSI) consists of abundant spatial and spectral information in a 3D data cube with hundreds of narrow spectral bands. Due to high spectral resolution, it has been applied in many applications, such as pollution monitoring, urban planning, analysis for land use, and land cover [1,2,3,4]. However, an increase in spatial and spectral information poses a challenge in HSI analysis. Thus, analysis of HSI, such as classification, dimensionality reduction [1,5], and feature extraction [6,7], has obtained much attention among the remote sensing community for decades [8]. Moreover, such approaches can be applicable towards vision technology applications in other engineering domains [9,10,11], multispectral remote sensing, and synthetic aperture radar (SAR) imagery [12,13].

In the last decades, spectral-based classification approaches such as support vector machine (SVM) and composite kernel SVM (SVM-CK) have been widely used in remote sensing [14,15,16]. In addition, different spatial-spectral features have been introduced for HSI classification [17,18]. Sparse representation (SR) for HSI classification was successfully applied in [19], inspired by the successful application of sparse representation in face recognition [20]. Consequently, many sparse and collaborative representation-based classifiers have been introduced, such as the joint sparse representation classifier (JSRC) [21], joint version of spatial-aware collaborative-competition preserving graph embedding with Tikhonov regularization (JSaCCPGT) [22], nonlocal weighted JSRC (NLW-JSRC) [23], and correntropy-based robust JSRC (RJSRC) [24]. Furthermore, multiple morphological operations were utilized in [25] for constructing spatial-spectral features of HSI, and a spatial-spectral classifier was proposed in [26] for addressing the issue of mixed pixel characterization. Multiple kernel learning has also been designed in [27] to improve the SVM classifier.

Moreover, tree-based techniques, such as the random forest method, were introduced in [28]. More recently, enhanced performance of the random forest classifier is presented in [29,30] for HSI classification. Similarly, the performance of extreme gradient boosting (XGBoost) was investigated in [31,32] for HSI. Tree-based approaches have the advantages of efficient representation for decision manifolds with approximate hyperplane boundaries, being interpretable by tracking the decision nodes, and being fast to train. A deep neural network (DNN) based on multiscale spectral-spatial fusion was proposed for HSI classification in [33,34]. However, classification performance will decrease with a deeper network because input for such architecture is one-dimensional and it lacks neighborhood information in the spatial dimension. Moreover, a convolutional deep neural network based on stacked convolutional layers or multilayer perceptron (MLPs) fails to find an optimal solution for decision manifolds in the spectral domain due to lack of appropriate inductive bias [35]. In addition, convolutional neural networks (CNNs) have drawn much attention to classification of image [36], and patch-to-patch CNN was presented in [37] to obtain better performance than existing techniques. However, CNNs have the shortcoming of not considering spectral information effectively.

When a DNN is used for large datasets, classification performance can be improved because it enables gradient descent-based end-to-end learning. Tree learning lacks the use of backpropagation in its inputs for guidance from error signals [38], thus limiting its performance for large datasets. TabNet, a new canonical deep neural architecture for tabular data, was proposed in [39,40]. It can combine the valuable benefits of tree-based methods with DNN-based methods to obtain high performance and interpretability. The high performance of DNNs can be made more interpretable by substituting them with tree-based methods. Inspired by this work, we propose to use TabNet for HSI classification in this paper, as spectral signatures of pixels in HSI are organized as a tabular dataset. One of the aims of this paper is to overcome the deficiencies of existing neural networks and decision trees in HSI classification. In this regard, we explore TabNet and modify its original architecture for HSI. The original TabNet takes raw data without any feature processing and is trained with a gradient descent-based method. Moreover, at each decision step, it uses sequential attention. It enables local interpretability that determines the combination and importance of input features, and global interpretability that measures the contribution of each input feature to the trained model. However, sequential attention-based TabNet has some drawbacks as well. Although TabNet can provide good performance to analyze spectral signatures of HSI, it lacks proper use of local contextual information in the spatial domain. For this reason, we have modified the original architecture of TabNet by incorporating spatial information in an attentive transformer called TabNet with spatial attention (TabNets). Specifically, a 2D convolution neural network (CNN) is used in the attentive transformer to spatially process the masks that contribute to soft feature selection of the abstract features. TabNets can overcome the deficiency of CNNs by considering spatial information with a sequential attention.

Recently, different integrated networks, such as stacked auto encoders (SAE) and convolution autoencoders (CAE), were presented in [30,41] for feature extraction. However, such methods lack the powerful capability of feature extraction in the spatial and spectral domains.

In this work, we observed enhanced performance of unsupervised pretraining on TabNet (uTabNet) for HSI classification, and pretraining was extended to TabNets, resulting in uTabNets. The unsupervised pretrained version of TabNets, i.e., uTabNets, can consider sequential attention in addition to spatial processing of masks by using 2D CNN in the attentive transformer.

Moreover, the existing TabNet does not include any preprocessing stage, weakening its ability to learn in a better way. Certainly, including spatial information in a spectral classifier has led to increased classification accuracy. Many deep learning classifiers, such as recurrent neural networks (RNN) [42] and generative adversarial network (GAN) [43], use CNN for deep feature extraction with several convolutional and pooling layers [44,45]. However, most deep learning methods need massive training to accurately learn of parameters. To deal with such issues, various classification frameworks, such as active learning [46] and ensemble learning [47], are introduced. In addition, spatial optimization using structure profile (SP) is introduced in [48] for feature extraction purposes. In this paper, we incorporate SP in the TabNet with structure profile (sTabNet). Similarly, SP is used in extended versions of TabNet, including uTabNet with SP (suTabNet), TabNets with SP (sTabNets), and uTabNets with SP (suTabNets).

The main contribution of this work can be summarized as follows:

It introduces TabNet for HSI classification and improves classification performance by applying unsupervised pretraining in uTabNet;
It develops TabNets and uTabNets after including spatial information in the attentive transformer;
It includes SP in sTabNet as a feature extraction to further improve the classification performance of SP versions of TabNet, i.e., suTabNet, sTabNets, and suTabNets.

The remainder of this article is organized as follows. Section 2 presents related work. Section 3 discusses the proposed TabNet versions for hyperspectral image classification. Section 4 shows experimental results along with a discussion. Section 5 summarizes the article conclusively.

2. Related Work

Features should be picked wisely for meaningful prediction in machine learning. Global feature selection methods are techniques of selecting appropriate features based on the entire training dataset. Forward selection and LASSO regularization are broadly used global feature selection techniques [49]. Forward selection uses an iterative approach in a step-by-step fashion to select appropriate features from each iteration, and Lasso regularization can allocate zero weights for irrelevant features in a linear model. As stated in [50], instance-wise feature selection can be used to select individual features for each input and explainer model to maximize the mutual information between the response variable and the selected features. Moreover, the actor–critic framework can be used to mimic a baseline by optimizing the feature selection [51]. Using the actor–critic framework, reward can be generated by the predicting network for the selecting network. However, TabNet can be used for soft feature selection by controlling the sparsity that can perform feature selection and output mapping, and can provide better representations of features to enhance performance.

2.1. Tree Based Learning

Tree-based methods are well suited for tabular data learning, as they can provide statistical information gains by picking global features [52]. Ensembling can be done to enhance the performance of tree-based models, such that random forests (RF) can use random subsets of data with randomly selected features to grow many trees [28,30]. Furthermore, CatBoost [53], XGBoost [31,32], and LightGBM [54] are recent ensemble decision tree approaches that can provide better performance for classification. Deep learning can be implemented by using the feature selecting capability to provide better performance than tree-based techniques.

2.2. Attentive Interpretable Tabular Learning (TabNet)

TabNet is based on tree-like functionality, as it can be used for the linear combination of features by determining the coefficients for the contribution of features in the decision process. It uses sparse instance-wise feature selection that can be learned in a training dataset, and it constructs a sequential multi-step architecture such that the portion of a decision can be determined at each decision step by using the selected features. Furthermore, features are nonlinearly processed. In an advanced task, such as HSI classification or anomaly detection, intrinsic spectral features need to be considered in detail to avoid the problems of non-identical spectra from the same materials or similar spectra from different materials [55]. Conventional DNNs, such as multi-layer perceptron (MLP) or stacked convolutional layers, lack the proper mechanisms to select soft features. TabNet should be implemented in comparison to conventional DNN-based approaches because TabNet has powerful soft feature selection capability, in addition to controlling the sparsity with sequential attention.

3. Proposed Method

The different variants of enhanced TabNet classifiers proposed in this work are summarized in Table 1.

3.1. TabNet for Hyperspectral Image Classification

Suppose that a hyperspectral dataset with

d

spectral bands contains

M

labeled samples for C classes, and each is represented by

X = {x_{1}, x_{2}, \dots, x_{M}} \in R^{M \times d}

and the corresponding label vector is

Y = {y_{1}, y_{2}, \dots, y_{C}} \in R^{M \times C}

. As shown in Figure 1, spectral features are used as inputs to TabNet. Suppose the training data

X

is passed to the initial decision step with batch size

B

. Then, the feature selection process includes the following steps:

(1): The “split” module separates the output of the initial feature transformer to obtain features $a [i - 1]$ in Step 1 when i = 1;
(2): If we disregard the spatial information in the attentive transformer of TabNets shown in Figure 4 below, it becomes the attentive transformer for TabNet. It uses a trainable function $h_{i}$ , consisting of a fully connected (FC) and batch normalization (BN) layer to generate features with high dimensions;
(3): In each step, interpretable information is provided by masks for selecting features, and global interpretability can be attained by aggregating the masks from different decision steps. This process can enhance the discriminative ability in the spectral domain by implementing local and global interpretability for HSI feature selection.

The attentive transformer then generates masks

M [i] \in R^{B \times d}

as a soft selection of salient features with the use of processed features

a [i - 1]

from the previous step as:

M [i] = e n t m a x (P [i - 1] \cdot h_{i} (a [i - 1]))

(1)

Entmax normalization [56] inherits the desirable sparsity of sparsemax and can provide smoother, and differentiable curvature, whereas sparsemax is piecewise linear denoted as

s p a r s e m a x (P [i - 1] \cdot h_{i} (a [i - 1]))

. Here,

P [i]

is the prior scale term that denotes how much a particular feature has been used previously:

P [i - 1] = \prod_{j = 1}^{i} (γ - M [i])

(2)

where

γ

is a relaxation parameter such that a feature is used at one decision step when

γ = 1

and features can be used in multiple decisions steps when

γ

increases. For input attention

z = P [i - 1] \cdot h_{i} (a [i - 1])

, its sparsemax output can be estimated as:

sparsemax (z) = \underset{p \in Δ^{D}}{\arg \min} | | p - z | |^{2}

(3)

where

Δ^{D}

represents the probability distribution and

sparsemax (z)

provides zero ability to choices with low scores.

However, entmax normalization provides continuous probability distribution. estimating better distributions in comparison to sparsemax normalization, which can be stated as:

entmax (z) = \underset{p \in Δ^{D}}{\arg \max} p^{T} z + F_{υ}^{T} (p)

(4)

where

F_{υ}^{T} (p)

is a continuous function denoted as

F_{υ}^{T} (p) = \{\begin{matrix} \frac{1}{υ (υ - 1)} \sum_{n} (p^{n} - p_{n}^{υ}) \\ - \sum_{n} p_{n} \log p_{n}, υ = 1 \end{matrix}, υ \neq 1

;

(4): The sparsity regularization term can be used in the form of entropy [57] for controlling the sparsity of selected features.

L_{s p a r s e} = \sum_{i = 1}^{N_{s t e p s}} \sum_{b = 1}^{B} \sum_{j = 1}^{d} \frac{- M_{b, j} [i]}{N_{s t e p s} \cdot B} \log (M_{b, j} [i] + \in)

(5)

where

\in

takes a small value for numerical stability. Sparsity regularization

λ_{s p a r s e}

is also added to the overall loss as

λ_{s p a r s e} \times L_{s p a r s e}

, which can provide favorable bias for convergence to high accuracy for datasets with redundant features;

(5): A sequential multi-step decision process with $N_{s t e p s}$ is used in TabNet’s encoding. The processed information from ${(i - 1)}^{t h}$ step is passed to the $i^{t h}$ step to decide which features to use. The outputs are obtained by aggregating the processed feature representation in the overall decision function as shown by feature attributes in Figure 1.

With the masks

M [i]

obtained from the attentive transformer, the following steps are used for feature processing.

(1): The feature transformer in Figure 2 is used to process the filtered features, which can be used in the decision step output and information for subsequent steps:

[d [i], a [i]] = X_{i} (M [i] \cdot X)

(6)

where the

[,]

operator denotes splitting of

d [i] = R^{B \times N_{d}}

and

a [i] = R^{B \times N_{a}}

, with

N_{d}

being the width of the prediction layer for the decision and

N_{a}

being the width of the attention layer for the masks;

(2): For efficient learning with high capacity, the feature transformer is comprised of layers that are shared across decision steps such that the same features can be input for different decision steps, and decision step-dependent layers in which features in the current decision step depend upon the output from the previous decision step;
(3): In Figure 2, it can be observed that the feature transformer consists of the concatenation of two shared layers and two decision step-dependent layers, in which each fully connected (FC) layer is followed by batch normalization (BN) and a gated linear unit (GLU) [58]. Normalization with $\sqrt{0.5}$ is also used for ensuring stabilized learning throughout the network [59];
(4): All BN operations, except applied at input features, are implemented in ghost BN [60] by selecting only part of samples rather than using an entire batch at one time to reduce the cost of computation. This improves performance by using the virtual or small batch size $B_{v}$ and momentum $m_{B}$ instead of using the entire batch. Moreover, decision tree-like aggregation is implemented by constructing overall decision embedding as:

$d_{o u t} = \sum_{i = 1}^{N_{s t e p s}} LeakyReLU (d [i])$

(7)

where $N_{s t e p s}$ represents the number of decision steps;
(5): The linear mapping $W_{f i n a l} d_{o u t}$ is applied for output mapping and softmax is employed during training for discrete outputs.

3.2. TabNet with Unsupervised Pretraining

To include unsupervised pretraining in TabNet (uTabNet), a decoder architecture is incorporated [39,40]. As shown in Figure 3, the decoder is composed of a feature transformer, and FC layers at each decision step to reconstruct features by combining the outputs. Missing columns of features can be predicted using other feature columns. Suppose

S \in {0, 1}^{B \times d}

is a binary mask and r is the pretraining ratio of features to randomly discard for reconstruction such that the variable r represents the ratio of masking inside the binary mask

S

. The term in the encoder is initialized as

P [0] = (1 - S)

such that the model focuses on the known features, and the last FC layer of the decoder is a result of the product of

S

and unknown output features. For this purpose, the reconstruction residual (

L_{r e c}

) used in an unsupervised manner without label information is formed as:

L_{r e c} = {\sum_{i = 1}^{B} \sum_{j}^{d} |\frac{({\hat{X}}_{i, j} - X_{i, j}) S_{i, j}}{\sqrt{\sum_{i = 1}^{B} {(X_{i, j} - 1 / B \sum_{i = 1}^{B} X_{i, j})}^{2}}}|}^{2}

(8)

where

{\hat{X}}_{i, j}

represents the reconstructed output and

X_{i, j}

denotes the original input.

3.3. TabNet with Spatial Attention (TabNets)

The generated masks

M [i]

in Equation (1) are used in Equation (2) to update the prior

P [i]

in the attentive transformer for soft feature selection. Spatial information is incorporated by including a 2D CNN inside the attentive transformer, resulting in TabNet with spatial attention (TabNets), as shown in Figure 4. The output feature maps of each layer in TabNets are shown in Table 2.

In CNN, 2D kernels are used for convolving the input data after calculating the sum of the product of the kernel and the input data. To cover the total spatial area, the kernel is strided on the input data. Nonlinearity is introduced with an activation function on the convolved features. The value after activation

A_{k, l}^{u, v}

at spatial position (

u, v

) for the k-th layer with the l-th feature map can be expressed as:

A_{k, l}^{u, v} = ψ (e_{k, l} + \sum_{δ = 1}^{o_{m - 1}} \sum_{θ = - τ}^{τ} \sum_{β = - Φ}^{Φ} f_{k, l, δ}^{β, θ} \times A_{k - 1, l}^{u + β, v + θ})

(9)

where

ψ

represents the function of activation, with

e_{k, l}

being the bias parameter.

o_{m - 1}

denotes the number of feature maps in the

(m - 1)

th layer with the depth of the kernel

f_{k, l}

at the k-th layer for the l-th feature map.

2 τ + 1

represents the width of the kernel and

2 Φ + 1

denotes the height of the kernel with weight parameter

f_{k, l}

.

First of all, the 3D patch input

T \times P \times P

for the T reduced channels from principal component analysis (PCA) and patch size

P \times P

is converted to a 1D input vector. For instance, in the Indian Pines data, the 3D input of size

10 \times 25 \times 25

becomes a 6250 × 1 vector. The feature size from each layer in the encoder is shown in the second part of Table 2:

(1): The first BN generates a 6250 × 1 vector;
(2): It is converted by the first feature transformer layer before Step 1 into a feature vector of size $N_{d} + N_{a} = 512$ ;
(3): The Split layer divides it into two parts and provides a feature of size $N_{a} = 256$ for the attentive transformer;
(4): The Attentive transformer layer generates output masks for the 6250 × 1 feature;
(5): The Mask layer in Step 1 generates the multiplicative output $M [i] \cdot X$ to the feature transformer layer with the 6250 × 1 feature;
(6): The feature transformer generates the feature of size $N_{d} + N_{a} = 512$ , which is separated into two parts: $N_{d} = 256$ in LeakyReLu and $N_{a} = 256$ for the attentive transformer in Step 1;
(7): The output of each decision step is then concatenated in the TabNets encoder and converted to a feature map with 16 classes by the FC layer.

For spatial attention inside an attentive transformer, a feature map of different layers is shown in the first part of Table 2.

(1): The output of $e n t m a x$ from Equation (1) is reshaped to $10 \times 25 \times 25$ as input to the first 2D convolution layer. For a kernel size of $3 \times 3$ and stride = 3, the first 2D convolution layer provides a $16 \times 8 \times 8$ output;
(2): The second convolution layer generates an output of size $32 \times 6 \times 6$ with a kernel size of $3 \times 3$ and stride = 1;
(3): The third convolutional layer generates an output shape of $64 \times 4 \times 4$ with a kernel size of $3 \times 3$ and stride = 1;
(4): The flatten layer provides an output of size 1024 × 1;
(5): Finally, the FC layer generates an output of size 6250 × 1 that is provided as input to the prior scales for updating the abstract features generated by the FC and BN layers inside the attentive transformer.

In addition, TabNets with unsupervised pretraining (uTabNets) can be obtained by using steps of unsupervised pretraining and Equation (8) on TabNets.

3.4. Structure Profile on TabNet (sTabNet)

By using spatial feature extraction with structure profile (SP) [48] in the preprocessing stage, the performance of TabNet can be enhanced by using the TabNet with structure profile (sTabNet).

Spatial feature extraction with structure profile:

First of all, the original input image is divided into

M

subsets. The structure profile

S

can be extracted from the input image

X

using an adaptive texture smoothing model as:

\underset{s}{\arg \min} | | S - X | |_{2}^{2} ⊙ w + λ | | S | |_{T V}

(10)

where

λ

is the free parameter and w is the weight that controls the similarity of adjacent pixels. For smoothing purposes, a local polynomial can be implemented as

p = \sum_{l = 1}^{m} c_{l} p_{l}

of degree

L

denoted as

\prod_{L},

with m being the number of elements in

\prod_{L} .

For N pixels in

Ω (x)

, assume

Ω (x) = {x_{1}, x_{2}, \dots, x_{N}}

is a set of points around x in

X

. To obtain the structure profile,

S

can be obtained as

S (x) : p (x)

for each

x \in Ω

with the optimization function as:

\underset{p \in \prod_{L}}{\arg \min} {\sum_{i = 1}^{N} | | p (x_{i}) - X (x_{i}) | |_{2}^{2} w (x, x_{i}) + λ | | \nabla_{p} (x_{i}) | |_{T V}}

(11)

where

\prod_{L} = {x^{α} : x \in R^{2}, α \in z_{+,} | α |_{1} \leq L}

is a polynomial with a degree

\leq L

, and w decides the contribution of pixels

X (x_{i})

towards the construction of polynomial

p (x_{i})

, such that

w (x_{i}, x) = \exp (\frac{\sum_{y \in Y (x_{i})} | | X (x_{i} + y) - X (x + y) | |_{2}^{2} G_{σ} (| | y | |)}{h_{0}^{2}})

(12)

where

Y (\cdot)

is the small region that can be used for comparing patches around

x_{i}

and x, the scale parameter

h_{0}

is set to 1, and

G_{σ}

is the Gaussian function with standard deviation

σ

. Equation (10) can now be expressed as:

\underset{p \in \prod_{L}}{\arg \min} {\sum_{i = 1}^{N} | | p (x_{i}) - X (x_{i}) | |_{2}^{2} w (x, x_{i}) + λ | | \nabla_{p} (x_{i}) | |_{1}}

(13)

Using the Bregman iteration algorithm [61], Equation (13) can be solved as below:

Update

p^{k + 1} (x_{i})

:

p^{k + 1} (x_{i}) = \underset{p \in \prod_{L}}{\arg \min} \sum_{i = 1}^{N} | | p (x_{i}) - X (x_{i}) | |_{2}^{2} w (x, x_{i}) + λ | | d^{k} (x_{i}) - \nabla_{p} (x_{i}) - b^{k} (x_{i}) | |_{2}^{2} w (x, x_{i})

(14)

Update

d^{k + 1} (x_{i})

:

\underset{d}{\arg \min} | d (x_{i}) |_{1} + λ | | d (x_{i}) - \nabla p^{k + 1} (x) - b^{k} (x) | |_{2}^{2}

(15)

The soft thresholding method can be used:

d^{k + 1} (x_{i}) = s o f t (\nabla p^{k + 1} (x_{i}) + b^{k} (x_{i}), 1 / λ)

(16)

Update

b^{k + 1} (x_{i})

:

b^{k + 1} (x_{i}) = b^{k} (x_{i}) + \nabla p^{k + 1} (x_{i}) - d^{k + 1} (x_{i})

(17)

These steps of updating

p^{k + 1} (x_{i})

,

d^{k + 1} (x_{i})

, and

b^{k + 1} (x_{i})

are repeated until convergence is attained.

After obtaining convergence, the aforementioned TabNet classifier is implemented on the extracted SPs to obtain the classification results for sTabNet.

3.5. Structure Profile on Unsupervised Pretrained TabNet (suTabNet)

After applying SP in feature extraction before uTabNet, the performance of the TabNet with unsupervised pretraining with SP feature extraction (suTabNet) can be obtained. Similarly, SP feature extraction can be applied to TabNets and uTabNets to obtain their SP-extracted versions sTabNets and suTabNets, respectively, and on other comparative methods for a fair comparison.

4. Experiments

4.1. Datasets

Three different datasets were used to validate the proposed methods.

The first dataset used for the experiment is the Indian Pines dataset collected by the Airborne Visible and Infrared (AVIRIS) sensor. It consists of 16 different classes with a spatial size of 145 × 145 pixels and spectral bands of 220 (200 after noise removal). The water-absorption bands 104–108, 150–163, and 220 were removed. The spectral wavelength ranges from 0.4 to 2.5

μ

m. Ten percent of training samples were taken into consideration from each class for training and the remaining were used for testing. The number of training and testing samples for each class is listed in Table 3.

The second dataset used is the University of Pavia dataset, which was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor in Italy. It has a spatial size of 610 × 340 pixels. It consists of a total of 103 spectral bands after noisy band removal. It includes spectral bands in the range 0.43 to 0.86

μ

m. Nine different classes exist in this dataset and 200 training samples were taken from each class as training samples; the remaining were used as testing samples. Table 4 shows the number of training and testing samples for each class.

The third dataset is the Salinas dataset, which is collected with an AVIRIS sensor in Salinas Valley, California. It comprises a spatial size of 512 × 217 pixels with 224 bands (204 bands after band removal). Water-absorption bands 108–112, 154–167, and 224 were removed. It has a spatial resolution of 3.7 m-pixels with 16 different classes. For training, 200 samples from each class were taken and remaining were used for testing. Table 5 shows the number of training and testing samples in different classes.

4.2. Experimental Setup

For all methods in comparison, such as RF, MLP, LightGBM, CatBoost, XGBoost, and CAE, parameters were estimated according to [28,29,30,31,32,35,41,53,54]. For our proposed methods, the Adam optimizer was used to estimate the optimal parameters. In all three datasets, 10% of training samples were allocated for validation and the remaining 90% of training samples were allocated for learning optimal weights of the network for tuning the hyper parameters of the network. The performance of TabNet, uTabNet, TabNets, uTabNets, and their SP-extracted versions sTabNet, suTabNet, sTabNets, and suTabNets on different parameters was investigated from a predefined set of parameters.

N_{d}

and

N_{a}

were selected from the range of

{8, 16, 24, 32, \dots, 1024}

,

γ = {1, 1.5, \dots 2}

,

λ_{s p a r s e} = {0, 0.0001, \dots, 0.1}

,

B = {16, 32, \dots, 16384}

,

m_{B} = {0.2, \dots, 1}

,

N_{s t e p s} = {1, 2, \dots, 10}

, and

B_{v} = {16, 32, \dots, 1024}

was used as a range of different parameters for TabNet, uTabNet, TabNets, uTabNets, and their SP versions. In all three datasets,

λ_{s p a r s e} = 0.01

,

γ = 1.5

,

B = 64

,

m_{B} = 0.6

,

N_{s t e p s} = 5

, and

B_{v} = 128

were selected. The proposed TabNets, and uTabNets can provide enhanced results in a smaller number of epochs, such as 200 epochs for Indian Pines data and 500 epochs for the other two datasets. Each experiment was repeated 10 times and the average value is reported to reduce the ambiguity. The optimal parameters of the proposed methods are listed in Table 6 for all three datasets.

In addition, varying window size in the range of

{19 \times 19, 21 \times 21, 23 \times 23, 25 \times 25, 27 \times 27}

was investigated to incorporate more spatial information. However, choosing a too large window size may add redundancy due to interclass variation among neighboring pixels. As shown in Table 7,

25 \times 25

was found to be the most suitable for all datasets. For Indian Pines and Salinas data,

10 \times 25 \times 25

was used, and

7 \times 25 \times 25

was used for University of Pavia data.

4.3. Result of Classification

Classification accuracies in terms of overall accuracy, average accuracy, Kappa coefficients, and per-class accuracy are enlisted in Table 8, Table 9, Table 10, Table 11, Table 12 and Table 13. It can be observed that TabNet shows better classification accuracy than the other methods of RF [28,29,30], MLP [35], LightGBM [54], CatBoost [53], and XGBoost [31,32]. In addition, TabNet with spatial attention (TabNets) and its unsupervised pretrained version (uTabNets) outperform TabNet and its unsupervised version uTabNet in all three datasets. Additionally, uTabNets outperforms the convolutional autoencoder (CAE) [30,41] in all three datasets. Moreover, sTabNet outperforms TabNet and SP-extracted versions of other methods, such as sRF, sMLP, sLightGBM, sCatBoost, and sXGBoost. Additionally, SP on TabNets (sTabNets) and its unsupervised pretrained version (uTabNets) outperform TabNet, uTabNet, TabNets, and uTabNets, along with all other SP-extracted versions in all three datasets.

In Figure 5, Figure 6 and Figure 7, the classification map of the three datasets is consistent with the results in Table 8, Table 9, Table 10, Table 11, Table 12 and Table 13. In Figure 5, the classification map for Indian Pines is shown, which consists of ground truth for the original image in Figure 5a,b. In addition, in these classification maps, labeled pixels are listed, in which sTabNet outperforms TabNet and the SP versions of other techniques. Furthermore, suTabNet outperforms uTabNet and sTabNet. The proposed TabNets shows less noise in the area of Soybean-notill and Woods, and uTabNets shows less noise in the region of Woods.

Moreover, their SP-extracted versions sTabNets, and suTabNets show less noise in the areas of Soybean-mintill and Woods, respectively. In Figure 6, the classification map for the University of Pavia is shown. It can be observed that the maps from the proposed TabNets and uTabNets are smoother in the regions of Bare soil and Meadows, respectively. Similarly, their SP- extracted versions sTabNets and suTabNets produce smoother areas of Bare soil and Meadows, respectively. In Figure 7, the classification map for different methods on the Salinas dataset are shown. It is illustrated that the maps from the proposed TabNets and uTabNets are less noisy in the regions of Corn-seneseed-green-weeds and Grapes-untrained. In addition, the maps from their SP-extracted versions sTabNets and suTabNets contain less noise in the areas of Grapes-untrained and Vinyard-untrained.

Figure 8 shows the classification performance of different methods for varying numbers of training samples in all datasets. For Indian pines, each class’ training sample is varied as

{10 %, 20 %, 30 %, 40 %, and 5 0 %} .

The training samples per class are varied as

{100, 200, 300, 400, and 5 00}

in both the University of Pavia and Salinas datasets. It can be observed that the proposed TabNets, uTabNets, sTabNets, and suTabNets outperform all other methods, such as RF, MLP, LightGBM, CatBoost, XGBoost, TabNet, uTabNet, CAE, and their SP versions for all numbers of training samples in all three datasets.

To evaluate statistical significance in OA performance improvement, the McNemar’s test [62] is shown in Table 14 among different pairs of methods. Two methods are statistically different if

z

, the value of McNemar’s test denoted as

(| z | > 0)

, is larger than 1.96 or 2.58, which represents statistical difference at 95% or 99% confidence levels, respectively. The comparison among TabNet, uTabNet, TabNets, uTabNets, sTabNet, suTabNet, sTabNets, suTabNets, and other classifiers is illustrated, which indicates their superiority over their counterparts.

To estimate the computational complexity involved in the proposed algorithms, execution time for different algorithms on three hyperspectral datasets is illustrated in Table 15. All the experiments were run using a NVIDIA Tesla K80 GPU and MATLAB on an Intel(R) Core (TM) i7-4770 central processing unit with 16 GB of memory.

It can be observed that TabNet has higher computational complexity in comparison to other tree-based methods, which may be due to the sequential attention involved in tabular learning. In addition, the unsupervised pretraining version of TabNet (uTabNet) has higher complexity than TabNet because of the pretraining operation.

Additionally, the proposed TabNets and its unsupervised pretraining version uTabNets show slightly higher complexity than TabNet and uTabNet because of the convolution layer in the attentive transformer for spatial processing of masks. Moreover, the SP-extracted versions TabNets, uTabNets, sTabNets, and suTabNets are slightly costlier than their counterparts due to SP extraction.

5. Conclusions

In this work, we propose a TabNets network that uses spatial attention to enhance the performance of the original TabNet for HSI classification by including a 2D CNN in the attentive transformer. Moreover, unsupervised pretraining on TabNets (uTabNets) was introduced, which can outperform TabNets. SP-extracted versions of TabNet, uTabNet, TabNets, uTabNets were also developed to further utilize spatial information. The experimental results obtained on different hyperspectral datasets illustrate the superiority of the proposed TabNets and uTabNets and their SP versions in terms of classification accuracy over other techniques, such as RF, MLP, LightGBM, CatBoost, XGBoost, and their SP versions. However, the proposed networks show slightly higher complexity for network optimization. In future work, more spatial and spectral information will be incorporated into TabNet to enhance the classification performance with reduced computational cost. Moreover, the performance of the enhanced TabNet on hyperspectral anomaly detection will be investigated. This has potential applications for solving similar classification and feature extraction problems for high-resolution thermal or remote sensing images.

Author Contributions

Conceptualization, C.S., Q.D. and Y.X.; methodology, C.S. and Q.D.; writing—original draft, C.S.; writing—review and editing, C.S. and Q.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Acknowledgments

The authors would like to thank the authors of all references used in the paper, the editors, and the anonymous reviewers for their detailed comments and suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References

Shah, C.; Du, Q. Collaborative and Low-Rank Graph for Discriminant Analysis of Hyperspectral Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 5248–5259. [Google Scholar] [CrossRef]
Shah, C.; Du, Q. Spatial-Aware Probabilistic Collaborative Representation for Hyperspectral Image Classification. In Proceedings of the Image and Signal Processing for Remote Sensing XXVI (Proc. Of SPIE), Edinburgh, UK, 21–25 September 2020. art no 115330Q. [Google Scholar] [CrossRef]
Li, W.; Du, Q. Joint Within-Class Collaborative Representation for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 2200–2208. [Google Scholar] [CrossRef]
Shah, C.; Du, Q. Modified Structure-Aware Collaborative Representation for Hyperspectral Image Classification. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021. [Google Scholar] [CrossRef]
Pan, L.; Li, H.-C.; Deng, Y.-J.; Zhang, F.; Chen, X.-D.; Du, Q. Hyperspectral Dimensionality Reduction by Tensor Sparse and Low-Rank Graph-Based Discriminant Analysis. Remote Sens. 2017, 9, 452. [Google Scholar] [CrossRef] [Green Version]
Li, W.; Wang, Z.; Li, L.; Du, Q. Feature extraction for hyperspectral images using local contain profile. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 5035–5046. [Google Scholar] [CrossRef]
Hong, D.; Wu, X.; Ghamisi, P.; Chanussot, J.; Yokoya, N.; Zhu, X.X. Invariant attribute profiles: A spatial-frequency joint feature extractor for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2020, 58, 3791–3808. [Google Scholar] [CrossRef] [Green Version]
Chang, C.-I. Hyperspectral Data Exploitation: Theory and Applications; Wiley: Hoboken, NJ, USA, 2007. [Google Scholar] [CrossRef]
Chen, M.; Tang, Y.; Zou, X.; Huang, Z.; Zhou, H.; Chen, S. 3D global mapping of large-scale unstructured orchard integrating eye-in-hand stereo vision and SLAM. Comput. Electron. Agric. 2021, 187, 106237. [Google Scholar] [CrossRef]
Wu, F.; Duan, J.; Chen, S.; Ye, Y.; Ai, P.; Yang, Z. Multi-Target Recognition of Bananas and Automatic Positioning for the Inflorescence Axis Cutting Point. Front. Plant Sci. 2021, 12, 705021. [Google Scholar] [CrossRef]
Cao, X.; Yan, H.; Huang, Z. A Multi-Objective Particle Swarm Optimization for Trajectory Planning of Fruit Picking Manipulator. Agronomy 2021, 11, 2286. [Google Scholar] [CrossRef]
Du, P.; Samat, A.; Waske, B.; Liu, S.; Li, Z. Random Forest and rotation forest for fully polarized SAR image classification using polarimetric and spatial features. ISPRS J. Photogramm. Remote Sens. 2015, 105, 38–53. [Google Scholar] [CrossRef]
Samat, A.; Persello, C.; Liu, S.; Li, E.; Miao, Z.; Abuduwaili, J. Classification of VHR multispectral images using extratrees and maximally stable extremal region-guided morphological profile. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 3179–3195. [Google Scholar] [CrossRef]
Melgani, F.; Bruzzone, L. Classification of Hyperspectral Remote Sensing Images with Support Vector Machines. IEEE Trans. Geosci. Remote Sens. 2004, 42, 1778–1790. [Google Scholar] [CrossRef] [Green Version]
Camps-Valls, G.; Gomez-Chova, L.; Munoz-Mari, J.; Vila-Frances, J.; Calpe-Maravilla, J. Composite Kernels for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2006, 3, 93–97. [Google Scholar] [CrossRef]
Li, J.; Bioucas-Dias, J.M.; Plaza, A. Spectral–Spatial Hyperspectral Image Segmentation Using Subspace Multinomial Logistic Regression and Markov Random Fields. IEEE Trans. Geosci. Remote Sens. 2012, 50, 809–823. [Google Scholar] [CrossRef]
Hughes, G. On the Mean Accuracy of Statistical Pattern Recognizers. IEEE Trans. Inf. Theory 1968, 14, 55–63. [Google Scholar] [CrossRef] [Green Version]
Fauvel, M.; Tarabalka, Y.; Benediktsson, J.A.; Chanussot, J.; Tilton, J.C. Advances in Spectral-Spatial Classification of Hyperspectral Images. Proc. IEEE 2013, 101, 652–675. [Google Scholar] [CrossRef] [Green Version]
Cui, M.; Prasad, S. Class-Dependent Sparse Representation Classifier for Robust Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2015, 53, 2683–2695. [Google Scholar] [CrossRef]
Wright, J.; Yang, A.Y.; Ganesh, A.; Sastry, S.S.; Yi, M. Robust Face Recognition via Sparse Representation. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 210–227. [Google Scholar] [CrossRef] [Green Version]
Chen, Y.; Nasrabadi, N.M.; Tran, T.D. Hyperspectral Image Classification Using Dictionary-Based Sparse Representation. IEEE Trans. Geosci. Remote Sens. 2011, 49, 3973–3985. [Google Scholar] [CrossRef]
Shah, C.; Du, Q. Spatial-Aware Collaboration-Competition Preserving Graph Embedding for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
Zhang, H.; Li, J.; Huang, Y.; Zhang, L. A Nonlocal Weighted Joint Sparse Representation Classification Method for Hyperspectral Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 2056–2065. [Google Scholar] [CrossRef]
Peng, J.; Du, Q. Robust Joint Sparse Representation Based on Maximum CORRENTROPY Criterion for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 7152–7164. [Google Scholar] [CrossRef]
Benediktsson, J.A.; Palmason, J.A.; Sveinsson, J.R. Classification of Hyperspectral Data from Urban Areas Based on Extended Morphological Profiles. IEEE Trans. Geosci. Remote Sens. 2005, 43, 480–491. [Google Scholar] [CrossRef]
Khodadadzadeh, M.; Li, J.; Plaza, A.; Ghassemian, H.; Bioucas-Dias, J.M.; Li, X. Spectral–Spatial Classification of Hyperspectral Data Using Local and Global Probabilities for Mixed Pixel Characterization. IEEE Trans. Geosci. Remote Sens. 2014, 52, 6298–6314. [Google Scholar] [CrossRef]
Fang, L.; Li, S.; Duan, W.; Ren, J.; Benediktsson, J.A. Classification of Hyperspectral Images by Exploiting Spectral–Spatial Information of Superpixel via Multiple Kernels. IEEE Trans. Geosci. Remote Sens. 2015, 53, 6663–6674. [Google Scholar] [CrossRef] [Green Version]
Ho, T.K. The Random Subspace Method for Constructing Decision Forests. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 832–844. [Google Scholar]
Xia, J.; Ghamisi, P.; Yokoya, N.; Iwasaki, A. Random Forest Ensembles and Extended Multiextinction Profiles for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2018, 56, 202–216. [Google Scholar] [CrossRef] [Green Version]
Rasti, B.; Hong, D.; Hang, R.; Ghamisi, P.; Kang, X.; Chanussot, J.; Benediktsson, J.A. Feature Extraction for Hyperspectral Imagery: The Evolution from Shallow to Deep: Overview and Toolbox. IEEE Geosci. Remote Sens. Mag. 2020, 8, 60–88. [Google Scholar] [CrossRef]
Chen, T.; Guestrin, C. XGBoost. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016. [Google Scholar]
Samat, A.; Li, E.; Wang, W.; Liu, S.; Lin, C.; Abuduwaili, J. Meta-XGBoost for Hyperspectral Image Classification Using Extended MSER-Guided Morphological Profiles. Remote Sens. 2020, 12, 1973. [Google Scholar] [CrossRef]
Li, Z.; Huang, L.; Zhang, D.; Liu, C.; Wang, Y.; Shi, X. A Deep Network Based on Multiscale Spectral-Spatial Fusion for Hyperspectral Classification. Proc. Int. Knowl. Sci. Eng. Manag. 2018, 11062, 283–290. [Google Scholar]
Li, Z.; Huang, L.; He, J. A Multiscale Deep Middle-Level Feature Fusion Network for Hyperspectral Classification. Remote Sens. 2019, 11, 695. [Google Scholar] [CrossRef] [Green Version]
Heaton, J. Ian Goodfellow, Yoshua Bengio, and Aaron Courville: Deep Learning. Genet. Program. Evolvable Mach. 2017, 19, 305–307. [Google Scholar] [CrossRef] [Green Version]
Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional Neural Networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
Zhang, M.; Li, W.; Du, Q.; Gao, L.; Zhang, B. Feature extraction for classification of Hyperspectral and LIDAR data using patch-to-patch CNN. IEEE Trans. Cybern. 2020, 50, 100–111. [Google Scholar] [CrossRef]
Hestness, J.; Narang, S.; Ardalani, N.; Diamos, G.; Jun, H.; Kianinejad, H.; Patwary, M.M.A.; Yang, Y.; Zhou, Y. Deep Learning Scaling Is Predictable, Empirically. Available online: https://arxiv.org/abs/1712.00409 (accessed on 29 October 2021).
Arik, S.O.; Pfister, T. TabNet: Attentive Interpretable Tabular Learning. arXiv 2020, arXiv:1908.07442. Available online: https://arxiv.org/abs/1908.07442v4 (accessed on 6 November 2021).
Arik, S.O.; Pfister, T. TabNet: Attentive Interpretable Tabular Learning. AAAI 2021, 35, 6679–6687. Available online: https://ojs.aaai.org/index.php/AAAI/article/view/16826 (accessed on 29 October 2021).
Kemker, R.; Kanan, C. Self-Taught Feature Learning for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 2693–2705. [Google Scholar] [CrossRef]
Hang, R.; Liu, Q.; Hong, D.; Ghamisi, P. Cascaded Recurrent Neural Networks for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5384–5394. [Google Scholar] [CrossRef] [Green Version]
Zhu, L.; Chen, Y.; Ghamisi, P.; Benediktsson, J.A. Generative Adversarial Networks for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2018, 56, 5046–5063. [Google Scholar] [CrossRef]
Chen, Y.; Jiang, H.; Li, C.; Jia, X.; Ghamisi, P. Deep Feature Extraction and Classification of Hyperspectral Images Based on Convolutional Neural Networks. IEEE Trans. Geosci. Remote Sens. 2016, 54, 6232–6251. [Google Scholar] [CrossRef] [Green Version]
Cheng, G.; Li, Z.; Han, J.; Yao, X.; Guo, L. Exploring Hierarchical Convolutional Features for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2018, 56, 6712–6722. [Google Scholar] [CrossRef]
Haut, J.M.; Paoletti, M.E.; Plaza, J.; Li, J.; Plaza, A. Active Learning with Convolutional Neural Networks for Hyperspectral Image Classification Using a New Bayesian Approach. IEEE Trans. Geosci. Remote Sens. 2018, 56, 6440–6461. [Google Scholar] [CrossRef]
Chen, Y.; Wang, Y.; Gu, Y.; He, X.; Ghamisi, P.; Jia, X. Deep Learning Ensemble for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 1882–1897. [Google Scholar] [CrossRef]
Duan, P.; Ghamisi, P.; Kang, X.; Rasti, B.; Li, S.; Gloaguen, R. Fusion of Dual Spatial Information for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2021, 59, 7726–7738. [Google Scholar] [CrossRef]
Guyon, I.; Elisseeff, A. An Introduction to Variable and Feature Selection. J. Mach. Learn. Res. 2003, 3, 1157–1182. [Google Scholar]
Chen, J.; Song, L.; Wainwright, M.J.; Jordan, M.I. Learning to Explain: An Information-Theoretic Perspective on Model Interpretation. International Conference to Machine Learning (ICML) 2018. Available online: https://arxiv.org/abs/1802.07814 (accessed on 2 November 2021).
Yoon, J.; Jordon, J.; Schaar, M. Invase: Instance-wise variable selection using neural networks: Semantic scholar. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019; Available online: https://openreview.net/forum?id=BJg_roAcK7 (accessed on 2 November 2021).
Grabczewski, K.; Jankowski, N. Feature Selection with Decision Tree Criterion. In Proceedings of the Fifth International Conference on Hybrid Intelligent Systems, Rio de Janeiro, Brazil, 6–9 November 2005. [Google Scholar]
Catboost. Catboost/Benchmarks: Comparison Tools. Available online: https://github.com/catboost/benchmarks (accessed on 4 November 2021).
Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
Wang, X.; Tan, K.; Du, Q.; Chen, Y.; Du, P. Caps-Triplegan: Gan-Assisted CapsNet for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 7232–7245. [Google Scholar] [CrossRef]
Peters, B.; Niculae, V.; Martins, A.F. Sparse Sequence-to-Sequence Models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019. [Google Scholar]
Yves, G.; Yoshua, B. Entropy Regularization. Semi-Supervised Learn. 2006, 151–168. [Google Scholar] [CrossRef] [Green Version]
Dauphin, Y.N.; Fan, A.; Auli, M.; Grangier, D. Language Modeling with Gated Convolutional Networks. 2016. Available online: https://arxiv.org/abs/1612.08083 (accessed on 28 October 2021).
Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; Dauphin, Y.N. Convolutional Sequence to Sequence Learning. 2017. Available online: https://arxiv.org/abs/1705.03122v1 (accessed on 1 November 2021).
Hoffer, E.; Hubara, I.; Soudry, D. Train Longer, Generalize Better: Closing the Generalization Gap in Large Batch Training of Neural Networks. 2017. Available online: http://arxiv-export-lb.library.cornell.edu/abs/1705.08741?context=cs (accessed on 27 October 2021).
Goldstein, T.; Osher, S. The Split Bregman Method for L1-Regularized Problems. SIAM J. Imaging Sci. 2009, 2, 323–343. [Google Scholar] [CrossRef]
Foody, G.M. Thematic Map Comparison. Photogramm. Eng. Remote Sens. 2004, 70, 627–633. [Google Scholar] [CrossRef]

Figure 1. Encoder for TabNets.

Figure 2. Feature transformer for TabNets.

Figure 3. TabNets decoder.

Figure 4. Attentive transformer for TabNets.

Figure 5. Classification maps for Indian pines data obtained using different methods including (a) ground truth image, (b) RF (77.52%), (c) MLP (77.48%), (d) LightGBM (76.54%), (e) CatBoost (75.32%), (f) XGBoost (73.79%), (g) TabNet (82.32%), (h) uTabNet (84.44%), (i) CAE (85.07%), (j) TabNets (94.93%), (k) uTabNets (96.36%), (l) sRF (88.98%), (m) sMLP (84.17%), (n) sLightGBM (89.68%), (o) sCatBoost (89.93%), (p) sXGBoost (80.93%), (q) sTabNet (94.41%), (r) suTabNet (95.85%), (s) sCAE (95.95%), (t) sTabNets (96.40%), and (u) suTabNets (97.51%).

Figure 6. Classification maps for University of Pavia data obtained for different methods including (a) ground truth image, (b) RF (78.17%), (c) MLP (80.12%), (d) LightGBM (85.21%), (e) CatBoost (82.19%), (f) XGBoost (81.06%), (g) TabNet (90.19%), (h) uTabNet (92.58%), (i) CAE (94.26%), (j) TabNets (96.58%), (k) uTabNets (97.86%), (l) sRF (89.54%), (m) sMLP (92.28%), (n) sLightGBM (92.94%), (o) sCatBoost (94.26%), (p) sXGBoost (94.26%), (q) sTabNet (97.62%), (r) suTabNet (98.95%), (s) sCAE (98.58%), (t) sTabNets (98.38%), and (u) suTabNets (99.29%).

Figure 7. Classification maps for Salinas data obtained for different methods including (a) ground truth image, (b) RF (79.76%), (c) MLP (83.73%), (d) LightGBM (87.87%), (e) CatBoost (84.12%), (f) XGBoost (86.99%), (g) TabNet (90.45%), (h) uTabNet (91.31%), (i) CAE (92.39%), (j) TabNets (97.32%), (k) uTabNets (98.36%), (l) sRF (89.10%), (m) sMLP (93.57%), (n) sLightGBM (95.05%), (o) sCatBoost (94.70%), (p) sXGBoost (93.93%), (q) sTabNet (96.20%), (r) suTabNet (98.85%), (s) sCAE (98.95%), (t) sTabNets (98.34%), and (u) suTabNets (99.33%).

Figure 8. Overall classification accuracy (with standard deviations) of considered methods and SP-extracted versions with different numbers of training samples per class: (a,b) Indian Pines, (c,d) University of Pavia, (e,f) Salinas dataset.

Table 1. Acronyms and their meaning for variants of proposed TabNet classifiers.

Notation	Meaning
TabNet	Attentive interpretable tabular learning
uTabNet	Unsupervised pretraining on attentive interpretable tabular learning
TabNets	Attentive interpretable tabular learning with spatial attention
uTabNets	Unsupervised pretraining on attentive interpretable tabular learning with spatial attention
sTabNet	Structure profile on attentive interpretable tabular learning
suTabNet	Structure profile on unsupervised pretrained attentive interpretable tabular learning
sTabNets	Structure profile on attentive interpretable tabular learning with spatial attention
suTabNets	Structure profile on unsupervised pretrained attentive interpretable tabular learning with spatial attention

Table 2. Layer-wise summary of spatial attention in proposed Tabnets for window size 25

\times

25. The last layer of the TabNets encoder is considered based upon Indian Pines data.

Table 2. Layer-wise summary of spatial attention in proposed Tabnets for window size 25

\times

25. The last layer of the TabNets encoder is considered based upon Indian Pines data.

(Attentive Transformer)
Layer	Shape of Output		Feature Map
FC	6250		6250
BN	6250		6250
Prior Scales	6250		6250
Entmax	(10,25,25)		10
2D Convolution	(16,8,8)		16
2D Convolution	(32,6,6)		32
2D Convolution	(64,4,4)		64
flatten	1024		1024
FC	6250		6250
(TabNets Encoder)
Layer		Feature Map
BN		6250
Feature Transformer		512
Split		256
Attentive Transformer		6250
Mask		6250
Feature Transformer		512
Split		256
LeakyReLU		256
FC		16

Table 3. Training and testing samples with class labels in the Indian Pines dataset.

No	Classes	Training	Testing
1	Alfalfa	5	41
2	Corn-notill	143	1285
3	Corn-mintill	83	747
4	Corn	24	213
5	Grass-pasture	48	435
6	Grass-trees	73	657
7	Grass-pasture-mowed	3	25
8	Hay-winfrowed	48	430
9	Oats	2	18
10	Soybean-notill	97	875
11	Soybean-mintill	246	2209
12	Soybean-clean	59	534
13	Wheat	21	184
14	Woods	127	1138
15	Buildings-grass-trees-drives	39	347
16	Stone-steel-towers	9	84
	Total	1027	9222

Table 4. Training and testing samples with class labels in the Pavia dataset.

No	Classes	Training	Testing
1	Asphalt	200	6431
2	Meadows	200	18,449
3	Gravel	200	1899
4	Tree	200	2864
5	Painted metal sheets	200	1145
6	Bare soil	200	4829
7	Bitumen	200	1130
8	Self-blocking bricks	200	3482
9	Shadows	200	747
	Total	1800	40,976

Table 5. Training and testing samples with class labels in the Salinas dataset.

No	Classes	Training	Testing
1	Broccoli-green-weeds-1	200	1809
2	Broccoli-green-weeds-2	200	3526
3	Fallow	200	1776
4	Fallow-rough-plow	200	1194
5	Fallow-smooth	200	2478
6	Stubble	200	3759
7	Celery	200	3379
8	Grapes-untrained	200	11,071
9	Soil- vineyard-develop	200	6003
10	Corn-seneseed-green-weeds	200	3078
11	Lettuce-romaine-4wk	200	868
12	Lettuce-romaine-5wk	200	1727
13	Lettuce-romaine-6wk	200	716
14	Lettuce-romaine-7wk	200	870
15	Vinyard-untrained	200	7068
16	Vinyard-vertical-trellis	200	1607
	Total	3200	50,929

Table 6. Parameter tuning in different algorithms.

Methods	Indian			Pavia			Salinas
Methods	$N_{d} = N_{a}$	$λ$	r	$N_{d} = N_{a}$	$λ$	r	$N_{d} = N_{a}$	$λ$	r
TabNet	256	-	-	64	-	-	512	-	-
uTabNet	32	-	0.7	64	-	0.6	64	-	0.8
TabNets	256	-	-	256	-	-	256	-	-
uTabNets	256	-	0.6	256	-	0.7	256	-	0.7
sTabNet	32	1	-	64	0.5	-	64	1	-
suTabNet	32	1	0.7	64	0.5	0.6	64	1	0.8
sTabNets	256	1	-	256	0.5	-	256	1	-
suTabNets	256	1	0.7	256	0.5	0.6	256	1	0.8

Table 7. Varying window size in TabNets (OA in percentage).

Window	Indian	Pavia	Salinas
19 × 19	92.75	95.44	96.87
21 × 21	93.23	96.04	96.97
23 × 23	94.50	96.14	97.07
25 × 25	94.93	96.58	97.32
27 × 27	94.53	96.29	97.15

Table 8. Classification accuracies on the Indian Pines dataset (10 percent training samples per class).

	RF	MLP	LightGBM	CatBoost	XGBoost	TabNet	uTabNet	CAE	TabNets	uTabNets
1	33.33	100	58.62	64.28	88.88	80.19	83.72	77.77	96.42	100
2	62.83	77.02	73.28	67.42	65.57	73.97	79.36	76.74	96.10	94.53
3	76.10	71.42	71.62	75.06	74.16	79.56	83.18	75.06	88.28	96.97
4	30.31	54.54	86.04	62.32	53.27	61.48	62.92	86.50	95.72	99.42
5	94.84	85.71	96.81	85.25	83.95	92.53	91.82	94.63	91.15	97.45
6	76.36	90.00	96.06	83.22	84.60	90.94	97.31	93.75	96.95	99.84
7	50.00	79.88	50.00	63.55	33.33	78.88	96.00	89.47	90.07	100
8	85.68	92.30	90.80	87.82	85.77	92.43	97.86	91.68	99.40	100
9	41.07	60.12	100	46.45	10.52	66.87	87.50	100	94.11	100
10	86.83	71.42	88.96	73.28	70.40	82.56	80.74	79.95	95.35	98.72
11	74.58	72.59	61.38	69.71	68.12	85.22	83.17	83.26	96.36	93.91
12	72.56	80.76	69.69	60.18	58.84	70.01	79.14	78.51	94.60	94.22
13	50.00	83.33	93.51	85.15	86.24	84.41	90.72	96.33	99.71	98.93
14	92.19	89.55	93.09	88.16	89.05	90.45	91.87	94.46	97.76	96.42
15	74.94	71.42	90.68	71.47	69.15	70.80	64.55	91.21	94.50	99.70
16	99.27	60.00	75.25	97.29	92.95	97.34	97.18	100	92.50	94.28
OA	77.20	78.84	76.54	75.32	73.79	82.32	84.14	85.07	94.93	96.36
AA	68.84	77.50	80.99	73.78	69.67	81.10	85.44	88.08	94.94	97.77
Kappa	0.73	0.75	0.72	0.71	0.6970	0.79	0.82	0.83	0.94	0.96

Table 9. SP Classification accuracies on Indian Pines dataset (10 percent training samples per class).

	sRF	sMLP	sLightGBM	sCatBoost	sXGBoost	sTabNet	suTabNet	sCAE	sTabNets	suTabNets
1	90.62	60.00	74.28	85.29	29.16	89.21	100	100	100	100
2	88.32	71.85	88.32	87.48	72.21	94.29	95.25	93.61	90.09	97.85
3	86.86	78.13	82.15	85.09	71.69	92.92	93.04	95.54	96.39	94.14
4	75.13	86.02	82.70	80.14	76.26	89.46	93.87	82.74	99.47	100
5	97.95	80.72	93.11	97.98	89.87	97.53	96.54	95.33	99.00	98.33
6	95.17	93.44	95.30	93.65	96.58	98.72	99.85	98.48	93.19	95.67
7	95.00	50.00	92.85	100	100	96.00	100	100	100	100
8	94.24	93.27	94.09	94.03	89.96	99.19	98.85	100	100	99.76
9	45.71	40.00	66.66	100	75.00	73.51	81.82	100	100	92.85
10	87.18	75.67	88.86	88.08	82.93	93.19	92.89	99.36	98.06	98.93
11	87.57	87.39	88.89	89.36	76.75	95.26	96.51	95.86	96.57	96.34
12	73.70	67.91	77.04	78.55	62.11	87.75	87.31	89.31	96.19	97.60
13	92.30	60.00	95.13	95.78	92.28	97.60	97.86	100	98.93	100
14	95.76	97.71	96.89	96.61	94.42	99.55	99.65	98.77	97.73	99.64
15	87.78	73.63	94.09	89.80	84.82	97.38	97.41	93.48	100	99.68
16	98.61	97.67	98.57	97.29	82.54	97.45	97.47	96.20	97.40	90.24
OA	88.98	84.17	89.68	89.93	80.93	94.41	95.85	95.95	96.40	97.51
AA	86.99	75.83	88.06	91.20	79.77	93.69	95.52	96.17	97.69	97.56
Kappa	0.87	0.81	0.88	0.88	0.78	0.94	0.95	0.95	0.95	0.97

Table 10. Classification accuracies on University of Pavia dataset (200 training samples per class).

	RF	MLP	LightGBM	CatBoost	XGBoost	TabNet	uTabNet	CAE	TabNets	uTabNets
1	96.65	96.88	95.94	95.88	95.98	96.58	98.53	98.98	92.71	96.27
2	92.80	92.62	94.84	93.53	93.35	96.95	98.14	98.45	98.85	98.88
3	59.05	56.78	69.38	63.56	62.23	72.76	79.50	89.67	92.16	92.67
4	70.10	67.17	76.16	70.62	70.78	79.59	88.48	97.44	99.37	98.75
5	96.27	98.31	94.83	97.00	93.25	97.18	99.21	99.47	98.99	100
6	48.48	55.47	66.76	60.41	56.74	81.67	87.83	81.17	97.44	97.61
7	57.15	59.82	57.36	55.91	55.32	70.21	67.34	82.30	97.01	91.69
8	81.14	85.07	83.10	81.72	81.39	86.06	81.31	88.57	90.39	91.26
9	99.86	99.73	99.60	99.33	99.60	98.74	100	98.03	96.03	99.41
OA	78.17	80.12	85.21	82.19	81.06	90.19	92.58	94.26	96.58	97.86
AA	77.95	79.09	82.00	79.77	78.74	86.64	88.93	92.68	95.88	96.28
Kappa	0.71	0.73	0.80	0.76	0.75	0.86	0.89	0.92	0.95	0.96

Table 11. SP Classification accuracies on University of Pavia dataset (200 training samples per class).

	sRF	sMLP	sLightGBM	sCatBoost	sXGBoost	sTabNet	suTabNet	sCAE	sTabNets	suTabNets
1	97.63	97.18	97.20	97.81	98.36	98.19	99.55	98.80	97.01	99.44
2	98.52	98.57	98.71	98.64	98.72	99.59	99.70	99.71	99.93	99.93
3	82.52	86.31	86.31	84.75	85.39	96.07	97.61	99.78	95.23	99.71
4	79.72	80.71	76.75	82.62	84.22	92.18	97.19	94.31	98.48	98.00
5	99.60	99.95	94.44	98.82	97.73	99.65	99.69	99.82	99.92	97.98
6	71.43	87.82	91.96	94.22	90.91	98.17	99.38	99.27	99.68	98.74
7	68.53	58.87	70.62	70.45	74.71	93.32	97.62	97.83	93.13	98.29
8	86.59	88.34	87.58	89.92	90.01	92.18	96.22	97.97	94.35	98.00
9	99.60	100	94.49	99.59	100	99.20	100	96.12	98.87	99.59
OA	89.54	92.28	92.94	94.20	94.26	97.62	98.95	98.58	98.38	99.29
AA	87.12	88.63	88.68	90.76	91.11	96.50	98.55	98.19	97.40	98.85
Kappa	0.86	0.89	0.90	0.92	0.92	0.96	0.98	0.98	0.97	0.99

Table 12. Classification Accuracies on Salinas dataset (200 training samples per class).

	RF	MLP	LightGBM	CatBoost	XGBoost	TabNet	uTabNet	CAE	TabNets	uTabNets
1	99.58	99.57	99.52	97.62	99.88	98.46	99.83	99.77	96.86	99.08
2	99.53	98.56	99.29	97.99	98.70	99.64	99.68	100	99.74	99.44
3	47.75	95.13	91.76	83.35	90.46	97.98	98.00	96.97	95.66	100
4	96.16	97.53	95.81	91.38	95.73	97.27	98.63	99.58	97.13	99.12
5	95.23	84.93	99.04	97.30	99.16	98.45	99.06	99.27	92.01	99.44
6	98.48	99.97	99.78	99.94	99.71	99.78	99.97	99.97	99.71	99.79
7	98.68	96.02	99.13	99.03	99.23	99.07	99.86	99.49	99.74	99.57
8	75.81	74.34	79.61	76.36	77.63	84.22	82.95	87.41	94.42	98.01
9	97.03	98.20	98.41	98.21	98.15	99.46	99.41	99.98	99.45	99.95
10	93.12	82.48	85.60	82.03	84.28	90.28	97.11	97.70	98.51	100
11	79.47	70.53	88.74	70.15	87.71	94.87	97.56	87.04	91.64	99.84
12	47.45	89.39	95.90	91.34	95.87	97.18	98.04	98.74	98.18	99.12
13	47.47	91.60	92.35	82.00	93.62	97.18	99.16	100	93.12	97.88
14	79.89	94.48	86.04	74.77	81.90	90.20	95.32	98.40	97.28	96.91
15	53.33	54.90	63.47	59.25	61.87	69.21	70.01	71.86	96.62	94.98
16	87.46	95.19	95.68	77.27	94.10	97.17	99.06	99.62	98.86	99.82
OA	79.76	83.73	87.87	84.12	86.99	90.35	91.31	92.39	97.32	98.36
AA	89.02	88.93	91.89	86.12	91.13	94.40	95.85	95.99	96.81	98.93
Kappa	0.77	0.81	0.86	0.81	0.85	0.88	0.90	0.91	0.97	0.98

Table 13. SP Classification accuracies on Salinas dataset (200 training samples per class).

	sRF	sMLP	sLightGBM	sCatBoost	sXGBoost	sTabNet	suTabNet	sCAE	sTabNets	suTabNets
1	100	100	100	99.28	69.43	99.97	99.98	99.62	99.65	100
2	99.94	99.97	99.97	99.72	99.94	99.90	99.92	100	99.97	99.94
3	93.71	97.92	94.38	94.15	96.02	98.42	99.75	99.94	99.91	100
4	96.18	97.57	95.92	96.35	95.80	97.34	97.15	100	99.46	98.72
5	98.63	96.05	99.35	98.15	97.54	98.57	99.60	99.52	99.52	99.32
6	99.53	100	99.97	96.82	99.41	99.66	100	100	99.91	99.73
7	99.80	99.65	99.58	99.31	99.96	99.45	99.87	100	99.70	99.97
8	96.39	91.68	92.54	95.12	89.84	97.59	99.02	98.21	97.21	99.00
9	98.91	99.34	98.69	99.05	98.72	100	99.96	100	99.23	100
10	93.32	90.46	89.71	93.18	95.51	92.84	97.39	99.87	99.81	99.83
11	84.62	81.26	98.80	84.00	98.95	94.01	97.12	98.72	99.89	99.90
12	77.90	96.42	99.00	99.76	99.64	99.40	99.47	98.78	99.67	100
13	82.68	98.80	97.50	98.58	99.09	94.09	99.17	100	98.05	99.65
14	93.41	97.41	98.83	73.44	98.23	95.87	94.52	99.35	99.19	100
15	63.38	79.93	86.23	87.20	89.95	84.63	96.68	96.53	95.34	97.41
16	95.74	95.08	96.44	92.23	98.64	95.64	99.78	100	99.70	100
OA	89.10	93.57	95.05	94.70	93.93	96.20	98.85	98.95	98.34	99.33
AA	92.13	95.16	96.68	94.15	95.34	97.07	98.71	99.41	99.14	99.59
Kappa	0.88	0.92	0.94	0.94	0.93	0.95	0.98	0.98	0.98	0.99

Table 14. Significance from the standard McNemar’s test for the difference between algorithms.

	Z Value/Significant?
	Indian	Pavia	Salinas
TabNet versus RF	8.57/yes	47.54/yes	44.62/yes
TabNet versus MLP	8.83/yes	40.35/yes	35.88/yes
TabNet versus LightGBM	10.34/yes	22.11/yes	13.40/yes
TabNet versus CatBoost	12.51/yes	30.11/yes	25.09/yes
TabNet versus XGBoost	8.70/yes	35.80/yes	16.91/yes
uTabNet versus TabNet	3.21/yes	8.72/yes	5.33/yes
TabNets versus TabNet	26.16/yes	43.24/yes	44.85/yes
uTabNets versus CAE	20.21/yes	29.78/yes	30.45/yes
uTabNets versus TabNets	3.77/yes	12.77/yes	6.15/yes
sTabNet versus sRF	14.17/yes	47.14/yes	46.09/yes
sTabNet versus sMLP	26.74/yes	35.29/yes	21.35/yes
sTabNet versus sLightGBM	12.67/yes	33.90/yes	12.20/yes
sTabNet versus sCatBoost	12.13/yes	22.45/yes	14.43/yes
sTabNet versus sXGBoost	27.69/yes	25.62/yes	21.26/yes
sTabNet versus TabNet	25.76/yes	44.01/yes	41.96/yes
suTabNet versus sTabNet	3.67/yes	14.99/yes	23.84/yes
sTabNets versus TabNets	3.81/yes	15.55/yes	5.98/yes
suTabNets versus uTabNets	3.58/yes	13.87/yes	5.89/yes

Table 15. Execution time (in seconds) in different experimental datasets.

	Indian	Pavia	Salinas
RF	5.31	4.99	5.64
MLP	12.04	16.5	40.67
LightGBM	380.21	345.01	1080.63
CatBoost	38.83	25.9	48.10
XGBoost	40.49	15.07	25.29
TabNet	710.05	639.104	637.26
uTabNet	873.82	747.34	837.24
CAE	915.28	1265.15	1315.54
TabNets	938.03	1620.17	1890.56
uTabNets	1796.25	2580.23	3520.17
sTabNet	963.05	1027.09	1206.17
suTabNet	1017.13	1180.33	1412.33
sTabNets	973.56	1780.03	1984.54
suTabNets	1880.17	2663.37	3720.57

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Shah, C.; Du, Q.; Xu, Y. Enhanced TabNet: Attentive Interpretable Tabular Learning for Hyperspectral Image Classification. Remote Sens. 2022, 14, 716. https://doi.org/10.3390/rs14030716

AMA Style

Shah C, Du Q, Xu Y. Enhanced TabNet: Attentive Interpretable Tabular Learning for Hyperspectral Image Classification. Remote Sensing. 2022; 14(3):716. https://doi.org/10.3390/rs14030716

Chicago/Turabian Style

Shah, Chiranjibi, Qian Du, and Yan Xu. 2022. "Enhanced TabNet: Attentive Interpretable Tabular Learning for Hyperspectral Image Classification" Remote Sensing 14, no. 3: 716. https://doi.org/10.3390/rs14030716

APA Style

Shah, C., Du, Q., & Xu, Y. (2022). Enhanced TabNet: Attentive Interpretable Tabular Learning for Hyperspectral Image Classification. Remote Sensing, 14(3), 716. https://doi.org/10.3390/rs14030716

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Enhanced TabNet: Attentive Interpretable Tabular Learning for Hyperspectral Image Classification

Abstract

1. Introduction

2. Related Work

2.1. Tree Based Learning

2.2. Attentive Interpretable Tabular Learning (TabNet)

3. Proposed Method

3.1. TabNet for Hyperspectral Image Classification

3.2. TabNet with Unsupervised Pretraining

3.3. TabNet with Spatial Attention (TabNets)

3.4. Structure Profile on TabNet (sTabNet)

3.5. Structure Profile on Unsupervised Pretrained TabNet (suTabNet)

4. Experiments

4.1. Datasets

4.2. Experimental Setup

4.3. Result of Classification

5. Conclusions

Author Contributions

Funding

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI