Article

Joint Classification of Hyperspectral and LiDAR Data Based on Adaptive Gating Mechanism and Learnable Transformer

1 Key Laboratory of Advanced Ship Communication and Information Technology, Harbin Engineering University, Harbin 150001, China
2 Agile and Intelligent Computing Key Laboratory, Chengdu 610000, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(6), 1080; https://doi.org/10.3390/rs16061080
Submission received: 1 February 2024 / Revised: 7 March 2024 / Accepted: 13 March 2024 / Published: 19 March 2024
(This article belongs to the Special Issue Recent Advances in the Processing of Hyperspectral Images)

Abstract

Utilizing multi-modal data, as opposed to hyperspectral image (HSI) data alone, enhances target identification accuracy in remote sensing. Transformers are applied to multi-modal data classification for their long-range dependency modeling but often overlook the intrinsic image structure by directly flattening image blocks into vectors. Moreover, as the encoder deepens, unprofitable information negatively impacts classification performance. Therefore, this paper proposes a learnable transformer with an adaptive gating mechanism (AGMLT). Firstly, a spectral–spatial adaptive gating mechanism (SSAGM) is designed to comprehensively extract the local information from images. It mainly contains point depthwise attention (PDWA) and asymmetric depthwise attention (ADWA). The former extracts the spectral information of HSI, and the latter extracts the spatial information of HSI and the elevation information of LiDAR-derived rasterized digital surface models (LiDAR-DSM). By omitting linear layers, local continuity is maintained. Then, the layer scale and a learnable transition matrix are introduced into the original transformer encoder and self-attention to form the learnable transformer (L-Former). It improves data dynamics and prevents performance degradation as the encoder deepens. Subsequently, learnable cross-attention (LC-Attention) with the learnable transition matrix is designed to augment the fusion of multi-modal data by enriching the feature information. Finally, poly loss, known for its adaptability to multi-modal data, is employed in training the model. Experiments in the paper are conducted on four famous multi-modal datasets: Trento (TR), MUUFL (MU), Augsburg (AU), and Houston2013 (HU). The results show that AGMLT achieves better performance than several existing models.

1. Introduction

Data from multiple remote sensing imaging devices in the same geographic area are available, which makes it possible to analyze land cover material using multi-modal data. Various remote sensing imaging sensor technologies can effectively capture different features of land cover materials. For example, hyperspectral imagers can acquire reflected spectral information while acquiring ground spatial information [1], and light detection and ranging (LiDAR) can measure the elevation information of ground objects [2]. Integrating multi-modal data allows for the construction of a more detailed and comprehensive feature representation of ground objects.
Since the late 20th century, hyperspectral imaging has emerged as a pivotal detection technique in remote sensing, which employs an imaging spectrometer to precisely segment the spectrum across visible near-infrared, short-wave infrared, and even long-wave infrared ranges. This process generates tens to hundreds of spectral bands for imaging ground objects simultaneously. It captures the spectral details of various ground objects alongside their spatial distribution, thereby merging image and spectral information effectively. Therefore, hyperspectral image (HSI) is widely used in land-cover classification [3], ecosystem measurement [4], military reconnaissance [5], target detection [6], and many other fields [7,8,9]. Among them, land cover classification, also known as hyperspectral image classification (HSIC), is particularly important in HSI processing tasks.
HSIC uses spectral dimension information and spatial dimension information to assign a category identifier to each pixel [10]. Early HSIC tasks mainly relied on data from a single mode. Roy et al. [11] integrated the three-dimensional convolutional neural network (3DCNN) and the two-dimensional convolutional neural network (2DCNN) to design a hybrid convolutional neural network (CNN) for spectral–spatial feature representation. Sun et al. [12] designed a classification model with heterogeneous spectral–spatial attention convolutional neural blocks, which simultaneously extracted the three-dimensional (3D) features from HSI. Although CNN has excellent performance, it has some limitations when dealing with long sequence properties of spectral features due to its inherent network backbone structure. Due to the power of the vision transformer, Hong et al. [13] developed a spectral transformer for extracting spectral discriminative features from bands of HSI. While transformer networks excel at simulating global interactions between token embeddings via self-attention (SA) mechanisms, they fall short in effectively disseminating local information among tokens [14]. Therefore, Sun et al. [15] combined the CNN module with a transformer encoder to form a new spectral–spatial feature tokenization transformer for representing sequential relations and high-level semantic features. Wang et al. [16] proposed a new spectral–spatial kernel combined with an improved visual transformation method to extract spectral–spatial features of HSI together.
For HSI, the spectral features of identical ground objects may vary, and, conversely, similar spectral features can correspond to different ground objects [17]. Therefore, it is necessary to supplement the ground object information with the multi-modal remote sensing data in the same area. LiDAR-DSM data, which primarily contain terrain variations and the feature heights of surface objects [18], are often employed in conjunction with HSI for joint classification, thereby enhancing classification accuracy. Compared with HSI alone, the advantage of multi-modal image collaborative classification is that it can fully describe the features of the target and make a more accurate judgment of the target. Consequently, numerous research initiatives have been undertaken to harness the complementary information between HSI and LiDAR-DSM. Pedergnana et al. [19] combined morphological extended attribute profiles on HSI and LiDAR-DSM data with raw spectral data from HSI for classification. However, directly stacking high-dimensional features can trigger the Hughes phenomenon, especially when training samples are scarce. Rasti et al. [20] utilized extinction profiles to derive spatial and elevation information from HSI and LiDAR-DSM data and integrate them with spectral information through a feature fusion method based on Orthogonal Total Variation Component Analysis (OTVCA), which facilitates the processing of fusion features in the lower-dimensional space.
However, traditional methods rely heavily on prior information, so it is difficult to improve the classification accuracy while maintaining robustness. Deep learning can learn high-level semantic information from data using the end-to-end pattern [21]. Roy et al. [22] proposed a joint feature learning fusion mechanism based on CNN and spatial morphological blocks to generate high-precision land cover maps. Song et al. [23] proposed a new hash-based deep metric learning approach that focuses on sample correlations between single-source and cross-source data. Xu et al. [24] used two-branch CNN to extract spatial and spectral information of HSI and a cascaded network to extract elevation information of LiDAR-DSM and carried out block-level fusion and classification. Although CNN has excellent performance, due to its inherent network backbone structure, it has certain limitations in processing long sequence attributes of features. Therefore, inspired by the classification of HSI, researchers have applied the fusion model of CNN and transformer to the joint classification task of HSI and LiDAR-DSM. Ding et al. [25] introduced the Global–Local Transformer Network (GLT-Net), designed to capture the global–local correlation features from inputs, effectively enhancing classification outcomes. This method only concatenated features from HSI and LiDAR-DSM without deep information fusion learning. Zhang et al. [26] developed the Local Information Interaction Transformer (LIIT), addressing the challenge of redundant or deficient complementary information between HSI and LiDAR-DSM data by dynamically integrating multi-modal features via the transformer, also achieving promising results. However, it has some shortcomings in extracting fine-grained information from images. Xu et al. [27] proposed a transformer with multi-branch interaction to extract spectral, spatial, and elevation information simultaneously. Its spectral and spatial information is learned independently before being concatenated, rather than interactive learning on multi-modal data. Roy et al. [28] proposed a transformer backbone to extract feature representations from multiple sources of data and use class tokens for final classification. Zhao et al. [29] proposed a novel dual-branch approach, combining a hierarchical CNN with a transformer network, designed to fuse multi-modal heterogeneous information and enhance joint classification performance. While these two methods enable the interactive learning of multi-modal data, feature extraction using a shallow CNN is relatively simplistic, lacking local fine-grained detail, and the feature dynamics within the fusion structure are inadequate.
To fully extract the local fine-grained features in HSI and LiDAR-DSM data and improve classification performance, a novel adaptive joint classification method based on the adaptive gating mechanism and learnable transformer (AGMLT) is designed. The dual-branch spectral–spatial adaptive gating mechanism (SSAGM) is engineered to concurrently extract spectral–spatial features from HSI and elevation features from LiDAR-DSM. Additionally, the layer scale and learnable transition matrices are incorporated into the original transformer encoder to enhance training dynamics. Learnable transition matrices are further applied to cross-attention, augmenting the attention graphs across various levels. The model training utilized poly loss, ultimately leading to improved classification performance. The key contributions of AGMLT are summarized as follows.
  • The Gated Spatial Attention Unit (GSAU) [30] is introduced into the joint classification of HSI and LiDAR-DSM and improved to design a dual-branch SSAGM feature extraction module. SSAGM encompasses the point depthwise attention module (PDWA) and the asymmetric depthwise attention module (ADWA). The PDWA primarily aims at extracting the spectral features of HSI, while the ADWA focuses on extracting the spatial information of HSI and the elevation information of LiDAR-DSM. This approach allows the linear layer to be omitted, emphasizing local continuity while keeping the complexity low.
  • The learnable transformer (L-Former) is designed to enhance data dynamics and mitigate performance decline as the depth of the transformer increases. The layer scale is incorporated into the output of each residual block, with different output channels being multiplied by distinct values to further refine the features. Concurrently, a learnable transition matrix is integrated into the self-attention (SA) to develop learnable self-attention (LS-Attention, LSA), which addresses the issue of centralized decomposition and facilitates the training of deeper transformers.
  • The learnable transition matrix is integrated into cross-attention, forming learnable cross-attention (LC-Attention). This integration diminishes the similarity among attention maps, thereby augmenting the diversity of the features.
  • Poly loss is implemented for classifying to improve the model training. Remote sensing datasets frequently exhibit uneven distributions and potential overlaps among samples of the same type. Furthermore, the features of data differ across various modalities. Poly loss is a versatile loss function suited for multi-modal data fusion classification.
The rest of this paper is arranged as follows. Section 2 expounds on the relevant theory of the proposed method AGMLT. Section 3 presents the four well-known multi-modal datasets, experimental settings, and various experiments on the datasets. In Section 4, the ablation analysis and performance of different percentages of training samples are discussed. Finally, Section 5 concludes the paper.

2. Methodology

The AGMLT proposed in this paper is shown in Figure 1. Firstly, SSAGM is designed to enhance feature extraction. Then, the L-Former with two learnable matrices is proposed to increase the data dynamics and prevent performance degradation as the transformer deepens. At the same time, LC-Attention with a learnable matrix enriches the feature information of the multi-modal fusion. Finally, poly loss is the loss function for AGMLT, which is more suitable for data with different modalities.
From Algorithm 1, the original input data of AGMLT can be represented as $X_{IN}^{HSI} \in \mathbb{R}^{H \times W \times B}$ and $X_{IN}^{LiDAR} \in \mathbb{R}^{H \times W}$, where the height is $H$, the width is $W$, and the number of spectral bands is $B$. HSI has a large number of spectral bands, which provide rewarding information but also significantly increase the computational cost. Principal component analysis (PCA) is used to reduce the number of spectral bands of the hyperspectral image. The data after PCA are reshaped to $X_{PCA}^{HSI} \in \mathbb{R}^{H \times W \times L}$, where $L$ is the number of bands after PCA. Since HSI is 3D data, $X_{PCA}^{HSI}$ is sent to the 3DCNN to extract 3D features $X_{3DCNN}^{HSI}$. $X_{3DCNN}^{HSI}$ is reshaped to $X_{2D}^{HSI}$ so that the data dimension matches the subsequent attention module. $X_{2D}^{HSI}$ is put into the PDWA, which focuses on extracting spectral features. The output $X_{PDWA}^{HSI}$ is sent to a 2DCNN for simple feature extraction. Then, the output of the 2DCNN is sent to the ADWA to extract the spatial information and obtain the output features $X_{ADWA}^{HSI}$. LiDAR-DSM is two-dimensional (2D) data, so it can be processed directly with a 2DCNN. Its output is sent to the ADWA to extract the elevation information and obtain the output $X_{ADWA}^{LiDAR}$. Next, the reshaped $X_{ADWA}^{HSI}$ and $X_{ADWA}^{LiDAR}$ are integrated into the Fusion Module, which may loop $N$ times. Then, the outputs $X_{LLF}^{HSI}$ and $X_{LLF}^{LiDAR}$ of the L-Former in the Fusion Module are put into LC-Attention for information fusion of the multi-modal data. Finally, the outputs $X_{LCA}^{HSI}$ and $X_{LCA}^{LiDAR}$ are fed into the multi-layer perceptron (MLP) separately for the final classification. Poly loss is used to measure the degree of inconsistency between the predicted labels $Y_P$ and the true labels $Y_L$.
Algorithm 1 The algorithm flow of AGMLT
Input: HSI $X_{IN}^{HSI} \in \mathbb{R}^{H \times W \times B}$, LiDAR-DSM $X_{IN}^{LiDAR} \in \mathbb{R}^{H \times W}$, labels $Y_L \in \mathbb{R}^{H \times W}$, patches = 11 × 11, PCA = 30.
Output: Prediction $Y_P$.
1: Initialize: batch size = 64, epochs = 100, learning rate depends on the dataset.
2: PCA: $X_{PCA}^{HSI} \in \mathbb{R}^{H \times W \times L}$.
3: Create all sample patches from $X_{PCA}^{HSI}$ and $X_{IN}^{LiDAR}$, and divide them into the training set $D_{train}$ and the test set $D_{test}$ ($D_{train}$ contains the labels; $D_{test}$ does not).
4: Training AGMLT (begin)
5: for epoch in range(epochs):
6:   for i, ($D_{train}^{HSI}$, $D_{train}^{LiDAR}$, $Y_L$) in enumerate($D_{train}$):
7:     $X_{PCA}^{HSI} \xrightarrow{\text{3DCNN}} X_{3DCNN}^{HSI} \xrightarrow{\text{reshape}} X_{2D}^{HSI} \xrightarrow{\text{PDWA}} X_{PDWA}^{HSI} \xrightarrow{\text{2DCNN}} X_{2DCNN}^{HSI} \xrightarrow{\text{ADWA}} X_{ADWA}^{HSI} \xrightarrow{\text{reshape}} X_{1D}^{HSI}$
8:     $X_{IN}^{LiDAR} \xrightarrow{\text{2DCNN}} X_{2DCNN}^{LiDAR} \xrightarrow{\text{ADWA}} X_{ADWA}^{LiDAR} \xrightarrow{\text{reshape}} X_{1D}^{LiDAR}$
9:     $X_{1D}^{HSI} \xrightarrow{\text{L-Former}} X_{LLF}^{HSI}$, $X_{1D}^{LiDAR} \xrightarrow{\text{L-Former}} X_{LLF}^{LiDAR}$
10:    $X_{LLF}^{HSI}, X_{LLF}^{LiDAR} \xrightarrow{\text{LC-Attention}} X_{LCA}^{HSI}, X_{LCA}^{LiDAR}$
11:    $X_{OUT} = \mathrm{MLP}(X_{LCA}^{HSI}) + \mathrm{MLP}(X_{LCA}^{LiDAR})$
12:    Poly loss($X_{OUT}$, $Y_L$)
13: Training AGMLT (end) and test AGMLT
14: $Y_P = \mathrm{AGMLT}_{trained}(D_{test})$
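As a concrete illustration of steps 2 and 3 of Algorithm 1, the following is a minimal NumPy/scikit-learn sketch of the PCA reduction and 11 × 11 patch extraction; the function names, padding mode, and label conventions are illustrative assumptions, not the authors' released code.

```python
# Sketch of the preprocessing in Algorithm 1: PCA on the HSI cube and extraction of
# co-registered 11x11 patches from HSI and LiDAR-DSM for every labelled pixel.
import numpy as np
from sklearn.decomposition import PCA

def pca_reduce(hsi, n_components=30):
    """Reduce an (H, W, B) HSI cube to n_components spectral bands."""
    h, w, b = hsi.shape
    flat = hsi.reshape(-1, b)                        # (H*W, B)
    reduced = PCA(n_components=n_components).fit_transform(flat)
    return reduced.reshape(h, w, n_components)       # (H, W, L)

def extract_patches(hsi_pca, lidar_dsm, labels, patch=11):
    """Cut patch x patch windows centred on every labelled pixel (label 0 = unlabelled)."""
    pad = patch // 2
    hsi_p = np.pad(hsi_pca, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    lid_p = np.pad(lidar_dsm, ((pad, pad), (pad, pad)), mode="reflect")
    xs_hsi, xs_lid, ys = [], [], []
    for r, c in zip(*np.nonzero(labels)):
        xs_hsi.append(hsi_p[r:r + patch, c:c + patch, :])
        xs_lid.append(lid_p[r:r + patch, c:c + patch])
        ys.append(labels[r, c] - 1)                  # classes become 0-based
    return np.stack(xs_hsi), np.stack(xs_lid), np.array(ys)
```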

2.1. SSAGM

Although transformer networks can simulate global interactions between token embeddings through SA, they are less capable of extracting fine-grained local feature patterns [31]. Owing to their superior ability to model spatial context features, CNNs perform exceptionally well in HSI classification tasks, and many applications have proved that CNNs can also extract the deep features of LiDAR-DSM [32]. Therefore, we introduce a CNN to extract features from the input data. To further enhance the feature representation, we take inspiration from the GSAU [30] to design the SSAGM. The key components of SSAGM are PDWA and ADWA, which allow the linear layer to be excluded and local continuity to be captured while keeping the complexity low.
PDWA is used to extract spectral features from HSI, which is shown on the left side of Figure 2. PDWA includes pointwise convolution (PWConv), point depthwise convolution (PDWConv), multiplication operation, and residual connection. ADWA is mainly used to extract spatial feature information of HSI and elevation information of LiDAR-DSM. Its structure is shown on the right of Figure 2.
The input data $X_{Pin}$ of the PDWA are divided evenly into $X_{P1}$ and $X_{P2}$. $X_{P1}$ is sent to a PWConv layer to obtain $X_{PP1}$, which is then fed into the PDWConv with a 1 × 1 convolution kernel to yield the output $X_{PD}$. The number of groups in the PDWConv layer is equal to the number of channels of $X_{PP1}$. Since the convolution kernel size is 1 × 1 and the number of groups equals the number of input channels, this layer focuses on the channel information. $X_{P2}$ also passes through a PWConv to obtain $X_{PP2}$; to preserve part of the original information, no further operations are performed on $X_{PP2}$. The data obtained by multiplying $X_{PD}$ and $X_{PP2}$ are connected with $X_{Pin}$ via a residual connection and then sent to a PWConv layer to obtain the output $X_{Pout}$. The PWConv contains a 1 × 1 convolution kernel whose purpose is to adjust the data dimension for the element-by-element multiplication and the residual connection. The main process of PDWA is as follows:
$\mathrm{PDWA}(X_{P1}, X_{P2}) = F_{\mathrm{PDW}}(X_{P1}) \odot X_{P2},$
where $X_{P1}$ and $X_{P2}$ represent the feature data of the two branches in PDWA, respectively, and $F_{\mathrm{PDW}}(\cdot)$ and $\odot$ represent the PDWConv and element-wise multiplication.
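The following is a minimal PyTorch sketch of PDWA as described above. The split/expand channel widths and the exact ordering of the residual connection and final PWConv are assumptions made for illustration; they are not taken from the authors' released code.

```python
import torch
import torch.nn as nn

class PDWA(nn.Module):
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.pw1 = nn.Conv2d(half, channels, kernel_size=1)                       # PWConv on X_P1
        self.pdw = nn.Conv2d(channels, channels, kernel_size=1, groups=channels)  # PDWConv: 1x1, groups = channels
        self.pw2 = nn.Conv2d(half, channels, kernel_size=1)                       # PWConv on X_P2 (identity branch)
        self.pw_out = nn.Conv2d(channels, channels, kernel_size=1)                # output PWConv

    def forward(self, x):                              # x: (B, C, H, W)
        x1, x2 = torch.chunk(x, 2, dim=1)              # even split into X_P1 and X_P2
        gated = self.pdw(self.pw1(x1)) * self.pw2(x2)  # F_PDW(X_P1) multiplied element-wise with X_PP2
        return self.pw_out(gated + x)                  # residual connection with the input, then PWConv
```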
ADWA mainly includes the PWConv, two asymmetric depthwise convolution (ADWConv) layers, multiplication operation, and residual connection. This module changes the PDWConv in PDWA to two ADWConv with 3 × 1 and 1 × 3 convolution kernels, and other operations are unchanged. The main processes of ADWA are calculated as follows:
$\mathrm{ADWA}(X_{A1}, X_{A2}) = F_{\mathrm{ADW2}}\big(F_{\mathrm{ADW1}}(X_{A1})\big) \odot X_{A2},$
where $X_{A1}$ and $X_{A2}$ represent the features of the two branches in ADWA, and $F_{\mathrm{ADW1}}$ and $F_{\mathrm{ADW2}}$ represent the two ADWConv layers.
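A matching sketch of ADWA follows: it is identical to the PDWA sketch except that the 1 × 1 depthwise convolution is replaced by a 3 × 1 followed by a 1 × 3 depthwise convolution. The padding choices, which keep the spatial size unchanged, are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class ADWA(nn.Module):
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.pw1 = nn.Conv2d(half, channels, kernel_size=1)
        self.adw1 = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0), groups=channels)  # 3x1 ADWConv
        self.adw2 = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1), groups=channels)  # 1x3 ADWConv
        self.pw2 = nn.Conv2d(half, channels, kernel_size=1)
        self.pw_out = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                                          # x: (B, C, H, W)
        x1, x2 = torch.chunk(x, 2, dim=1)
        gated = self.adw2(self.adw1(self.pw1(x1))) * self.pw2(x2)  # F_ADW2(F_ADW1(X_A1)) gated by X_A2
        return self.pw_out(gated + x)
```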

2.2. L-Former

Figure 3 shows the structural details of the proposed L-Former. Transformer encoders are used to model the deep semantic relationships between feature tokens, mapping the input of the L-Former to a sequence of vectors. A class token is embedded at the head of the vector sequence to summarize the overall sequence. Then, n position encodings are embedded into the sequence to obtain multiple tokens; the closer two tokens are, the more similar their position encodings. The tokens are then fed into the transformer encoder. The output of learnable attention (L-Attention) is classified using an MLP, which consists of one layer norm (LN) and two fully connected layers with the Gaussian Error Linear Unit (GELU) [33] activation function, to obtain the final classification result. The above operations are stacked repeatedly N times. As the model deepens, the attention maps of the deeper blocks become increasingly similar, which means that adding more blocks to a deep transformer may not improve model performance [34].
Therefore, we introduce the layer scale from CaiT [35] into the transformer encoder. The layer scale adds a learnable diagonal matrix, initialized to values near 0, to the output of each residual block. Applying distinct multiplication factors to the different output channels of the SA or MLP refines the features and enhances their expression quality in the model, which makes it possible to train deeper networks. The formulas are as follows:
$x_l' = x_l + \mathrm{diag}(\lambda_{l,1}, \ldots, \lambda_{l,d}) \times \mathrm{SA}\big(\eta(x_l)\big),$

$x_{l+1} = x_l' + \mathrm{diag}(\lambda'_{l,1}, \ldots, \lambda'_{l,d}) \times \mathrm{MLP}\big(\eta(x_l')\big),$
where $\eta$ is the layer norm and MLP is the feedforward network used in the L-Former. $\lambda_{l,i}$ and $\lambda'_{l,i}$ are the learnable weights for the SA and MLP branches, respectively. The diagonal values are all initialized to a fixed small value $\sigma$: when the depth is within 18, $\sigma$ is set to 0.1; $\sigma = 5 \times 10^{-3}$ is used for depths within 24; and $\sigma = 5 \times 10^{-6}$ is adopted for deeper networks.
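A minimal sketch of one L-Former residual block with layer scale is given below. The attention and MLP sub-modules are injected from outside, and the dimensions and default $\sigma$ are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LayerScaleBlock(nn.Module):
    def __init__(self, dim, attention, mlp, sigma=0.1):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)   # eta in the formulas above
        self.attention, self.mlp = attention, mlp
        self.gamma1 = nn.Parameter(sigma * torch.ones(dim))  # diag(lambda_{l,1}, ..., lambda_{l,d})
        self.gamma2 = nn.Parameter(sigma * torch.ones(dim))  # diag(lambda'_{l,1}, ..., lambda'_{l,d})

    def forward(self, x):                                    # x: (B, N, dim) token sequence
        x = x + self.gamma1 * self.attention(self.norm1(x))  # scaled attention branch
        x = x + self.gamma2 * self.mlp(self.norm2(x))        # scaled MLP branch
        return x

# Example usage, with LSAttention as sketched later in this subsection:
# block = LayerScaleBlock(64, LSAttention(64, heads=4),
#                         nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64)))
```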
In order to learn the relationships between feature tokens, learnable weight matrices $W_q$, $W_k$, and $W_v$ are pre-defined for SA. The feature tokens are multiplied with these three learnable weights and linearly packed into three different matrices (queries $Q$, keys $K$, and values $V$). The softmax function converts the scores into weight probabilities, and SA is written as follows:
$\mathrm{SA}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^T}{\sqrt{d_K}}\right)V,$
where $d_K$ represents the dimension of $K$.
At the same time, the learnable transition matrix $M \in \mathbb{R}^{N \times N}$ from re-attention is introduced into SA to obtain LSA, which overcomes the problem of attention collapse and allows a deeper transformer to be trained [34].
$\mathrm{LS\text{-}Attention}(Q, K, V) = M^T\,\mathrm{Softmax}\!\left(\frac{QK^T}{\sqrt{d_K}}\right)V,$
where the transition matrix $M$ is multiplied with the self-attention maps along the head dimension. The softmax function is applied to the rows of the similarity matrix. Relationships between tokens are modeled by the similarities between pairs of $Q$ and $K$, from which the attention scores are acquired.
We adopt multiple groups of weights to form L-Attention, which is similar to multi-head self-attention (MHSA). L-Attention contains multiple learnable SA (LSA) heads, and all of the LSA outputs are concatenated together. The expression is as follows:
$\mathrm{L\text{-}Attention}(Q, K, V) = \mathrm{Concat}(\mathrm{LSA}_1, \mathrm{LSA}_2, \ldots, \mathrm{LSA}_h)\,W,$
Here, $h$ is the number of attention heads and $W$ is the parameter matrix.
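Below is a minimal PyTorch sketch of LS-Attention/L-Attention. Following the re-attention mechanism [34] from which the transition matrix is borrowed, this sketch applies the learnable matrix across the attention heads; that interpretation, the identity initialization, and the head count are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LSAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads, self.d_k = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)                 # packs W_q, W_k, W_v
        self.transition = nn.Parameter(torch.eye(heads))   # learnable transition matrix M
        self.proj = nn.Linear(dim, dim)                    # parameter matrix W of L-Attention

    def forward(self, x):                                  # x: (B, N, dim)
        b, n, _ = x.shape
        q, k, v = self.qkv(x).reshape(b, n, 3, self.heads, self.d_k).permute(2, 0, 3, 1, 4)
        attn = ((q @ k.transpose(-2, -1)) / self.d_k ** 0.5).softmax(dim=-1)  # Softmax(QK^T / sqrt(d_K))
        attn = torch.einsum("gh,bhij->bgij", self.transition, attn)           # mix attention maps with M
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)                    # concatenate the h LSA heads
        return self.proj(out)
```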

2.3. LC-Attention

Figure 4 shows the schematic diagram of the fusion encoding module for HSI feature representations and LiDAR-DSM feature representations, respectively.
Taking the fusion encoding module of the HSI feature representations as an example, the class token $X_{cls}^{HSI}$ of HSI is first spliced with the pixel tokens of the LiDAR-DSM data, and the formulas are

$\hat{X}_{cls}^{HSI} = F^{HSI}(X_{cls}^{HSI}),$

$X_L^{HSI} = \mathrm{Concat}\big(\hat{X}_{cls}^{HSI},\; X^{LiDAR} \setminus X_{cls}^{LiDAR}\big),$

where $X_{cls}^{HSI}$ is the class token of the HSI feature representations and $X_{cls}^{LiDAR}$ is the class token of the LiDAR-DSM feature representations. $F^{HSI}$ is a linear mapping function for dimensional alignment, and $\hat{X}_{cls}^{HSI}$ is the transformed class token whose dimension is consistent with that of $X_{cls}^{LiDAR}$. $X_L^{HSI}$ denotes the new LiDAR-DSM feature representation in which the original $X_{cls}^{LiDAR}$ is replaced by $\hat{X}_{cls}^{HSI}$.
Then, LC-Attention with the learnable transition matrix $M \in \mathbb{R}^{N \times N}$ is used to encode the relationship between $\hat{X}_{cls}^{HSI}$ and $X_L^{HSI}$. $\hat{X}_{cls}^{HSI}$ is the only query vector for the attention operations. The feature fusion representation based on LC-Attention is expressed as follows:

$Q = \hat{X}_{cls}^{HSI} W_q, \quad K = X_L^{HSI} W_k, \quad V = X_L^{HSI} W_v,$

$\mathrm{LC\text{-}Attention}(X_L^{HSI}) = M^T\,\mathrm{Softmax}\!\left(\frac{QK^T}{\sqrt{C/H}}\right)V,$

where $W_q$, $W_k$, and $W_v$ are learnable weight matrices, $C$ is the embedding dimension, and $H$ is the number of attention heads.
Because only the class token serves as the query vector, the time and space complexity of computing the attention map is linear in the sequence length, which makes the entire computation more efficient. Similar to MHSA, LC-Attention also uses multiple heads, namely MHLCA. After layer norm and residual connection, the formula of LC-Attention is expressed as follows:
$Y_{cls}^{HSI} = \hat{X}_{cls}^{HSI} + \mathrm{MHLCA}\big(\mathrm{LN}(X_L^{HSI})\big),$

$\hat{Y}_{cls}^{HSI} = G^{HSI}(Y_{cls}^{HSI}),$

$\hat{X}^{HSI} = \mathrm{Concat}\big(\hat{Y}_{cls}^{HSI},\; X^{HSI} \setminus X_{cls}^{HSI}\big),$
where $Y_{cls}^{HSI}$ is the class token obtained by learning the fused features, whose dimension is consistent with the class token of the LiDAR-DSM representations. $\hat{Y}_{cls}^{HSI}$ denotes the class token mapped back to the dimension of the HSI class token through the linear mapping $G^{HSI}$, which is used for dimensional alignment.
The same processing is used for fusion in the LiDAR-DSM feature representations. The output after fusion is $\hat{X}^{LiDAR}$, and the class token obtained by feature learning is $\hat{Y}_{cls}^{LiDAR}$. The new class tokens obtained through feature fusion learning are fed into the classifier for classification.
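To make the LC-Attention formulas concrete, below is a minimal PyTorch sketch for the HSI branch, in which the transformed class token is the only query and attends to the token sequence that contains it together with the LiDAR-DSM pixel tokens. Applying the transition matrix across the heads and the layer-norm placement are assumptions of this sketch, not the authors' code.

```python
import torch
import torch.nn as nn

class LCAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads, self.d_k = heads, dim // heads
        self.wq, self.wk, self.wv = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.transition = nn.Parameter(torch.eye(heads))   # learnable transition matrix M
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens):                  # tokens: (B, N, dim); tokens[:, :1] is the class token
        x = self.norm(tokens)                   # LN before the cross-attention
        b, n, _ = x.shape
        q = self.wq(x[:, :1]).reshape(b, 1, self.heads, self.d_k).transpose(1, 2)  # (B, H, 1, d_k)
        k = self.wk(x).reshape(b, n, self.heads, self.d_k).transpose(1, 2)          # (B, H, N, d_k)
        v = self.wv(x).reshape(b, n, self.heads, self.d_k).transpose(1, 2)
        attn = ((q @ k.transpose(-2, -1)) / self.d_k ** 0.5).softmax(dim=-1)        # (B, H, 1, N)
        attn = torch.einsum("gh,bhij->bgij", self.transition, attn)                 # mix heads with M
        fused = (attn @ v).transpose(1, 2).reshape(b, 1, -1)                        # (B, 1, dim)
        return tokens[:, :1] + fused            # residual connection on the class token
```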

2.4. Poly Loss

Cross-entropy loss (CE) and focal loss (FC) are the most common choices for training classification networks. However, a good loss function should take a more flexible form for tailoring to different tasks and datasets [36]. For remote sensing datasets, the sample distribution of the same class may be uneven, and some different samples will even overlap. This makes the classification effort more difficult.
Leng et al. [36] proposed poly loss, which decomposes the commonly used classification loss functions into a series of weighted polynomial bases through Taylor expansion. CE and FC are decomposed into weighted polynomial bases of $(1 - P_t)$, where $P_t$ is the predicted probability of the target class, and each polynomial base is weighted by the corresponding polynomial coefficient. Poly loss adjusts the polynomial coefficients for different tasks and datasets, and its formulas are as follows:
$L_{PC} = -\log(P_t) + \sum_{j=1}^{N} \varepsilon_j (1 - P_t)^j,$

$L_{PF} = -(1 - P_t)^\gamma \log(P_t) + \sum_{j=1}^{N} \varepsilon_j (1 - P_t)^{j+\gamma},$
where $j$ denotes the power of the polynomial basis and $\gamma$ denotes the power shift of the polynomial terms. $\varepsilon_j \in [-1/j, +\infty)$ is the perturbation term, which allows the first $N$ polynomial coefficients to be adjusted without worrying about the infinitely many higher-order ($j \ge N + 1$) coefficients. The predicted probability of the model for the target class is denoted by $P_t$. Adjusting the first polynomial term gives the most significant gain, so the poly loss formulas can be reduced to the following:
$L_{PC} = -\log(P_t) + \varepsilon_1 (1 - P_t),$

$L_{PF} = -(1 - P_t)^\gamma \log(P_t) + \varepsilon_1 (1 - P_t)^{1+\gamma}.$
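A minimal PyTorch sketch of the two simplified poly losses above (Poly-1 CE and Poly-1 focal) is given below; the default values of epsilon and gamma are illustrative, since the coefficients are tuned per task and dataset.

```python
import torch
import torch.nn.functional as F

def poly1_cross_entropy(logits, targets, epsilon=1.0):
    """L_PC = -log(P_t) + epsilon * (1 - P_t), averaged over the batch."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.softmax(logits, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
    return (ce + epsilon * (1.0 - p_t)).mean()

def poly1_focal(logits, targets, epsilon=1.0, gamma=2.0):
    """L_PF = -(1 - P_t)^gamma * log(P_t) + epsilon * (1 - P_t)^(gamma + 1)."""
    log_p_t = torch.log_softmax(logits, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
    p_t = log_p_t.exp()
    focal = -((1.0 - p_t) ** gamma) * log_p_t
    return (focal + epsilon * (1.0 - p_t) ** (gamma + 1.0)).mean()
```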

3. Experimental Results

3.1. Data Description

The performance of the proposed AGMLT method in this paper is evaluated on four public multi-modal datasets: Trento (TR), MUUFL (MU) [37,38], Augsburg (AU), and Houston2013 (HU). Details of all datasets are described as follows.
1. TR
The TR dataset covers a rural area surrounding the city of Trento, Italy. It includes HSI and LiDAR-DSM data with 600 × 166 pixels, and six categories. The HSI has 63 bands in the wavelength range from 420.89 to 989.09 nm. The spectral resolution is 9.2 nm, and the spatial resolution is 1 m. The LiDAR-DSM data consist of a single-channel image containing the altitude of the corresponding ground position, and its image size is the same as that of HSI. The pseudo-color image of HSI, the grayscale image of LiDAR-DSM, and the ground-truth image are shown in Figure 5. The color, class name, training samples, and test samples for the TR dataset are presented in Table 1.
2. MU
Both HSI and LiDAR data from the MU dataset were collected in one flight using a flight platform equipped with the CASI-1500 hyperspectral imager and Gemini LiDAR. The MU dataset covers the University of Southern Mississippi Gulf Park Campus, Long Beach, Mississippi, USA. The dataset was acquired in November 2010 with a spatial resolution of 1 m per pixel. The original dataset is 325 × 337 pixels with 72 bands, and the imaging spectral range is between 380 nm and 1050 nm. Due to the influence of imaging noise, the first four and last four bands were removed, and 64 bands were ultimately used. The invalid area on the right of the original image was removed, and the 325 × 220 pixels were retained. A DSM image was generated using LiDAR data, and its spatial resolution was 1 m per pixel. Objects in the imaging scene were labeled into eleven categories. The pseudo-color image of HSI, the grayscale image of LiDAR-DSM, and the ground-truth image are shown in Figure 6. The details of MU dataset are presented in Table 2.
3. AU
The AU dataset was captured over the city of Augsburg, Germany. The HSI was obtained using a DAS-EOC HySpex sensor [39], and the LiDAR-DSM data were collected using the DLR-3 K system [40]. The spatial resolutions were down sampled to a unified resolution of 30 m for managing the multi-modal data adequately. The HSI has 180 bands from 0.4 to 2.5 μm, while LiDAR-DSM data have a single raster. The pixel size of AU is 332 × 485, with seven different land cover classes being depicted. The pseudo-color image of HSI, the grayscale image of LiDAR-DSM, and the ground-truth image are shown in Figure 7. Details on the AU dataset are presented in Table 3.
4. HU
The HU dataset was provided by IEEE GRSS for the 2013 Data Fusion Competition. The scene covers the University of Houston and its surrounding area in Texas, USA. It includes HSI and LiDAR-DSM data with 340 × 1905 pixels, and fifteen categories. The HSI has 144 bands in the wavelength range of 0.38 to 1.05 µm with a spatial resolution of 2.5 m per pixel. The spatial resolution of the LiDAR-DSM data is also 2.5 m per pixel. The pseudo-color image of HSI, the grayscale image of LiDAR-DSM, and the ground-truth image are shown in Figure 8. The color, class name, training samples, and test samples for the HU dataset are shown in Table 4.

3.2. Experimental Setting

The experiments related to this paper were conducted on a computer with Windows 11, an Intel Core i9 CPU with 32 GB memory, and an NVIDIA RTX 3090Ti graphics card with 24 GB GPU memory, with the code written in Python 3.8 under PyTorch 1.12.0. The sizes of the input images, batch size, and epochs were set to 11 × 11, 64, and 100, respectively. The number of principal components chosen by PCA was set to 30. To improve the reliability of the experimental results, training samples and test samples were randomly selected for the TR, MU, AU, and HU datasets. Since the baseline algorithm in this paper is HCT, the choices of training samples and test samples are consistent with HCT [29]. Table 1, Table 2, Table 3 and Table 4 list the numbers of training samples and test samples for the four datasets. All experiments were conducted five consecutive times, and the final classification results are the averages of the five runs. The overall accuracy (OA), average accuracy (AA), and kappa coefficient (K), which are commonly used in classification experiments, are chosen as the key evaluation indexes of this paper.
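A plain NumPy sketch of the three evaluation indexes (OA, AA, and kappa), computed from predicted and true class labels via a confusion matrix, is given below; the function and variable names are illustrative.

```python
import numpy as np

def classification_scores(y_true, y_pred, num_classes):
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)    # confusion matrix
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    oa = np.trace(cm) / cm.sum()                                  # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))                    # mean of per-class accuracies
    pe = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / cm.sum() ** 2  # chance agreement
    kappa = (oa - pe) / (1.0 - pe)                                # kappa coefficient
    return oa, aa, kappa
```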
To obtain the best accuracy, it is necessary to compare the experimental results under different parameter settings. The initial learning rate of Adam, the number of attention heads, the depth of the encoders, and the depth of the Fusion Module are tested on all datasets. A controlled-variable approach is used in the experiments; that is, the input size, epochs, number of runs, and the numbers of training and test samples are kept consistent.

3.2.1. Initial Learning Rate

Table 5 shows the influence of initial learning rates for Adam on the experimental results. Initial learning rates of 0.001, 0.0005, and 0.0001 are selected in the experiments. The results show that the best accuracy could be obtained by setting the initial learning rate as 0.0005 on TR and AU datasets, 0.001 on MU dataset, and 0.0001 on HU dataset.

3.2.2. Depth and Heads

Figure 9 depicts the synergistic effect of the number of attention heads, the depth of the encoders, and the depth of the Fusion Module. The numbers of heads for L-Attention and LC-Attention are identical, and the depths of each encoder and the Fusion Module are the same. Four combinations (4 + 2, 4 + 1, 8 + 2, and 8 + 1) were tested, and the best accuracy was obtained on all datasets by setting the number of attention heads to 4 and the depth of the encoders and the Fusion Module to 2.

3.3. Performance Comparison

In this section, the proposed AGMLT is compared with DMCN [41], SpectralFormer [13], SSFTT [15], morpFormer [42], CoupledCNN [32], MFT_PT [28], MFT_CT [28], and HCT [29] for validating the classification performance. The initial learning rates for the baseline HCT are consistent with the original paper, which for the TR and HU are 0.001, for the MU is 0.0001, and for the AU is 0.0005. The depth for the Fusion Encoder of HCT on all datasets is 2, the depth in the transformer encoder and cross-attention is 1, and the attention heads for TR, MU, AU, and HU are 4, 8, 8, and 8 based on the source code. The initial learning rates for DMCN, SpectralFormer, SSFTT, and CoupledCNN are consistent with AGMLT for obtaining optimal performance, and for morpFormer, MFT_PT, and MFT_CT are 0.0005 as in the original papers. The classification results and classification maps on all datasets of the methods are outlined in Section 3.3.1, and Section 3.3.2 shows the comparison of consumption and computational complexity for all methods.

3.3.1. Experimental Results

The classification results of the proposed AGMLT and all the comparison methods are shown in Table 6, Table 7, Table 8 and Table 9. It can be seen that the proposed AGMLT achieves the best results on all evaluation indicators, with the OA reaching 99.72%, 90.16%, 97.80%, and 99.93%, the AA reaching 99.57%, 92.47%, 89.35%, and 99.95%, and K × 100 reaching 99.62, 87.14, 96.85, and 99.93 on the TR, MU, AU, and HU datasets, respectively.
1. TR dataset
As shown in Table 6, SpectralFormer has the worst classification results because it directly flattens the image block into a vector, which destroys the internal structural information of the image. Coupled CNN is the second worst because its structure is relatively simple and its ability to extract features is relatively weak. The proposed AGMLT improves the OA by 0.37%, 1.73%, 0.54%, 0.70%, 1.33%, 0.61%, 0.27%, and 0.10% compared to DMCN, SpectralFormer, SSFTT, morpFormer, Coupled CNN, MFT_PT, MFT_CT, and HCT. At the same time, the proposed AGMLT improves the AA by 0.70%, 3.03%, 0.87%, 1.08%, 2.19%, 0.90%, 0.54%, and 0.26%, and K × 100 by 0.49%, 2.31%, 0.72%, 0.93%, 1.77%, 0.81%, 0.36%, and 0.13%, respectively. In addition, it can be found that SSFTT, morpFormer, HCT, and the proposed AGMLT reach 100% accuracy on category 4, and that SSFTT, morpFormer, and the proposed AGMLT also reach 100% on category 3. This is because the sample distributions of these two categories are simple, which makes their feature information easy to learn. From Figure 10, the salt-and-pepper noise of AGMLT is the least among all of the compared methods.
Table 6. Classification results of all methods on the TR dataset (mean ± std, %; the best value in each row is in bold in the original article). DMCN, SpectralFormer, and SSFTT take HSI as input; the remaining methods take HSI and LiDAR-DSM as input.

No. | DMCN | SpectralFormer | SSFTT | morpFormer | Coupled CNN | MFT_PT | MFT_CT | HCT | AGMLT
1 | 99.65 ± 0.35 | 99.10 ± 0.72 | 98.84 ± 0.61 | 97.89 ± 0.75 | 99.18 ± 0.61 | 97.65 ± 0.45 | 98.20 ± 0.44 | 99.57 ± 0.37 | 99.47 ± 0.14
2 | 99.74 ± 0.49 | 94.49 ± 0.39 | 98.01 ± 0.50 | 96.49 ± 2.57 | 92.92 ± 6.24 | 97.93 ± 0.48 | 98.74 ± 0.64 | 98.85 ± 0.28 | 98.81 ± 0.37
3 | 99.44 ± 0.56 | 97.54 ± 0.58 | 100 ± 0.00 | 100 ± 0.00 | 99.68 ± 0.32 | 99.73 ± 0.27 | 98.88 ± 1.12 | 99.41 ± 0.59 | 100 ± 0.00
4 | 99.99 ± 0.01 | 99.92 ± 0.08 | 100 ± 0.00 | 100 ± 0.00 | 99.96 ± 0.04 | 99.91 ± 0.09 | 99.99 ± 0.01 | 100 ± 0.00 | 100 ± 0.00
5 | 99.97 ± 0.03 | 99.65 ± 0.23 | 99.99 ± 0.01 | 99.97 ± 0.02 | 99.84 ± 0.16 | 99.92 ± 0.08 | 99.96 ± 0.04 | 99.99 ± 0.01 | 99.97 ± 0.02
6 | 96.42 ± 1.12 | 88.51 ± 5.55 | 95.38 ± 2.23 | 96.58 ± 2.84 | 92.71 ± 5.06 | 96.87 ± 1.46 | 98.38 ± 0.90 | 98.01 ± 0.98 | 99.14 ± 0.20
OA (%) | 99.35 ± 0.17 | 97.99 ± 0.64 | 99.18 ± 0.12 | 99.02 ± 0.28 | 98.39 ± 1.28 | 99.11 ± 0.19 | 99.45 ± 0.10 | 99.62 ± 0.14 | 99.72 ± 0.04
AA (%) | 98.87 ± 0.35 | 96.54 ± 0.51 | 98.70 ± 0.22 | 98.49 ± 0.42 | 97.38 ± 1.94 | 98.67 ± 0.30 | 99.03 ± 0.32 | 99.31 ± 0.32 | 99.57 ± 0.07
K × 100 | 99.13 ± 0.58 | 97.31 ± 0.49 | 98.90 ± 0.17 | 98.69 ± 0.38 | 97.85 ± 1.72 | 98.81 ± 0.12 | 99.26 ± 0.14 | 99.49 ± 0.18 | 99.62 ± 0.05
Figure 10. Classification images of different methods on TR. (a) Ground-truth image; (b) DMCN (99.35%); (c) SpectralFormer (97.99%); (d) SSFTT (99.18%); (e) morpFormer (99.02%); (f) Coupled CNN (98.39%); (g) MFT_PT (99.11%); (h) MFT_CT (99.45%); (i) HCT (99.62%); (j) AGMLT (99.72%).
2. MU dataset
As shown in Table 7, Coupled CNN has the worst classification results, and MFT_PT is the second worst. This is because MFT_PT only carries out convolutional feature extraction on HSI. The OA of the proposed AGMLT increased by 2.77%, 3.08%, 3.10%, 5.20%, 6.49%, 5.83%, 5.35%, and 2.22% compared to DMCN, SpectralFormer, SSFTT, morpFormer, Coupled CNN, MFT_PT, MFT_CT, and HCT. Meanwhile, the AA increased by 2.38%, 3.17%, 3.29%, 4.90%, 5.21%, 5.70%, 5.28%, and 3.11%, and K × 100 increased by 3.54%, 4.00%, 3.95%, 6.53%, 8.28%, 7.38%, 6.77%, and 2.90%, respectively. The uneven and complex sample distribution of the MU dataset presents a significant challenge for the classification accuracy of all methods. AGMLT stands out due to its ability to exploit rich dynamic feature information, resulting in a superior classification effect compared to the other algorithms; this advantage is likely attributable to its design, which adapts effectively to the complex sample distribution of the MU dataset. From Figure 11, the classification image of AGMLT is closest to the ground-truth image.
Table 7. Classification results of all methods on the MU dataset (mean ± std, %; the best value in each row is in bold in the original article). DMCN, SpectralFormer, and SSFTT take HSI as input; the remaining methods take HSI and LiDAR-DSM as input.

No. | DMCN | SpectralFormer | SSFTT | morpFormer | Coupled CNN | MFT_PT | MFT_CT | HCT | AGMLT
1 | 87.76 ± 2.37 | 88.62 ± 0.36 | 88.16 ± 0.57 | 85.14 ± 2.26 | 86.29 ± 0.78 | 86.42 ± 1.22 | 86.26 ± 2.91 | 90.04 ± 3.34 | 90.52 ± 2.43
2 | 84.85 ± 6.81 | 78.01 ± 9.75 | 84.27 ± 9.82 | 79.49 ± 6.40 | 87.09 ± 2.12 | 81.96 ± 3.20 | 77.81 ± 13.68 | 82.84 ± 1.45 | 90.56 ± 1.81
3 | 78.90 ± 3.35 | 81.75 ± 8.58 | 79.53 ± 3.86 | 81.83 ± 2.22 | 76.96 ± 1.58 | 77.24 ± 0.99 | 79.82 ± 2.12 | 77.69 ± 3.65 | 82.46 ± 1.23
4 | 96.42 ± 1.54 | 94.88 ± 2.49 | 93.89 ± 7.73 | 96.30 ± 0.65 | 94.93 ± 2.56 | 92.79 ± 1.91 | 92.96 ± 1.96 | 94.44 ± 2.74 | 96.76 ± 0.73
5 | 88.05 ± 3.91 | 88.62 ± 0.36 | 84.34 ± 3.17 | 79.83 ± 5.17 | 77.89 ± 3.72 | 79.12 ± 1.07 | 78.89 ± 2.45 | 86.28 ± 2.47 | 89.69 ± 1.63
6 | 99.84 ± 0.16 | 99.43 ± 0.57 | 99.68 ± 0.32 | 99.56 ± 0.38 | 99.84 ± 0.19 | 99.24 ± 0.76 | 99.24 ± 0.76 | 99.40 ± 0.60 | 99.87 ± 0.15
7 | 92.44 ± 3.04 | 91.38 ± 2.04 | 94.30 ± 2.57 | 90.16 ± 2.72 | 92.06 ± 0.96 | 91.22 ± 2.59 | 91.54 ± 3.61 | 92.99 ± 2.98 | 95.10 ± 1.74
8 | 94.56 ± 2.32 | 92.28 ± 0.85 | 93.03 ± 1.47 | 92.82 ± 2.00 | 77.03 ± 8.71 | 90.24 ± 1.98 | 93.18 ± 2.58 | 94.27 ± 1.28 | 94.62 ± 2.76
9 | 75.45 ± 3.57 | 76.79 ± 0.93 | 78.93 ± 1.39 | 76.16 ± 6.12 | 75.30 ± 6.98 | 67.32 ± 6.02 | 70.99 ± 5.25 | 75.67 ± 3.28 | 83.61 ± 0.66
10 | 94.24 ± 5.76 | 93.94 ± 6.06 | 86.68 ± 10.92 | 83.03 ± 10.43 | 92.12 ± 1.21 | 87.88 ± 12.12 | 90.30 ± 0.61 | 95.00 ± 1.31 | 93.93 ± 9.38
11 | 99.24 ± 0.76 | 99.50 ± 0.52 | 98.32 ± 1.68 | 98.99 ± 0.34 | 97.82 ± 2.18 | 98.99 ± 1.01 | 98.15 ± 1.85 | 90.00 ± 9.00 | 100 ± 0.00
OA (%) | 87.39 ± 1.12 | 87.08 ± 1.24 | 87.06 ± 0.85 | 84.96 ± 1.10 | 83.67 ± 1.46 | 84.33 ± 0.76 | 84.81 ± 1.34 | 87.94 ± 0.48 | 90.16 ± 1.49
AA (%) | 90.09 ± 0.99 | 89.30 ± 1.12 | 89.18 ± 1.64 | 87.57 ± 0.80 | 87.26 ± 1.64 | 86.77 ± 1.52 | 87.19 ± 0.60 | 89.36 ± 1.26 | 92.47 ± 1.33
K × 100 | 83.60 ± 0.21 | 83.14 ± 1.64 | 83.19 ± 0.43 | 80.61 ± 1.31 | 78.86 ± 0.33 | 79.76 ± 0.93 | 80.37 ± 1.59 | 84.24 ± 1.55 | 87.14 ± 1.86
Figure 11. Classification images of different methods on MU. (a) Ground-truth Image; (b) DMCN (87.39%); (c) SpectralFormer (87.08%); (d) SSFTT (87.06%); (e) morpFormer (84.96%); (f) Coupled CNN (83.67%); (g) MFT_PT (84.33%); (h) MFT_CT (84.81%); (i) HCT (87.94%); (j) AGMLT (90.16%).
3. AU dataset
As seen in Table 8, similar to the TR dataset, SpectralFormer has the worst classification results, and Coupled CNN is the second worst. The OA of the proposed AGMLT increased by 1.56%, 3.91%, 0.72%, 0.95%, 2.79%, 1.45%, 1.28%, and 0.86% compared to DMCN, SpectralFormer, SSFTT, morpFormer, Coupled CNN, MFT_PT, MFT_CT, and HCT. Simultaneously, the AA of the proposed method increased by 8.32%, 17.69%, 3.23%, 1.32%, 7.91%, 3.74%, 2.46%, and 3.01%, and K × 100 increased by 2.25%, 5.63%, 1.04%, 1.37%, 4.06%, 2.07%, 1.83%, and 1.24%, respectively. From Figure 12, for the proposed AGMLT, the salt-and-pepper noise is the least compared to the comparison methods.
Table 8. Classification results of all methods on the AU dataset (mean ± std, %; the best value in each row is in bold in the original article). DMCN, SpectralFormer, and SSFTT take HSI as input; the remaining methods take HSI and LiDAR-DSM as input.

No. | DMCN | SpectralFormer | SSFTT | morpFormer | Coupled CNN | MFT_PT | MFT_CT | HCT | AGMLT
1 | 98.59 ± 0.56 | 86.10 ± 0.44 | 98.82 ± 0.08 | 97.71 ± 0.21 | 89.59 ± 6.02 | 98.38 ± 0.53 | 98.29 ± 1.31 | 98.75 ± 0.49 | 99.31 ± 0.20
2 | 98.52 ± 0.44 | 96.10 ± 1.44 | 99.02 ± 0.33 | 98.54 ± 0.25 | 98.55 ± 0.61 | 98.20 ± 0.26 | 98.14 ± 2.86 | 98.66 ± 0.41 | 99.10 ± 0.18
3 | 87.64 ± 1.51 | 75.99 ± 8.92 | 90.13 ± 1.39 | 89.69 ± 1.46 | 87.65 ± 1.39 | 89.24 ± 2.23 | 88.60 ± 1.20 | 88.45 ± 2.78 | 93.10 ± 2.23
4 | 99.02 ± 0.58 | 98.66 ± 0.34 | 98.77 ± 0.34 | 98.53 ± 0.11 | 99.39 ± 0.26 | 97.88 ± 0.28 | 98.37 ± 0.35 | 98.93 ± 0.21 | 99.29 ± 0.12
5 | 71.08 ± 3.99 | 48.88 ± 7.61 | 79.09 ± 5.60 | 84.88 ± 3.06 | 75.54 ± 7.72 | 78.43 ± 0.36 | 86.18 ± 8.12 | 81.08 ± 7.95 | 87.09 ± 5.21
6 | 47.82 ± 5.15 | 27.56 ± 9.54 | 70.12 ± 3.37 | 75.45 ± 3.58 | 58.62 ± 9.20 | 70.68 ± 3.28 | 71.17 ± 2.02 | 69.00 ± 1.26 | 76.69 ± 3.59
7 | 64.51 ± 1.86 | 55.50 ± 4.95 | 66.88 ± 1.20 | 71.36 ± 3.58 | 60.73 ± 1.52 | 66.41 ± 6.30 | 67.52 ± 4.04 | 69.52 ± 4.05 | 70.85 ± 1.95
OA (%) | 96.24 ± 1.36 | 93.89 ± 0.27 | 97.08 ± 0.18 | 96.85 ± 0.07 | 95.01 ± 1.31 | 96.35 ± 0.24 | 96.52 ± 0.31 | 96.94 ± 0.33 | 97.80 ± 0.06
AA (%) | 81.03 ± 2.30 | 71.66 ± 2.58 | 86.12 ± 1.93 | 88.03 ± 1.21 | 81.44 ± 2.86 | 85.61 ± 1.11 | 86.89 ± 1.25 | 86.34 ± 1.51 | 89.35 ± 0.92
K × 100 | 94.60 ± 0.42 | 91.22 ± 0.43 | 95.81 ± 0.25 | 95.48 ± 0.10 | 92.79 ± 1.91 | 94.78 ± 0.34 | 95.02 ± 0.44 | 95.61 ± 0.47 | 96.85 ± 0.08
Figure 12. Classification images of different methods on AU. (a) Ground-truth Image; (b) DMCN (96.24%); (c) SpectralFormer (93.89%); (d) SSFTT (97.08%); (e) morpFormer (96.85%); (f) Coupled CNN (95.01%); (g) MFT_PT (96.35%); (h) MFT_CT (96.52%); (i) HCT (96.94%); (j) AGMLT (97.80%).
4. HU dataset
As shown in Table 9, Coupled CNN has the worst classification results, and DMCN has the second worst. The proposed AGMLT increased by 1.09%, 1.04%, 0.20%, 0.57%, 1.39%, 0.33%, 0.47%, and 0.20% on OA compared to DMCN, SpectralFormer, SSFTT, morpFormer, Coupled CNN, MFT_PT, MFT_CT, and HCT. The value of AA increased by 0.90%, 1.05%, 0.16%, 0.52%, 1.10%, 0.27%, 0.40%, and 0.17%, and the value of K × 100 increased by 1.19%, 1.13%, 0.22%, 0.63%, 1.52%, 0.36%, 0.52%, and 0.23%, respectively. In addition, DMCN and SpectralFormer have similar classification performance on the HU datasets. Simultaneously, SSFTT and HCT have similar classification performance. From Figure 13, the higher classification accuracy leads to less salt-and-pepper noise. This indicates that AGMLT effectively enhances the joint classification performance.
Table 9. Classification results of all methods on the HU dataset (mean ± std, %; the best value in each row is in bold in the original article). DMCN, SpectralFormer, and SSFTT take HSI as input; the remaining methods take HSI and LiDAR-DSM as input.

No. | DMCN | SpectralFormer | SSFTT | morpFormer | Coupled CNN | MFT_PT | MFT_CT | HCT | AGMLT
1 | 98.35 ± 0.80 | 99.34 ± 0.66 | 99.74 ± 0.26 | 99.18 ± 0.36 | 99.91 ± 0.09 | 99.32 ± 0.68 | 98.94 ± 0.97 | 98.77 ± 1.05 | 99.81 ± 0.08
2 | 98.54 ± 3.24 | 98.89 ± 1.11 | 99.91 ± 0.09 | 99.19 ± 0.32 | 99.94 ± 0.06 | 99.53 ± 0.85 | 99.49 ± 0.51 | 99.70 ± 0.30 | 99.84 ± 0.16
3 | 98.05 ± 2.88 | 100 ± 0.00 | 99.96 ± 0.04 | 99.47 ± 0.47 | 99.92 ± 0.08 | 99.76 ± 0.24 | 99.88 ± 0.12 | 99.92 ± 0.08 | 100 ± 0.00
4 | 98.74 ± 0.98 | 99.72 ± 0.28 | 99.66 ± 0.15 | 99.56 ± 0.16 | 94.56 ± 5.15 | 94.43 ± 0.57 | 98.28 ± 1.72 | 99.56 ± 0.35 | 100 ± 0.00
5 | 100 ± 0.00 | 99.39 ± 0.61 | 99.92 ± 0.08 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00
6 | 96.89 ± 3.11 | 99.30 ± 0.70 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00
7 | 96.05 ± 1.37 | 98.06 ± 1.94 | 99.63 ± 0.38 | 98.88 ± 0.65 | 99.06 ± 0.75 | 99.31 ± 0.69 | 99.85 ± 0.15 | 99.74 ± 0.26 | 100 ± 0.00
8 | 94.60 ± 4.45 | 98.33 ± 1.42 | 99.44 ± 0.10 | 98.35 ± 0.25 | 96.81 ± 1.28 | 99.55 ± 0.45 | 99.47 ± 0.53 | 99.87 ± 0.13 | 100 ± 0.00
9 | 94.34 ± 5.52 | 96.20 ± 2.19 | 99.68 ± 0.32 | 98.65 ± 1.45 | 96.94 ± 2.02 | 99.09 ± 0.91 | 99.13 ± 0.87 | 99.23 ± 0.37 | 100 ± 0.00
10 | 99.83 ± 0.17 | 99.83 ± 0.17 | 99.77 ± 0.23 | 99.94 ± 0.09 | 99.83 ± 0.17 | 99.81 ± 0.19 | 99.98 ± 0.02 | 99.98 ± 0.02 | 100 ± 0.00
11 | 99.31 ± 0.69 | 99.48 ± 0.32 | 99.79 ± 0.21 | 100 ± 0.00 | 99.28 ± 0.57 | 99.72 ± 0.28 | 99.49 ± 0.51 | 99.98 ± 0.02 | 100 ± 0.00
12 | 97.14 ± 2.51 | 99.27 ± 0.23 | 99.63 ± 0.37 | 99.46 ± 0.20 | 99.06 ± 0.56 | 99.81 ± 0.19 | 99.29 ± 0.23 | 99.67 ± 0.33 | 99.57 ± 0.04
13 | 94.03 ± 5.96 | 95.99 ± 3.64 | 99.93 ± 0.07 | 98.71 ± 1.82 | 99.65 ± 0.35 | 99.86 ± 0.14 | 99.58 ± 0.42 | 99.72 ± 0.28 | 100 ± 0.00
14 | 99.90 ± 0.10 | 99.92 ± 0.08 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00
15 | 100 ± 0.00 | 99.83 ± 0.17 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00 | 99.92 ± 0.08 | 100 ± 0.00 | 100 ± 0.00
OA (%) | 98.84 ± 0.29 | 98.89 ± 0.70 | 99.73 ± 0.14 | 99.36 ± 0.24 | 98.54 ± 0.49 | 99.60 ± 0.15 | 99.46 ± 0.29 | 99.73 ± 0.16 | 99.93 ± 0.02
AA (%) | 99.05 ± 0.38 | 98.90 ± 0.45 | 99.79 ± 0.11 | 99.43 ± 0.29 | 98.85 ± 0.31 | 99.68 ± 0.13 | 99.55 ± 0.24 | 99.78 ± 0.22 | 99.95 ± 0.01
K × 100 | 98.74 ± 0.31 | 98.80 ± 0.32 | 99.71 ± 0.16 | 99.30 ± 0.26 | 98.41 ± 0.53 | 99.57 ± 0.16 | 99.41 ± 0.32 | 99.70 ± 0.16 | 99.93 ± 0.02
Figure 13. Classification images of different methods on HU. (a) Ground-truth image; (b) DMCN (98.84%); (c) SpectralFormer (98.89%); (d) SSFTT (99.73%); (e) morpFormer (99.36%); (f) Coupled CNN (98.54%); (g) MFT_PT (99.60%); (h) MFT_CT (99.46%); (i) HCT (99.73%); (j) AGMLT (99.93%).

3.3.2. Consumption and Computational Complexity

To comprehensively compare AGMLT with the comparison methods, the total parameters (TPs), training time (Tr), test time (Te), and Flops of all methods are evaluated in this section. The results are presented in Table 10. Since the data are padded in the convolution part to align the feature sizes, the number of parameters and the complexity of the model increase, while the added learnable features improve the classification accuracy.
The experimental settings are the same as previously mentioned. AGMLT has fewer total parameters than DMCN but a longer running time and larger Flops, and it has a shorter running time than MFT_PT and MFT_CT but more total parameters and larger Flops. However, compared with SpectralFormer and SSFTT, AGMLT has more total parameters, larger Flops, and a longer running time. Taking the TR and HU datasets as examples, AGMLT has a shorter test time than morpFormer but more total parameters and Flops and a longer training time; taking the MU and AU datasets as examples, AGMLT has more total parameters, larger Flops, and a longer running time than morpFormer. Furthermore, compared with HCT, AGMLT has more Flops and a longer running time, and it has more total parameters on the TR dataset but fewer on the other datasets. Nevertheless, the classification performance of AGMLT is the best overall.

4. Discussion

4.1. Ablation Analysis

This section takes the TR dataset as an example to conduct ablation experiments to verify the effectiveness of different components. The first column in Table 11 is the convolutional feature extraction module SSAGM shown in Figure 1, and its specific ablation experiments are shown in Table 12. The second column of Table 11 is the L-Former shown in Figure 1, where LS and LTM represent the layer scale and learnable transition matrix in Figure 3, respectively. The third column in Table 11 is the cross-attention with the learnable transition matrix. As outlined in the table, the classification accuracy of the AGMLT proposed in this paper is the best. Each component plays a positive role in improving classification accuracy.
Detailed ablation experiments on the convolutional feature extraction are presented in Table 12. PDWA mainly extracts the spectral features of HSI. ADWA(H) denotes the spatial feature extraction of HSI, while ADWA(L) denotes the spatial feature extraction of the LiDAR-DSM data. Different combinations of the three attention modules were verified experimentally, and the best combination and ordering, which achieves the best classification effect, was finally obtained.
Table 13 shows the effect of the asymmetric convolution kernels on AGMLT. The asymmetric convolution kernel can improve classification accuracy while reducing the number of parameters and Flops of the model. A 3D convolution kernel can be divided into many two-dimensional convolution kernels, and when a 2D kernel has rank 1, it is equivalent to a series of one-dimensional convolutions, which strengthens the kernel skeleton of the CNN while reducing the parameters.
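The parameter saving can be illustrated with a small comparison between a square 3 × 3 depthwise kernel and the asymmetric 3 × 1 + 1 × 3 pair used in ADWA; the channel count of 64 below is an arbitrary example, not a setting from the paper.

```python
import torch.nn as nn

c = 64
square = nn.Conv2d(c, c, 3, padding=1, groups=c)                          # 3x3 depthwise
asym = nn.Sequential(nn.Conv2d(c, c, (3, 1), padding=(1, 0), groups=c),   # 3x1 depthwise
                     nn.Conv2d(c, c, (1, 3), padding=(0, 1), groups=c))   # 1x3 depthwise
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(square), count(asym))   # 640 vs. 512 parameters: the asymmetric pair is smaller
```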
In this paper, we compare the classification accuracies of HSI or LiDAR-DSM alone and the combination of the two data. As indicated in Table 14, by fusing the HSI and LiDAR-DSM, it is possible to achieve a more accurate and robust classification outcome than would be possible using either source of data alone. HSI can provide rich spectral information, and LiDAR-DSM can supplement accurate orientation and distance information.

4.2. Loss Functions

AGMLT with different loss functions is compared on the four multi-modal datasets in this section. LCE stands for cross-entropy loss, LFC for focal loss, LPC for poly loss (CE form), and LPF for poly loss (focal form). As shown in Table 15, LPF, which has the best effect, was selected as the loss function of AGMLT.

4.3. Training Percentage

In this section, experiments were conducted to analyze the performance of the proposed AGMLT under different training percentages. The experimental settings are the same as above. The results are shown in Figure 14.
For the TR, AU, and HU datasets, 2%, 4%, 6%, and 8% of the total samples are selected as training samples. However, the sample distribution of the MU dataset is particularly uneven, so 5%, 10%, 15%, and 20% of the total samples are selected for training. The experiments show that the accuracies of all methods improve significantly as the number of training samples increases. Notably, the AGMLT model exhibits superior performance compared to the other methods in all cases, with a particularly notable improvement in accuracy on the MU dataset. This is attributed to the rich learnable features of AGMLT, which adapt more effectively to uneven distributions and improve accuracy. Moreover, the effectiveness of AGMLT across diverse datasets suggests its potential for wide applicability in tasks involving multi-modal data fusion and classification.

5. Conclusions

In this study, an adaptive learning model named AGMLT is proposed. Firstly, SSAGM is used to extract local information; it mainly includes PDWA and ADWA, where PDWA extracts the spectral information of HSI, and ADWA extracts the spatial information of HSI and the elevation information of LiDAR-DSM. Then, by adding a layer scale and a learnable transition matrix to the original transformer encoder and SA, the data dynamics are improved, and the influence of transformer depth on the classification performance is alleviated. Next, the learnable transition matrix in LC-Attention enriches the feature information of the multi-modal data fusion. Finally, training with poly loss allows the model to adapt to different data. A large number of experiments were carried out to verify the effectiveness of AGMLT and its components.
The data padding in SSAGM increases the model complexity and the number of parameters. Therefore, a future research task is to design a precise yet lightweight model. We plan to remove the data padding in SSAGM to shorten the semantic sequence for the subsequent transformer encoder and to explore a multi-scale dynamic gating mechanism combining asymmetric and depthwise separable convolutions to maintain classification performance. The effectiveness of these ideas needs to be assessed in future research.

Author Contributions

Conceptualization, M.W., Y.S. and J.X.; methodology, software, validation, writing—original draft, M.W., Y.S. and R.S.; writing—review and editing, M.W., Y.S. and Y.Z.; supervision, Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Fundamental Research Funds for the Central Universities, grant number 3072022CF0801, the National Key R&D Program of China, grant number 2018YFE0206500, and the National Key Laboratory of Communication Anti Jamming Technology, grant number 614210202030217.

Data Availability Statement

The MUUFL dataset is at https://github.com/GatorSense/MUUFLGulfport/, the Trento and Augsburg datasets are available at https://github.com/AnkurDeria/MFT?tab=readme-ov-file, and the University of Pavia dataset is at https://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes. All the websites can be accessed on 16 March 2024.

Acknowledgments

The authors are grateful to the peer researchers for their source codes as well as the public HSI datasets.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Czaja, W.; Kavalerov, I.; Li, W. Exploring the high dimensional geometry of HSI features. In Proceedings of the 2021 11th Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS), Amsterdam, The Netherlands, 24–26 March 2021; pp. 1–5. [Google Scholar]
  2. Wang, Z.; Menenti, M. Challenges and opportunities in lidar remote sensing. Front. Remote Sens. 2021, 2, 641723. [Google Scholar] [CrossRef]
  3. Roy, S.K.; Kar, P.; Hong, D.; Wu, X.; Plaza, A.; Chanussot, J. Revisiting deep hyperspectral feature extraction networks via gradient centralized convolution. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5516619. [Google Scholar] [CrossRef]
  4. Hestir, E.; Brando, V.; Bresciani, M.; Giardino, C.; Matta, E.; Villa, P.; Dekker, A. Measuring freshwater aquatic ecosystems: The need for a hyperspectral global mapping satellite mission. Remote Sens. Environ. 2015, 167, 181–195. [Google Scholar] [CrossRef]
  5. Shimoni, M.; Haelterman, R.; Perneel, C. Hyperspectral imaging for military and security applications: Combining myriad processing and sensing techniques. IEEE Geosci. Remote Sens. Mag. 2019, 7, 101–117. [Google Scholar] [CrossRef]
  6. Wu, X.; Hong, D.; Chanussot, J. UIU-Net: U-Net in U-Net for infrared small object detection. IEEE Trans. Image Process. 2023, 32, 364–376. [Google Scholar] [CrossRef]
  7. Carrino, T.A.; Crósta, A.P.; Toledo, C.L.B.; Silva, A.M. Hyperspectral remote sensing applied to mineral exploration in southern Peru: A multiple data integration approach in the Chapi Chiara gold prospect. Int. J. Appl. Earth Obs. Geoinf. 2018, 64, 287–300. [Google Scholar]
  8. Schimleck, L.; Ma, T.; Inagaki, T.; Tsuchikawa, S. Review of Near Infrared Hyperspectral Imaging Applications Related to Wood and Wood Products. Appl. Spectrosc. Rev. 2022, 57, 2098759. [Google Scholar] [CrossRef]
  9. Liao, X.; Liao, G.; Xiao, L. Rapeseed Storage Quality Detection Using Hyperspectral Image Technology–An Application for Future Smart Cities. J. Test. Eval. 2022, 51, JTE20220073. [Google Scholar] [CrossRef]
  10. Du, P.; Xia, J.S.; Xue, Z.H. Review of hyperspectral remote sensing image classification. J. Remote Sens. 2016, 20, 236–256. [Google Scholar]
  11. Roy, S.K.; Krishna, G.; Dubey, S.R.; Chaudhuri, B.B. HybridSN: Exploring 3-D–2-D CNN feature hierarchy for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 2020, 17, 277–281. [Google Scholar] [CrossRef]
  12. Sun, Y.; Wang, M.; Wei, C.; Zhong, Y.; Xiang, J. Heterogeneous spectral-spatial network with 3D attention and MLP for hyperspectral image classification using limited training samples. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 8702–8720. [Google Scholar] [CrossRef]
  13. Hong, D.; Han, Z.; Yao, J. SpectralFormer: Rethinking hyperspectral image classification with transformers. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5518615. [Google Scholar] [CrossRef]
  14. Sang, M.; Zhao, Y.; Liu, G. Improving Transformer-Based Networks with Locality for Automatic Speaker Verification. In Proceedings of the 2023 48th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  15. Sun, L.; Zhao, G.; Zheng, Y.; Wu, Z. Spectral–spatial feature tokenization transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14. [Google Scholar] [CrossRef]
  16. Wang, A.; Xing, S.; Zhao, Y.; Wu, H.; Iwahori, Y. A hyperspectral image classification method based on adaptive spectral spatial kernel combined with improved vision transformer. Remote Sens. 2022, 14, 3705. [Google Scholar] [CrossRef]
  17. Li, J.; Bioucas-Dias, J.M.; Plaza, A. Spectral–spatial hyperspectral image segmentation using subspace multinomial logistic regression and Markov random fields. IEEE Trans. Geosci. Remote Sens. 2011, 50, 809–823. [Google Scholar] [CrossRef]
  18. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep hierarchical feature learning on points a metric space. arXiv 2017, arXiv:1706.02413. [Google Scholar]
  19. Pedergnana, M.; Marpu, P.R.; Dalla Mura, M.; Benediktsson, J.A.; Bruzzone, L. Classification of remote sensing optical and LiDAR data using extended attribute profiles. IEEE J. Sel. Top. Signal Process. 2012, 6, 856–865. [Google Scholar] [CrossRef]
  20. Rasti, B.; Ghamisi, P.; Gloaguen, R. Hyperspectral and LiDAR fusion using extinction profiles and total variation component analysis. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3997–4007. [Google Scholar] [CrossRef]
  21. Zhu, X.X.; Tuia, D.; Mou, L.; Xia, G.S.; Zhang, L.; Xu, F.; Fraundorfer, F. Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–36. [Google Scholar] [CrossRef]
  22. Roy, S.K.; Deria, A.; Hong, D. Hyperspectral and LiDAR data classification using joint CNNs and morphological feature learning. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5530416. [Google Scholar] [CrossRef]
  23. Song, W.; Dai, Y.; Gao, Z. Hashing-based deep metric learning for the classification of hyperspectral and LiDAR data. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5704513. [Google Scholar] [CrossRef]
  24. Xu, X.; Li, W.; Ran, Q. Multisource remote sensing data classification based on convolutional neural network. IEEE Trans. Geosci. Remote Sens. 2017, 56, 937–949. [Google Scholar] [CrossRef]
  25. Ding, K.; Lu, T.; Fu, W.; Li, S.; Ma, F. Global–local transformer network for HSI and LiDAR data joint classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5541213. [Google Scholar] [CrossRef]
  26. Zhang, Y.; Peng, Y.; Tu, B.; Liu, Y. Local Information interaction transformer for hyperspectral and LiDAR data classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 16, 1130–1143. [Google Scholar] [CrossRef]
  27. Xu, H.; Zheng, T.; Liu, Y.; Zhang, Z.; Xue, C.; Li, J. A joint convolutional cross ViT network for hyperspectral and light detection and ranging fusion classification. Remote Sens. 2024, 16, 489. [Google Scholar] [CrossRef]
  28. Roy, S.K.; Deria, A.; Hong, D. Multimodal fusion transformer for remote sensing image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5515620. [Google Scholar] [CrossRef]
  29. Zhao, G.; Ye, Q.; Sun, L. Joint classification of hyperspectral and LiDAR data using a hierarchical CNN and transformer. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5500716. [Google Scholar] [CrossRef]
  30. Wang, Y.; Li, Y.; Wang, G.; Liu, X. Multi-scale attention network for single image super-resolution. arXiv 2022, arXiv:2209.14145. [Google Scholar]
  31. Gulati, A.; Qin, J.; Chiu, C.C. Conformer: Convolution-augmented transformer for speech recognition. arXiv 2020, arXiv:2005.08100. [Google Scholar]
  32. Hang, R.; Li, Z.; Ghamisi, P.; Hong, D.; Xia, G.; Liu, Q. Classification of hyperspectral and LiDAR data using coupled CNNs. IEEE Trans. Geosci. Remote Sens. 2020, 58, 4939–4950. [Google Scholar] [CrossRef]
  33. Hendrycks, D.; Gimpel, K. Gaussian Error Linear Units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
  34. Zhou, D.; Kang, B.; Jin, X.; Yang, L. DeepViT: Towards deeper vision transformer. arXiv 2021, arXiv:2103.11886v4. [Google Scholar]
  35. Touvron, H.; Cord, M.; Sablayrolles, A. Going deeper with image transformers. arXiv 2021, arXiv:2103.17239v2. [Google Scholar]
  36. Leng, Z.Q.; Tan, M.X.; Liu, C.X. PolyLoss: A polynomial expansion perspective of classification loss functions. In Proceedings of the 2022 10th IEEE Conference on International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022. [Google Scholar]
  37. Gader, P.; Zare, A.; Close, R.; Aitken, J.; Tuell, G. Muufl Gulfport Hyperspectral and LiDAR Airborne Data Set; Technical Report REP-2013–570; University of Florida: Gainesville, FL, USA, 2013. [Google Scholar]
  38. Du, X.; Zare, A. Scene Label Ground Truth Map for Muufl Gulfport Data Set; Technical Report 20170417; University of Florida: Gainesville, FL, USA, 2017. [Google Scholar]
  39. Baumgartner, A.; Gege, P.; Köhler, C.; Lenhard, K.; Schwarzmaier, T. Characterisation methods for the hyperspectral sensor HySpex at DLR’s calibration home base. Proc. SPIE 2012, 8533, 371–378. [Google Scholar]
  40. Kurz, F.; Rosenbaum, D.; Leitloff, J.; Meynberg, O.; Reinartz, P. Real time camera system for disaster and traffic monitoring. Proceedings of International Conference on SMPR, Tehran, Iran, 18–19 May 2011; pp. 1–6. [Google Scholar]
  41. Xiang, J.H.; Wei, C.; Wang, M.H.; Teng, L. End-to-End Multilevel Hybrid Attention Framework for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2022, 19, 5511305. [Google Scholar] [CrossRef]
  42. Swalpa, K.R.; Ankur, D.; Shah, C. Spectral–spatial morphological attention transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2022, 61, 5503615. [Google Scholar]
Figure 1. Structure for proposed AGMLT model. The SSAGM is proposed to exclude the linear layer and capture local continuity while considering complexity. L-Former is designed to increase the data dynamics and prevent performance degradation as the transformer deepens. LC-Attention is designed for enriching the feature information. Poly loss is a flexible loss function suitable for multi-modal data fusion classification.
Figure 2. Structure for proposed SSAGM. PDWA is used to extract spectral features from HSI. ADWA is used to extract spatial features from HSI and elevation information from LiDAR-DSM.
Figure 3. Structure for proposed L-Former. The layer scale makes features more detailed, while the learnable transfer matrix overcomes the problem of centralized decomposition and can train deeper transformers.
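To make the mechanism named in the caption of Figure 3 concrete, the block below is a minimal PyTorch sketch of self-attention equipped with a layer scale on the residual branch and a learnable transfer matrix that re-mixes the per-head attention maps. Initialization values, normalization placement, and other details are assumptions rather than the paper's reference code.

import torch
import torch.nn as nn

class LearnableSelfAttention(nn.Module):
    # Sketch: multi-head self-attention whose attention maps are mixed across heads by a
    # learnable transfer matrix, with a small learnable per-channel layer scale on the
    # residual update. A re-normalization of the mixed maps could follow; it is omitted here.
    def __init__(self, dim, heads=4, init_scale=1e-4):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.norm = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.transfer = nn.Parameter(torch.eye(heads))              # learnable transfer matrix
        self.layer_scale = nn.Parameter(init_scale * torch.ones(dim))  # layer scale weights

    def forward(self, x):                                           # x: (B, N, dim)
        b, n, d = x.shape
        q, k, v = self.qkv(self.norm(x)).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        attn = ((q @ k.transpose(-2, -1)) / self.head_dim ** 0.5).softmax(dim=-1)
        attn = torch.einsum('hg,bgnm->bhnm', self.transfer, attn)   # mix attention maps across heads
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return x + self.layer_scale * self.proj(out)

print(LearnableSelfAttention(64)(torch.randn(2, 17, 64)).shape)     # torch.Size([2, 17, 64])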
Figure 4. Structure for proposed LC-Attention. (a) Fusion encoding module of HSI feature representations; (b) fusion encoding module of LiDAR-DSM feature representations.
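As a simplified view of the fusion step shown in Figure 4, the sketch below lets each modality query the other with standard multi-head cross-attention. The learnable transfer matrix and other LC-Attention details are omitted, so this is an approximation rather than the actual module.

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    # Sketch: HSI tokens attend to LiDAR-DSM tokens and vice versa, with residual connections.
    def __init__(self, dim, heads=4):
        super().__init__()
        self.hsi_from_lidar = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lidar_from_hsi = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, hsi_tokens, lidar_tokens):        # (B, N, dim), (B, M, dim)
        hsi_out, _ = self.hsi_from_lidar(hsi_tokens, lidar_tokens, lidar_tokens)
        lidar_out, _ = self.lidar_from_hsi(lidar_tokens, hsi_tokens, hsi_tokens)
        return hsi_tokens + hsi_out, lidar_tokens + lidar_out

h, l = CrossModalFusion(64)(torch.randn(2, 17, 64), torch.randn(2, 5, 64))
print(h.shape, l.shape)                                 # (2, 17, 64) (2, 5, 64)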
Figure 5. TR dataset. (a) Pseudo-color image; (b) grayscale image; (c) ground-truth image.
Figure 6. MU dataset. (a) Pseudo-color image; (b) grayscale image; (c) ground-truth image.
Figure 7. AU dataset. (a) Pseudo-color image; (b) grayscale image; (c) ground-truth image.
Figure 8. HU dataset. (a) Pseudo-color image; (b) grayscale image; (c) ground-truth image.
Figure 9. Combined effect of the number of attention heads and the encoder depth on the four datasets. (a) TR dataset (4 + 2); (b) MU dataset (4 + 2); (c) AU dataset (4 + 2); (d) HU dataset (4 + 2). The horizontal axis is the encoder depth, and the vertical axis is the OA (%) value. The blue circles represent four attention heads, and the orange circles represent eight attention heads.
Figure 14. Classification results under different training percentages. (a) TR dataset; (b) MU dataset; (c) AU dataset; (d) HU dataset. The accuracy of every method improves significantly as the number of training samples increases, and AGMLT exhibits superior performance in all cases.
Table 1. Details on TR dataset.
No. | Class Name | Training Samples | Test Samples
1 | Apple Trees | 129 | 3905
2 | Buildings | 125 | 2778
3 | Ground | 105 | 374
4 | Woods | 154 | 9896
5 | Vineyard | 184 | 10,317
6 | Roads | 122 | 3052
Total |  | 819 | 29,395
Table 2. Details on MU dataset.
No. | Class Name | Training Samples | Test Samples
1 | Trees | 150 | 23,096
2 | Mostly Grass | 150 | 4120
3 | Mixed Ground Surface | 150 | 6732
4 | Dirt and Sand | 150 | 1676
5 | Road | 150 | 6537
6 | Water | 150 | 316
7 | Buildings Shadow | 150 | 2083
8 | Buildings | 150 | 6090
9 | Sidewalk | 150 | 1235
10 | Yellow Curb | 150 | 33
11 | Cloth Panels | 150 | 119
Total |  | 1650 | 52,037
Table 3. Details on AU dataset.
No. | Class Name | Training Samples | Test Samples
1 | Forest | 675 | 12,832
2 | Residential Area | 1516 | 28,813
3 | Industrial Area | 192 | 3659
4 | Low Plants | 1342 | 25,515
5 | Allotment | 28 | 547
6 | Commercial Area | 82 | 1563
7 | Water | 76 | 1454
Total |  | 3911 | 74,383
Table 4. Details on HU dataset.
No. | Class Name | Training Samples | Test Samples
1 | Healthy Grass | 198 | 1053
2 | Stressed Grass | 190 | 1064
3 | Synthetic Grass | 192 | 505
4 | Trees | 188 | 1056
5 | Soil | 186 | 1056
6 | Water | 182 | 143
7 | Residential | 196 | 1072
8 | Commercial | 191 | 1053
9 | Road | 193 | 1059
10 | Highway | 191 | 1036
11 | Railway | 181 | 1054
12 | Parking Lot 1 | 192 | 1041
13 | Parking Lot 2 | 184 | 285
14 | Tennis Court | 181 | 247
15 | Running Track | 187 | 473
Total |  | 2832 | 12,197
Table 5. OA (%) under different initial learning rates on each dataset (the bold represents the optimum accuracy).
Datasets | 0.001 | 0.0005 | 0.0001
TR | 99.66 ± 0.04 | 99.72 ± 0.04 | 99.58 ± 0.09
MU | 90.16 ± 1.49 | 87.44 ± 1.89 | 87.82 ± 1.03
AU | 97.60 ± 0.16 | 97.80 ± 0.06 | 97.50 ± 0.11
HU | 99.65 ± 0.06 | 99.70 ± 0.05 | 99.93 ± 0.02
Table 10. Consumption and computational complexity of each dataset (the bold represents the optimum accuracy).
TR dataset:
Methods | TPs | Tr (s) | Te (s) | Flops | OA (%)
DMCN | 2.77 M | 20.22 | 1.69 | 3.21 G | 99.35 ± 0.17
SpectralFormer | 97.33 K | 46.80 | 3.55 | 192.68 M | 97.99 ± 0.64
SSFTT | 147.84 K | 22.08 | 1.51 | 447.18 M | 99.18 ± 0.12
morpFormer | 62.56 K | 38.36 | 4.38 | 334.43 M | 99.02 ± 0.28
CoupledCNN | 104.18 K | 7.68 | 0.78 | 169.08 M | 98.39 ± 1.28
MFT_PT | 221.29 K | 58.50 | 7.98 | 312.91 M | 99.11 ± 0.19
MFT_CT | 221.29 K | 82.33 | 11.60 | 312.91 M | 99.45 ± 0.10
HCT | 465.62 K | 14.53 | 1.28 | 519.16 M | 99.62 ± 0.14
AGMLT | 837.08 K | 50.44 | 3.97 | 4.91 G | 99.72 ± 0.04
MU dataset:
Methods | TPs | Tr (s) | Te (s) | Flops | OA (%)
DMCN | 2.77 M | 34.40 | 3.04 | 3.21 G | 87.39 ± 1.12
SpectralFormer | 97.65 K | 93.22 | 6.22 | 192.70 M | 87.08 ± 1.24
SSFTT | 148.16 K | 38.06 | 2.78 | 447.20 M | 87.06 ± 0.85
morpFormer | 62.56 K | 77.67 | 7.11 | 334.43 M | 84.96 ± 1.10
CoupledCNN | 106.11 K | 18.47 | 1.38 | 169.20 M | 83.67 ± 1.46
MFT_PT | 221.61 K | 115.80 | 14.10 | 312.93 M | 84.33 ± 0.76
MFT_CT | 221.61 K | 163.87 | 20.39 | 312.93 M | 84.81 ± 1.34
HCT | 728.09 K | 26.84 | 2.27 | 569.55 M | 87.94 ± 0.48
AGMLT | 837.40 K | 120.48 | 9.55 | 4.91 G | 90.16 ± 1.49
AU dataset:
Methods | TPs | Tr (s) | Te (s) | Flops | OA (%)
DMCN | 2.77 M | 76.96 | 3.82 | 3.21 G | 96.24 ± 1.36
SpectralFormer | 97.39 K | 202.32 | 8.03 | 192.68 M | 93.89 ± 0.27
SSFTT | 147.90 K | 93.01 | 3.97 | 447.18 M | 97.08 ± 0.18
morpFormer | 62.56 K | 185.38 | 10.22 | 334.43 M | 96.85 ± 0.07
CoupledCNN | 104.57 K | 37.86 | 2.03 | 169.11 M | 95.01 ± 1.31
MFT_PT | 221.35 K | 272.02 | 20.03 | 312.91 M | 96.35 ± 0.24
MFT_CT | 221.35 K | 397.32 | 29.77 | 312.91 M | 96.52 ± 0.31
HCT | 727.83 K | 60.74 | 3.42 | 569.52 M | 96.94 ± 0.33
AGMLT | 837.14 K | 258.56 | 12.43 | 4.91 G | 97.80 ± 0.06
HU dataset:
Methods | TPs | Tr (s) | Te (s) | Flops | OA (%)
DMCN | 2.78 M | 23.49 | 0.93 | 3.21 G | 98.84 ± 0.29
SpectralFormer | 97.91 K | 153.84 | 1.43 | 192.71 M | 98.89 ± 0.70
SSFTT | 148.42 K | 28.37 | 0.37 | 447.22 M | 99.73 ± 0.14
morpFormer | 62.56 K | 134.35 | 1.85 | 334.43 M | 99.36 ± 0.24
CoupledCNN | 107.66 K | 27.98 | 0.37 | 169.30 M | 98.54 ± 0.49
MFT_PT | 221.87 K | 195.11 | 3.32 | 312.95 M | 99.60 ± 0.15
MFT_CT | 221.87 K | 332.68 | 5.50 | 312.95 M | 99.46 ± 0.29
HCT | 728.35 K | 58.33 | 0.87 | 569.58 M | 99.73 ± 0.16
AGMLT | 837.66 K | 170.65 | 1.67 | 4.91 G | 99.93 ± 0.02
Table 11. Ablation experiments of each component (the √ represents that the current component is used, and the bold represents the optimum accuracy).
SSAGM | L-Former (LS) | L-Former (LTM) | LC-Attention | OA (%) | AA (%) | K × 100
 |  |  |  | 99.67 ± 0.03 | 99.49 ± 0.04 | 99.56 ± 0.04
 |  |  |  | 99.63 ± 0.01 | 99.38 ± 0.02 | 99.50 ± 0.01
 |  |  |  | 99.34 ± 0.08 | 98.87 ± 0.16 | 99.11 ± 0.11
 |  |  |  | 99.55 ± 0.09 | 99.31 ± 0.14 | 99.40 ± 0.11
 |  |  |  | 99.62 ± 0.13 | 99.37 ± 0.22 | 99.49 ± 0.18
 |  |  |  | 99.41 ± 0.04 | 98.95 ± 0.17 | 99.21 ± 0.06
 |  |  |  | 99.68 ± 0.02 | 99.46 ± 0.02 | 99.57 ± 0.02
 |  |  |  | 99.43 ± 0.03 | 98.98 ± 0.13 | 99.24 ± 0.05
 |  |  |  | 99.50 ± 0.09 | 99.12 ± 0.14 | 99.32 ± 0.12
 |  |  |  | 99.46 ± 0.08 | 99.14 ± 0.13 | 99.28 ± 0.11
 |  |  |  | 99.72 ± 0.04 | 99.57 ± 0.07 | 99.62 ± 0.05
Table 12. Different combinations of PDWA and ADWA (the √ represents that the current component is used, and the bold represents the optimum accuracy).
PDWA | ADWA(H) | ADWA(L) | OA (%) | AA (%) | K × 100
 |  |  | 99.57 ± 0.03 | 99.34 ± 0.05 | 99.43 ± 0.04
 |  |  | 99.38 ± 0.06 | 99.04 ± 0.07 | 99.17 ± 0.07
 |  |  | 99.63 ± 0.15 | 99.42 ± 0.24 | 99.50 ± 0.20
 |  |  | 99.36 ± 0.14 | 98.61 ± 0.20 | 99.14 ± 0.18
 |  |  | 99.61 ± 0.05 | 99.40 ± 0.07 | 99.48 ± 0.07
 |  |  | 99.51 ± 0.03 | 99.24 ± 0.05 | 99.34 ± 0.03
 |  |  | 99.72 ± 0.04 | 99.57 ± 0.07 | 99.62 ± 0.05
Table 13. The effect of asymmetric convolution for AGMLT (the bold represents the optimum accuracy).
Setting | OA (%) | AA (%) | K × 100 | Total Params | Flops
No Asymmetric Convolution | 99.62 ± 0.08 | 99.01 ± 0.14 | 99.50 ± 0.10 | 904.71 K | 5.39 G
With Asymmetric Convolution | 99.72 ± 0.04 | 99.57 ± 0.07 | 99.62 ± 0.05 | 837.08 K | 4.91 G
Table 14. Ablation analysis of different inputs (the bold represents the optimum accuracy).
Inputs | TR: OA (%) | TR: AA (%) | TR: K × 100 | MU: OA (%) | MU: AA (%) | MU: K × 100
HSI | 99.32 ± 0.03 | 98.95 ± 0.05 | 99.09 ± 0.04 | 89.33 ± 0.92 | 91.83 ± 1.20 | 86.09 ± 1.19
LiDAR-DSM | 97.81 ± 0.64 | 96.55 ± 1.22 | 97.06 ± 0.87 | 68.11 ± 1.61 | 67.26 ± 5.39 | 59.55 ± 1.87
HSI + LiDAR-DSM | 99.72 ± 0.04 | 99.57 ± 0.07 | 99.62 ± 0.05 | 90.16 ± 1.49 | 92.47 ± 1.33 | 87.14 ± 1.86
Inputs | AU: OA (%) | AU: AA (%) | AU: K × 100 | HU: OA (%) | HU: AA (%) | HU: K × 100
HSI | 97.45 ± 0.19 | 89.17 ± 1.21 | 96.35 ± 0.27 | 99.76 ± 0.05 | 99.80 ± 0.05 | 99.73 ± 0.06
LiDAR-DSM | 95.62 ± 1.07 | 95.62 ± 1.07 | 95.62 ± 1.07 | 95.62 ± 1.07 | 95.62 ± 1.07 | 95.62 ± 1.07
HSI + LiDAR-DSM | 97.80 ± 0.06 | 89.35 ± 0.92 | 96.85 ± 0.08 | 99.93 ± 0.02 | 99.95 ± 0.01 | 99.93 ± 0.02
Table 15. Experimental results using different loss functions (the bold represents the optimum accuracy).
Loss Functions | TR: OA (%) | TR: AA (%) | TR: K × 100 | MU: OA (%) | MU: AA (%) | MU: K × 100
LCE | 99.69 ± 0.05 | 99.49 ± 0.11 | 99.58 ± 0.06 | 89.92 ± 0.77 | 92.84 ± 0.45 | 86.84 ± 0.97
LFC | 99.69 ± 0.09 | 99.54 ± 0.13 | 99.59 ± 0.11 | 90.09 ± 0.29 | 92.09 ± 0.39 | 87.07 ± 0.37
LPC | 99.61 ± 0.05 | 98.99 ± 0.08 | 99.48 ± 0.06 | 89.92 ± 0.40 | 92.47 ± 0.70 | 86.81 ± 0.51
LPF | 99.72 ± 0.04 | 99.57 ± 0.07 | 99.62 ± 0.05 | 90.16 ± 1.49 | 92.47 ± 1.33 | 87.14 ± 1.86
Loss Functions | AU: OA (%) | AU: AA (%) | AU: K × 100 | HU: OA (%) | HU: AA (%) | HU: K × 100
LCE | 97.49 ± 0.27 | 88.34 ± 0.36 | 96.41 ± 0.39 | 99.86 ± 0.05 | 99.89 ± 0.04 | 99.85 ± 0.05
LFC | 97.63 ± 0.28 | 88.26 ± 1.37 | 96.61 ± 0.40 | 99.75 ± 0.05 | 99.79 ± 0.04 | 99.73 ± 0.04
LPC | 97.38 ± 0.25 | 88.42 ± 1.14 | 96.25 ± 0.36 | 99.79 ± 0.05 | 99.75 ± 0.03 | 99.78 ± 0.05
LPF | 97.80 ± 0.06 | 89.35 ± 0.92 | 96.85 ± 0.08 | 99.93 ± 0.02 | 99.95 ± 0.01 | 99.93 ± 0.02
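For reference, the poly loss variants compared in Table 15 can be prototyped as below, following the Poly-1 formulation of PolyLoss (Leng et al., ICLR 2022): cross-entropy plus epsilon times (1 - p_t), and focal loss plus epsilon times (1 - p_t)^(gamma + 1). Reading LPC as the poly cross-entropy and LPF as the poly focal loss is our interpretation of the abbreviations, and the hyperparameter values shown are illustrative rather than the ones used in the paper.

import torch
import torch.nn.functional as F

def poly1_cross_entropy(logits, targets, epsilon=1.0):
    # Poly-1 cross-entropy (sketch): CE plus the leading polynomial term epsilon * (1 - p_t).
    ce = F.cross_entropy(logits, targets, reduction='none')
    pt = torch.gather(F.softmax(logits, dim=-1), 1, targets.unsqueeze(1)).squeeze(1)
    return (ce + epsilon * (1.0 - pt)).mean()

def poly1_focal_loss(logits, targets, epsilon=1.0, gamma=2.0):
    # Poly-1 focal loss (sketch): focal loss plus epsilon * (1 - p_t) ** (gamma + 1).
    ce = F.cross_entropy(logits, targets, reduction='none')
    pt = torch.gather(F.softmax(logits, dim=-1), 1, targets.unsqueeze(1)).squeeze(1)
    focal = (1.0 - pt) ** gamma * ce
    return (focal + epsilon * (1.0 - pt) ** (gamma + 1)).mean()

logits, labels = torch.randn(8, 6), torch.randint(0, 6, (8,))
print(poly1_cross_entropy(logits, labels).item(), poly1_focal_loss(logits, labels).item())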