3.1. Invariant Attribute Consistency Fusion
As shown in
Figure 2, the Invariant Attribute Consistency Fusion includes two parts: Invariant Attribute Extraction (IAE) and Spatial Consistency Fusion (SCF). Extracting invariant attribute features can counteract local semantic changes caused by pixel rotation and movement or local regional composition changes. We utilize the Invariant Attribute Profiles (IAPs) [
29] for feature extraction to enhance the diversity of features and model the invariant behaviors in multimodal data. This approach generates robust feature extraction for various semantic changes in multimodal remote sensing data. Isotropic filtering [
30] is a well-known and powerful tool in image processing that can robustly tackle rotational or shift variations in image patches and effectively eliminates other variabilities such as salt-and-pepper noise and the absence of local information. Hence, the multimodal remote sensing images are filtered using isotropic filters to obtain spatial invariant features, which can be expressed as follows:
$$F_m^{iso} = f_{iso}(X_m) = X_m * G, \quad m \in \{H, L\} \tag{1}$$
where $H$ represents the HSI and $L$ represents the LiDAR image. $X_m$ is the input remote sensing image, and $f_{iso}(\cdot)$ represents isotropic filtering, achieved by convolving $X_m$ with the isotropic kernel $G$, thereby extracting spatially invariant features from local space. Moreover, the robustness of the features is further enhanced through the utilization of super-pixel segmentation methods [31]. These methods prioritize the spatial invariance of the features by taking into account object semantics, including edges, shapes, and their inherent invariance, which can be expressed by
$$F_m^{sp} = f_{sp}\left(F_m^{iso}\right) \tag{2}$$
where $f_{sp}(\cdot)$ represents the segmentation of super pixels. The final attribute features for the HSI and LiDAR images can be obtained by
$$F_m^{SIF} = \mathcal{C}\left(F_m^{iso}, F_m^{sp}\right) \tag{3}$$
where $\mathcal{C}(\cdot)$ is the channel concatenation operation. To achieve invariance to translation and rotation in the frequency domain, we construct a continuous histogram of oriented gradients in Fourier polar coordinates. By utilizing the Fourier-based continuous Histogram of Oriented Gradients (HOG) [32], we ensure invariant feature extraction in polar coordinates. This approach accurately captures rotation behaviors at any angle. Therefore, by mapping the translation or rotation of image blocks in Fourier polar coordinates, discrete attribute features are transformed into continuous contours. Consequently, we obtain Frequency Invariant Features (FIF) in the frequency domain.
By utilizing the extracted spatially invariant features, $F_H^{SIF}$ and $F_L^{SIF}$, along with the frequency invariant features, $F_H^{FIF}$ and $F_L^{FIF}$, we obtain the joint invariant attribute features, denoted as $F_m^{IA}$:
$$F_m^{IA} = \mathcal{C}\left(F_m^{SIF}, F_m^{FIF}\right), \quad m \in \{H, L\} \tag{4}$$
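As a concrete illustration of the spatial branch of IAE, the following minimal sketch applies isotropic (Gaussian) filtering and superpixel averaging, then concatenates the two representations along the channel axis. The helper name `spatial_invariant_features` and the parameters `sigma` and `n_segments` are illustrative assumptions; the full IAP pipeline of [29], including the Fourier HOG branch, is not reproduced here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.segmentation import slic

def spatial_invariant_features(image, sigma=2.0, n_segments=200):
    """Sketch of the spatial branch of IAE for an (H, W, C) image (HSI, or LiDAR with C=1)."""
    # Isotropic filtering: convolve every band with an isotropic (Gaussian) kernel.
    iso = np.stack([gaussian_filter(image[..., c], sigma=sigma)
                    for c in range(image.shape[-1])], axis=-1)

    # Superpixel segmentation on the filtered image; each pixel receives the
    # mean feature of its segment, enforcing object-level spatial invariance.
    labels = slic(iso, n_segments=n_segments, channel_axis=-1, start_label=0)
    sp = np.zeros_like(iso)
    for lab in np.unique(labels):
        mask = labels == lab
        sp[mask] = iso[mask].mean(axis=0)

    # Channel concatenation of the two spatially invariant representations.
    return np.concatenate([iso, sp], axis=-1)
```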
Spatial Consistency Fusion is designed to enhance the consistency of similar features in the observed area’s terrain feature information. We employ the Generalized Graph-Based Fusion (GGF) method [
33] to jointly extract the consistent feature information of the different modalities' invariant attributes:
$$F_{fus} = W^{T} X_{sta}, \quad X_{sta} = \mathcal{C}\left(X_H, F_H^{IA}, F_L^{IA}\right) \tag{5}$$
where $X_{sta}$ is the stacked feature matrix and $X_H$, $F_H^{IA}$, and $F_L^{IA}$ represent the HSI, the invariant features of the HSI, and the invariant features of the LiDAR, respectively. The HSI is specifically used to capture the consistency information in the spectral dimension. $F_{fus}$ is the fusion result. $W$ denotes the transformation matrix used to reduce the dimensionality of the feature maps, fuse the feature information, preserve local neighborhood information, and detect manifolds embedded in a high-dimensional feature space.
Initially, a graph structure is constructed to describe the correlation between spatial sample points and obtain the edge consistency information of the graph structure for different features:
$$E_{fus} = E_{X_H} \odot E_{F_H} \odot E_{F_L} \tag{6}$$
where $E_{X_H}$, $E_{F_H}$, and $E_{F_L}$ are defined as the edges of the graph structures $G_{X_H}$, $G_{F_H}$, and $G_{F_L}$, respectively, which describe the connections between any two points in the spatial domain. They are obtained through the k-nearest neighbors (k-NN) method. When the distance between two sample points is large (weak correlation), $E_{ij} = 0$. When two sample points $i$ and $j$ are close in distance (strong correlation), $E_{ij} = 1$. $\odot$ is an XNOR operation used to obtain the edge consistency information from these three edge matrices. The likelihood of a data point having similar features to its nearest neighbor is greater than with those points that are far away. Therefore, it is necessary to add a distance constraint when calculating graph edges. This can be defined as
$$A = E_{fus} \wedge \neg \left( D > d_{max} \right) \tag{7}$$
where $D$ refers to the matrix of pairwise distances between the individual data points of $X_{sta}$. The operator "$\neg$" denotes logical negation. When an element in $E_{fus}$ is 0, this indicates that the corresponding edges in $E_{X_H}$, $E_{F_H}$, and $E_{F_L}$ are not consistent. We impose a constraint on the maximum distance $d_{max}$ in $D$, which helps to reduce the correlation between pairs of vertices in the graph structure. Then, the diagonal degree matrix $D_A$ is computed based on $A$. Subsequently, the Laplacian matrix $L$ is obtained through this process:
$$L = D_A - A \tag{8}$$
By combining the known feature information $X_{sta}$, the Laplacian matrix $L$, and the diagonal matrix $D_A$, we can use the following generalized eigenvalue calculation formula to obtain different eigenvalues $\lambda_i$ and their corresponding eigenvectors $w_i$:
$$X_{sta} L X_{sta}^{T} w_i = \lambda_i X_{sta} D_A X_{sta}^{T} w_i \tag{9}$$
where $X_{sta}^{T}$ denotes the transpose matrix of $X_{sta}$, $\lambda_i$ represents the eigenvalue, and $i = 1, 2, \ldots, r$ with $r$ indicating the number of eigenvalues. Since each eigenvector has its own unique eigenvalue, we can obtain the eigenpairs $\{(\lambda_i, w_i)\}_{i=1}^{r}$. Finally, based on all the eigenvectors, we can obtain the desired transformation matrix $W$:
$$W = \left[w_1, w_2, \ldots, w_r\right] \tag{10}$$
where $w_i$ represents the eigenvector corresponding to the $i$-th eigenvalue. The fusion result is finally obtained using Equation (5) with $W$.
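A minimal sketch of this Spatial Consistency Fusion step is given below, assuming a modest number of samples so that dense N×N matrices fit in memory. The function name `ggf_fusion`, the parameters `k`, `d_max`, and `r`, and the reading of the XNOR consistency as agreement among all three k-NN graphs are our assumptions for illustration; the code arranges samples as rows, so the projection is applied from the right.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist
from sklearn.neighbors import kneighbors_graph

def ggf_fusion(X_h, F_h, F_l, k=20, d_max=1.0, r=30):
    """Sketch of generalized graph-based fusion of HSI pixels (X_h), HSI invariant
    attributes (F_h), and LiDAR invariant attributes (F_l), each shaped (N, d_*)."""
    X_sta = np.concatenate([X_h, F_h, F_l], axis=1)          # stacked features (N, d)

    def knn_edges(X):
        # Symmetric binary k-NN connectivity graph: 1 = strong correlation.
        E = kneighbors_graph(X, k, mode='connectivity').toarray().astype(bool)
        return E | E.T

    # Edge consistency: keep an edge only where all three graphs agree.
    E_fus = knn_edges(X_h) & knn_edges(F_h) & knn_edges(F_l)

    # Distance constraint on the stacked features suppresses far-apart pairs.
    A = E_fus & (cdist(X_sta, X_sta) <= d_max)

    D_A = np.diag(A.sum(axis=1).astype(float))               # diagonal degree matrix
    L = D_A - A.astype(float)                                 # graph Laplacian

    # Generalized eigenvalue problem  X^T L X w = lambda X^T D_A X w.
    M1 = X_sta.T @ L @ X_sta
    M2 = X_sta.T @ D_A @ X_sta + 1e-6 * np.eye(X_sta.shape[1])  # ridge for stability
    eigvals, eigvecs = eigh(M1, M2)
    W = eigvecs[:, :r]                                        # smallest-eigenvalue directions
    return X_sta @ W                                          # fused low-dimensional features
```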
3.2. Bi-Branch Joint Classification
The GCN and CNN are architectural designs used to extract distinct representations of salient information from multimodal remote sensing images. The CNN specializes in capturing intricate local spatial features, while the GCN excels at extracting abundant global spectral feature information from multimodal remote sensing images by utilizing spectral vectors as input. Additionally, the GCN can simulate the topological relationships between samples in graph-structured data. We design a bi-branch joint classification combining the advantages of the GCN and CNN to offer feature diversity.
Traditional GCNs effectively model the relationships between samples to simulate long-range spatial relationships in remote sensing images. However, inputting all samples into the network at once leads to significant memory overhead. To address these issues, the Mini Graph Convolutional Network (MiniGCN) [
13] is introduced to find robust locally optimal feature information by utilizing a sampler for small-sample sampling, dividing the original input graph-structured data into multiple subgraphs. The graph-structured multimodal fused image data are input into the MiniGCN in a matrix form for training. During the training process, the input data are processed and features are extracted and outputted in mini batches. The classification output can be represented by the following equation:
$$H^{(l+1)} = \sigma\!\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) \tag{11}$$
where $\hat{A} = A + I$ is the modified adjacency matrix after adding a unit matrix $I$ and an adjacency matrix $A$ of the spatial-frequency-invariant attribute features $F_{fus}$. $W^{(l)}$ represents the weight of the $l$-th layer in the graph convolutional network. $\hat{D}$ denotes the diagonal matrix of $\hat{A}$. $\sigma(\cdot)$ represents the ReLU non-linear activation function. $H^{(l)}$ represents the feature output of the $l$-th layer in the graph convolutional network during the feature extraction process. When $l = 0$, $H^{(0)}$ corresponds to the original input features $F_{fus}$. $H^{(L)}$ represents the feature output of the $L$-th (final) layer in the graph convolutional network, which serves as the final output spectral features $F_{GCN}$.
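A minimal PyTorch sketch of the propagation rule in Equation (11) is shown below, using a dense adjacency matrix and a single layer. MiniGCN's sampler and subgraph mini-batching are not reproduced; the class name `GraphConvLayer` is illustrative.

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """One GCN propagation step: H_next = ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, H, A):
        A_hat = A + torch.eye(A.size(0), device=A.device)   # add self-loops
        d = A_hat.sum(dim=1)                                  # node degrees
        D_inv_sqrt = torch.diag(d.pow(-0.5))                  # \hat{D}^{-1/2}
        A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt              # symmetric normalization
        return torch.relu(A_norm @ self.weight(H))            # propagate + ReLU
```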
In addition, we utilize a simple CNN structure [
34], which can be defined as
$$C^{(l+1)} = f_{conv}\!\left(C^{(l)}\right) \tag{12}$$
where the base structure $f_{conv}(\cdot)$ includes the convolutional layer, batch normalization layer, max-pooling layer, and ReLU layer. When $l = 0$, $C^{(0)}$ corresponds to the original input features $F_{fus}$. We use adaptive coefficients to combine the detection results of the two networks, which can be represented as
$$Y = f_{cls}\!\left(\alpha F_{GCN} + \beta F_{CNN}\right) \tag{13}$$
where $f_{cls}(\cdot)$ represents the classification head function, while $F_{GCN}$ and $F_{CNN}$ refer to the features extracted by the GCN and the CNN, respectively. The $\alpha$ and $\beta$ are learnable parameters of the network to balance the weight of the bi-branch results.
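A short sketch of the adaptive combination in Equation (13) is given below, assuming both branches produce features of the same dimension. The class name `BiBranchHead` and the initial values of the learnable coefficients are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BiBranchHead(nn.Module):
    """Learnable scalars weight the GCN and CNN features before a shared classifier."""
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))   # weight of the GCN branch
        self.beta = nn.Parameter(torch.tensor(0.5))    # weight of the CNN branch
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, f_gcn, f_cnn):
        fused = self.alpha * f_gcn + self.beta * f_cnn  # adaptive combination
        return self.classifier(fused)                   # classification head output
```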
3.3. Binary Quantization
We introduce the Libra Parameter Binarization (Libra-PB) technique [
35], which incorporates both quantization error and information loss. During forward propagation, the full-precision weights are initially adjusted by computing the difference between the weight and the weights’ mean. This adjustment aims to distribute the quantized values uniformly and normalize the weight, thereby enhancing training stability and mitigating any negative effects caused by weight magnitude. The resultant standardized balanced weight, denoted as
$w_{std}$, can be obtained through the following operations:
$$w_{std} = \frac{w - \mu(w)}{\sigma\!\left(w - \mu(w)\right)} \tag{14}$$
In the above equation, $\sigma(\cdot)$ represents the standard deviation, while $\mu(w)$ is the mean of the weights. Generally, the quantization of weights and activations can be defined as
$$Q_w = \mathrm{sign}\left(w_{std}\right) \cdot 2^{s}, \qquad Q_a = \mathrm{sign}(a) \tag{15}$$
where $w$ and $a$ represent the floating-point parameters of weights and activations. The $\mathrm{sign}(\cdot)$ function is commonly employed to obtain binary values. $s$ is an integer parameter employed to enhance the representation capability of binary weights:
$$s = \mathrm{round}\!\left(\log_2 \frac{\left\|w_{std}\right\|_1}{n}\right) \tag{16}$$
Here, $n$ represents the dimension of the vector and $\|\cdot\|_1$ denotes its L1 norm. The main operations during the forward propagation of the binary network, involving quantized weights $Q_w$ and activations $Q_a$, can be expressed as
$$z = Q_a \circledast Q_w \tag{17}$$
where $\circledast$ denotes the bitwise XNOR and bit-count operations that replace floating-point multiply–accumulate operations.
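A minimal PyTorch sketch of the forward quantization in Equations (14)–(16) follows, written in line with the Libra-PB scheme of IR-Net [35]; the small epsilon for numerical stability is our addition, and the bit shift is realized here as multiplication by a power of two.

```python
import torch

def libra_pb_quantize_weight(w):
    """Balance and standardize the weights, binarize with sign(), and scale by 2^s."""
    w_bal = w - w.mean()                               # balance: subtract the mean
    w_std = w_bal / (w_bal.std() + 1e-8)               # standardize for stable training
    s = torch.round(torch.log2(w_std.abs().mean()))    # integer shift: round(log2(||w||_1 / n))
    return torch.sign(w_std) * (2.0 ** s)              # binary values scaled by 2^s

def quantize_activation(a):
    return torch.sign(a)                               # 1-bit activations
```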
During backward propagation, due to the discontinuity introduced by binarization, gradient approximation becomes necessary, which can be expressed as
$$\frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial Q_w} \frac{\partial Q_w}{\partial w} \approx \frac{\partial \mathcal{L}}{\partial Q_w}\, g'(w) \tag{18}$$
where $\mathcal{L}$ is the loss function, $g(\cdot)$ corresponds to the approximation of the $\mathrm{sign}(\cdot)$ function, and $g'(\cdot)$ represents its derivative. In our paper, we use the following approximation function:
$$g(x) = k \tanh(t x) \tag{19}$$
where $k$ and $t$ are shape-control parameters.
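The sketch below shows how such a surrogate gradient can be wired into PyTorch autograd: the forward pass stays binary, while the backward pass uses the derivative of $g(x) = k\tanh(tx)$. Fixing $k$ and $t$ as constants is an illustrative simplification (IR-Net [35] adjusts its approximation during training).

```python
import torch

class ApproxSign(torch.autograd.Function):
    """Binary forward pass with a smooth tanh-based surrogate gradient."""
    k, t = 1.0, 1.0  # illustrative shape parameters

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # d/dx [k * tanh(t x)] = k * t * (1 - tanh(t x)^2)
        surrogate = ApproxSign.k * ApproxSign.t * (1 - torch.tanh(ApproxSign.t * x) ** 2)
        return grad_output * surrogate
```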
3.5. Experimental Setup
Data Description: (1) Houston2013 Data: Experiments were carried out using HyperSpectral Imaging (HSI) and Digital Surface Model (DSM) data that were obtained in June 2012 over the University of Houston campus and the adjacent urban area. The HSI data consisted of 144 spectral bands and covered a wavelength range from 380 to 1050 nm, with a spatial resolution of 2.5 m that was consistent with the DSM data. The entire dataset covered an area of
349 × 1905 pixels and included 15 classes of natural and artificial objects, which were determined through photo interpretation by the DFTC. The LiDAR data were collected at an average sensor height of 2000 feet, while the HSI was collected at an average height of 5500 feet. The scene contained various natural objects such as water, soil, trees, and grass, as well as artificial objects such as parking lots, railways, highways, and roads. The land-cover classes and their respective counts in the training and testing samples are provided in
Table 1.
(2) Trento Data: This dataset comprises 1 HSI with 63 spectral bands and 1 set of LiDAR data, captured in a rural area located in southern Trento, Italy. The HSI was obtained through the AISA Eagle sensor, while the corresponding LiDAR data were collected using the Optech Airborne Laser Terrain Mapper (ALTM) 3100EA sensor. Both datasets were of size 166 × 600 pixels with a spatial resolution of 1 m, while the wavelength range of the HSI was from 0.42 to 0.99 µm. This particular dataset consisted of a total of 30,214 ground-truth samples, with research conducted on 6 distinguishable category labels. The land-cover classes and their respective counts in the training and testing samples are provided in
Table 1.
Evaluation Metrics: To comprehensively evaluate the performance of multimodal remote sensing image classification algorithms, this article analyzes and compares various algorithms based on their classification prediction maps and accuracy. Because the classification prediction map is subject to a certain degree of subjectivity and may not accurately measure the impact of an algorithm on classification performance, this study employs quantitative evaluation metrics such as overall accuracy (OA), average accuracy (AA), and the Kappa coefficient to better measure and compare the performance of different algorithms. A higher value of any of these three indicators represents higher classification accuracy and an overall better performance of the algorithm. Among these metrics, Overall Accuracy (OA) refers to the ratio of correctly classified test samples to the total number of test samples. Average Accuracy (AA) is the mean of the per-class accuracies, i.e., the average over all categories of the ratio of correctly classified test samples to the total number of test samples within each category.
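For reference, the three accuracy metrics can be computed from a confusion matrix as in the sketch below; the helper name `classification_metrics` is ours, and the per-class accuracy is taken as the class-wise recall.

```python
import numpy as np

def classification_metrics(y_true, y_pred, num_classes):
    """Compute OA, AA, and the Kappa coefficient from integer class labels."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                                            # confusion matrix
    total = cm.sum()
    oa = np.trace(cm) / total                                    # overall accuracy
    per_class = np.diag(cm) / cm.sum(axis=1).clip(min=1)         # class-wise accuracy
    aa = per_class.mean()                                        # average accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2    # chance agreement
    kappa = (oa - pe) / (1 - pe)                                 # Cohen's kappa
    return oa, aa, kappa
```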
Furthermore, we employ the Bit-Operations (BOPs) count [
36] and parameters [
37] as metrics to evaluate the compression performance.
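As a rough illustration only (not the exact accounting of [36]), the BOPs of a convolutional layer can be estimated from its multiply–accumulate count and the bit widths of weights and activations, which is why binarizing weights and activations shrinks the cost so sharply:

```python
def conv2d_bops(c_in, c_out, k, h_out, w_out, bits_w=1, bits_a=1):
    """Approximate Bit-Operations of one conv layer: MACs times the bit widths.
    The exact BOPs definition in [36] includes additional accumulator terms."""
    macs = c_out * h_out * w_out * c_in * k * k   # multiply-accumulate operations
    return macs * bits_w * bits_a
```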
Implementation Details: Our proposed approach is implemented in Python and trained on a single RTX 3090 card. All the networks analyzed in this paper are implemented using the PyTorch framework. Throughout this procedure, we configure the batch size to 32, employ the Adam optimizer with an initial learning rate of 0.001, and conduct the procedure over a total of 200 epochs. The learning rate is adapted using an exponential schedule, in which it is multiplied by a fixed decay factor every 50 epochs. Furthermore, we apply weight regularization employing the L2 norm to stabilize network training and mitigate overfitting.
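A hypothetical PyTorch configuration mirroring these reported hyperparameters might look as follows; the placeholder model, the decay factor `gamma`, and the weight-decay value are illustrative assumptions, since the text does not state them.

```python
import torch

model = torch.nn.Linear(64, 15)   # placeholder for the IABC network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)  # L2 regularization
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)

for epoch in range(200):
    # ... one training pass over mini-batches of size 32 ...
    scheduler.step()              # decay the learning rate every 50 epochs
```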
3.6. Ablation Study
An ablation study is conducted to demonstrate the validity of the proposed components by evaluating several variants of the IABC on HSI and LiDAR datasets.
Invariant Attribute Consistency Fusion: In
Table 2, we discuss the impact of using IACF (IAE structure and SCF module) on CNN and GNN networks in remote sensing image classification tasks and provide a comparison between multimodal and single-modal HSI and LiDAR data. The Houston2013 dataset is used for evaluation. Firstly, the experimental results for HSI data show that both GCN and CNN networks achieve a certain level of accuracy in classification but differ in precision. The introduction of the IAE structure improves classification performance, increasing OA and AA from 79.04% and 81.15% to 91.15% and 91.78%, respectively. This indicates the effectiveness of the IAE structure in improving the accuracy of remote sensing image classification. Secondly, the experimental results for LiDAR data demonstrate a lower classification accuracy when using GCN or CNN networks alone. However, the introduction of the IAE structure significantly improves classification performance. For example, OA increases from 22.74% to 35.46% for the GCN network on the LiDAR dataset. This confirms the effectiveness of the IAE structure in processing LiDAR data. Lastly, fusion experiments are conducted with HSI and LiDAR data. The results show that fusing HSI and LiDAR data further improves classification performance. In particular, when combining the IAE structure and SCF structure on the CNN network, the Overall Accuracy (OA) performance increases from 89.46% (with only the IAE structure) to 91.88%, resulting in a significant improvement of 2.43%.
Similarly, on the Trento dataset, similar conclusions were obtained, as shown in
Table 3. In the case of HSI data, when only GCN or CNN was used, the Overall Accuracy (OA) was 83.96% and 96.06%, respectively. However, when the IAE structure was introduced for invariant feature extraction, the OA accuracy improved to 95.34% (an increase of 11.38%) and 96.93% (an increase of 0.87%) for GCN and CNN, respectively. This indicates that the extraction of spatially invariant attributes can reduce the heterogeneity in extracting pixel features of the same class by CNN and GNN networks, enhancing the discriminative ability for the same class. Moreover, the extraction of invariant attributes has a more significant effect on improving the classification accuracy of the GCN network. When classifying LiDAR data, due to the characteristics of LiDAR data, the performance is relatively low, with only the GCN network achieving an OA of 48.31%. Introducing IAE can improve the GCN network OA by 11.94%. However, introducing IAE to the CNN network instead results in a decrease in classification performance from 90.81% to 68.81%. This might be due to the large size of areas with the same class in the Trento dataset, resulting in minimal elevation changes in the LiDAR images over a considerable area, leading to similar invariant attributes for different classes and interfering with the CNN network’s ability to extract and discriminate local information. This situation can be alleviated by using multimodal data (HSI + LiDAR) for classification. Considering the information from both the HSI and LiDAR, better performance can be observed. The highest classification accuracy (OA 98.05%) was attained when CNN introduced the IAE structure and SCF module. This further demonstrates that SCF can enhance the classification accuracy of the CNN network.
In summary, these experiments prove that the introduction of the IAE structure significantly improves the classification performance of CNN and GNN networks in remote sensing image classification tasks. Additionally, SCF enhances the classification performance of the CNN network. Furthermore, the fusion of multimodal data can further improve classification accuracy.
Bi-Branch Joint Classification: To analyze the performance of the bi-branch joint network for classification, we compare the different networks in the two datasets in
Table 4. The GCNs and CNNs demonstrate different advantages on different datasets. The GCN obtained a better classification performance at an OA of 92.60% compared with the CNN at an OA of 91.88% on the Houston dataset. However, the results show that the CNN achieved a high OA (98.05%), while the GCN obtained a lower OA (97.66%). This indicates that when dealing with sparsely categorized images, such as Trento, extracting local spatial information is useful. However, when dealing with densely distributed categories, such as Houston, extracting global spectral features has greater potential. In contrast, the joint approach yielded the top-performing classification outcomes on the Houston dataset, with an OA of 92.78%, and on the Trento dataset, with an OA of 98.14%. The experimental results demonstrate that using the bi-branch joint network can combine the advantages of CNN and GCN networks, resulting in excellent classification performance in land classification tasks using remote sensing images.
Binary Quantization: With the application of binary quantization, we can effectively address resource limitations and enable real-time decision-making capabilities in the context of processing large-scale data in remote sensing applications. To analyze the performance differences, we conducted a comparative study on classification accuracy and computational resources using different quantization strategies on the IABC network. In
Table 5, 32w and 32a denote the full precision of the weight and activation, while 1w and 1a represent the binary quantization of the weight and activation. The binary quantization module achieved OA accuracies of 98.14%, 98.16%, 85.33%, and 83.44% at the different computational levels. Notably, the difference in OA accuracy between the 1w32a quantization level and the full-precision network is relatively small. Additionally, for the CNN network at the 1w32a quantization level, the parameter count is 32.675 KB, which accounts for only 3% of the parameter count of the full-precision network. Likewise, the BOPs are approximately 3% of the BOPs in the full-precision network. As the number of quantization bits decreases, there is a corresponding decrease in the accuracy of the classification model. This decrease can be attributed to the reduction in model parameters, which leads to the loss of crucial layer information. It is observed that the binary quantization of the activations has a significantly negative influence on the classification accuracy: the OA decreases by 12.81% compared with the full-precision network and by 12.83% compared with weight-only quantization (1w32a). In particular, when using the 1w1a network exclusively, the impact is notably significant, with a resulting accuracy reduction of 14.7% compared to the full-precision network. Hence, we only consider the binary quantization of the weights (1w32a) in our experiments.
3.7. Quantitative Comparison with the State-of-the-Art Classification Network
To validate the effectiveness of the proposed IABC, we compare the experimental results of the IABC on both HSI and LiDAR datasets with those of other competitive classifiers: MDL_RS_FC [
34], EndNet [
14], RNN [
38], CALC [
39], ViT [
17], MFT [
18], HCT [
40], and Exvit [
19]. The parameters of all compared algorithms are optimized on the same server. Additionally, the training and testing samples are identical for all methods, and our proposed IABC network does not use any data augmentation operations, ensuring a fair comparison.
Table 6 and
Table 7 show the classification outcomes of the various networks on the two datasets. The most favorable results are emphasized in bold. The superiority of the proposed IABC over the other methods is evident. For instance, on the Houston2013 dataset, IABC exhibits substantial improvements in Overall Accuracy (OA) over the various approaches: MDL_RS_FC by approximately 7.67%, Endnet by 7.6%, RNN by 20.32%, CALC by 2.84%, ViT by 7.58%, MFT by 3.03%, HCT by 2.21%, and Exvit by 1.36%. RNN performs the worst, with only 72.31% OA. The MDL_RS_FC method achieves better performance, with an Overall Accuracy (OA) of 84.96%, thanks to its meticulously designed cross-fusion approach, which facilitates a more effective interaction of information during fusion and results in improved performance. The conventional classifier EndNet works by leveraging deep neural networks to enhance the extraction of spectral and spatial features. The CALC method ranks third with an OA of 89.79%, fully harnessing and extracting semantic information as well as complementary information. The transformer-based methods ViT and MFT, with their strong feature expression ability in high-level sequential abstract representation, achieve higher accuracy than the traditional deep learning networks (such as Endnet and MDL_RS_FC). In contrast, our IABC method obtains the optimal performance by using spatial–spectral CNN and relation-enhanced GCN features with invariant attribute enhancement. For the Trento dataset, IABC provides approximately 10.36%, 7.54%, 2.20%, 0.52%, 2.16%, 0.31%, 10.41%, and 0.05% OA improvements over MDL_RS_FC, Endnet, RNN, CALC, ViT, MFT, HCT, and Exvit, respectively. The performance of the RNN network on the Trento dataset is noticeably better than its result on the Houston2013 dataset, while the MDL_RS_FC method performs worse on the Trento dataset; this suggests that the generalization performance of these two methods is comparatively poor. The performance of the other algorithms is consistent with their performance on the Houston2013 dataset.
Figure 3 and
Figure 4 illustrate a range of visual results, including hyperspectral false-color data, LiDAR images, ground truths, and classification maps acquired using the various methods. Each category is accompanied by its respective color scheme. Upon thorough evaluation and comparison, it is clear that the proposed method yields superior results with significantly reduced noise compared to the alternative approaches. Deep learning models excel in capturing the nonlinear relationship between input and output features, thanks to their remarkable ability to extract learnable features. Hence, all the methods generate relatively smooth classification maps, effectively distinguishing between different land-use and land-cover classes. Notably, ViT, MFT, HCT, and Exvit demonstrate their efficacy in classification by extracting high-level sequential abstract representations from images; consequently, their classification maps exhibit better visual quality compared to the fully connected, CNN, and RNN networks. By enhancing neighboring spatial–spectral information and facilitating the effective transmission of relation-augmented information across layers, the proposed IABC method achieves highly desirable classification maps, particularly in terms of texture and edge details, surpassing CALC, ViT, MFT, HCT, and Exvit.