Multimodal Semantic Collaborative Classification for Hyperspectral Images and LiDAR Data

Wang, Aili; Dai, Shiyu; Wu, Haibin; Iwahori, Yuji

doi:10.3390/rs16163082

Open AccessArticle

Multimodal Semantic Collaborative Classification for Hyperspectral Images and LiDAR Data

¹

Heilongjiang Province Key Laboratory of Laser Spectroscopy Technology and Application, Harbin University of Science and Technology, Harbin 150080, China

²

Computer Science, Chubu University, Kasugai 487-8501, Japan

^*

Author to whom correspondence should be addressed.

Remote Sens. 2024, 16(16), 3082; https://doi.org/10.3390/rs16163082

Submission received: 18 July 2024 / Revised: 16 August 2024 / Accepted: 19 August 2024 / Published: 21 August 2024

(This article belongs to the Special Issue Recent Advances in the Processing of Hyperspectral Images)

Download

Browse Figures

Versions Notes

Abstract

Although the collaborative use of hyperspectral images (HSIs) and LiDAR data in land cover classification tasks has demonstrated significant importance and potential, several challenges remain. Notably, the heterogeneity in cross-modal information integration presents a major obstacle. Furthermore, most existing research relies heavily on category names, neglecting the rich contextual information from language descriptions. Visual-language pretraining (VLP) has achieved notable success in image recognition within natural domains by using multimodal information to enhance training efficiency and effectiveness. VLP has also shown great potential for land cover classification in remote sensing. This paper introduces a dual-sensor multimodal semantic collaborative classification network (DSMSC²N). It uses large language models (LLMs) in an instruction-driven manner to generate land cover category descriptions enriched with domain-specific knowledge in remote sensing. This approach aims to guide the model to accurately focus on and extract key features. Simultaneously, we integrate and optimize the complementary relationship between HSI and LiDAR data, enhancing the separability of land cover categories and improving classification accuracy. We conduct comprehensive experiments on benchmark datasets like Houston 2013, Trento, and MUUFL Gulfport, validating DSMSC²N’s effectiveness compared to various baseline methods.

Keywords:

visual-language pretraining (VLP); large language models (LLMs); hyperspectral images; LiDAR

1. Introduction

Remote sensing image classification plays an important role in land cover monitoring, urban planning, forest management, and other fields [1]. Among these, HSI stands out due to its unique advantages. HSI consists of spectra from hundreds of contiguous bands in the same area, offering the unique advantages of high spectral resolution and spectral–spatial integration. Its rich spectral information can be used to identify the composition and intrinsic structure of different land cover materials [2,3,4]. In recent years, deep learning models, such as convolutional neural network(CNN) [5], 3DCNN [6], recurrent neural network (RNN) [7], and transformer [8], have automatically learned complex feature representations in images, providing new approaches for hyperspectral image classification [9].

Recent advances in Earth observation technologies have driven the widespread adoption of multimodal remote sensing data processing. This progress has accelerated the research in semantic segmentation and multimodal data matching [10,11,12], significantly improving the accuracy of remote sensing image classification. Beyond HSI, remote sensing techniques can leverage different types of sensors to simultaneously and synergistically depict terrestrial objects in the same area and time frame [13,14,15,16]. For instance, LiDAR can generate digital surface models (DSMs) that capture the three-dimensional morphology of ground targets. The variations in elevation features derived from these models are widely used in classification tasks [17].

HSI provides not only rich spectral information but also valuable spatial data due to its unique data cube structure. Along the spectral axis, HSI displays a precise distribution of bands. Across the two spatial dimensions, it captures the geometric layout of the scene, creating a structure that integrates both the spectral and spatial information. This structure allows HSI to capture the spatial arrangement and texture of land covers, crucial for distinguishing between types with similar spectral characteristics but different spatial patterns.

When HSI’s spatial information is combined with the high-precision elevation data from LiDAR, the complementary strengths of these two data sources become clear. In complex geographic environments, integrating HSI’s spatial details with LiDAR’s vertical structural data significantly improves the accuracy of land cover classification and identification. This integration not only enhances the separability of the land cover categories but also offers a more comprehensive and detailed description of surface features, greatly improving the performance and reliability of classification tasks.

By deeply integrating the HSI and LiDAR data, we fully harness HSI’s strengths in spectral analysis and spatial positioning while effectively incorporating LiDAR’s expertise in terrain height and spatial detail. This integrated analytical strategy creates a more comprehensive and fine-grained multidimensional representation of land cover properties, significantly enhancing classification accuracy and deepening our understanding of the essential characteristics of land covers.

Despite their complementary strengths, HSI images and LiDAR data represent two distinct remote sensing modalities, exhibiting apparent heterogeneity [18,19]. How to extract richer feature representations to fully exploit the complementary information between the two data sources is one of the key challenges currently limiting the in-depth collaboration of heterogeneous remote sensing data [20].

Numerous studies have explored the comprehensive exploitation of complementary information between HSI and LiDAR. With the rapid development of deep learning techniques, novel network architectures such as ConvNeXt [21,22], InceptionNeXt [23], Vision Transformer(ViT) [24], Dual Attention Vision Transformers(DaViT) [25], and SpectFormer [26] have also been introduced to the collaborative classification task of multi-source remote sensing image data, achieving significant progress. For example, satisfactory collaborative classification results have been achieved in IP-CNN [27], hierarchical CNN and transformer [28], and CNN-transformer [29]. Mainstream multimodal models include dual encoder and fusion encoder, and many research efforts have focused on how to improve the collaborative integration of multimodal models. For example, two-branch CNN [30], EndNet [31], Hypermlp [32], Ms2canet [33], HFSNet [34], and so on. These advances highlight the potential for further improvements in the integration of multi-modal remote sensing data, driving innovation and enhancing the accuracy and reliability of land cover classification models.

Recently, visual-language pretraining (VLP) [35], contrastive language-image pre-training (CLIP) [36], bootstrapping language-image pre-training (BLIP) [37,38], and InstructBLIP [39], have achieved significant success in computer vision and natural language processing, dominating the downstream tasks such as image classification. This success demonstrates the potential of cross-modal learning in understanding and representing complex information. It further suggests that image-text joint pre-training strategies, such as RemoteCLIP [40], could greatly improve models’ ability to grasp the deep semantic content of remote sensing images. This is because remote sensing images are rich in visual information and often include detailed textual descriptions.

Despite the significant advances in classification performance achieved by large-scale image-text pre-training techniques, some inherent limitations remain [41]. For instance, the models occasionally make illogical classification decisions or produce unexpected results. CLIP may misclassify an image of a skier as “snow mountain” in certain scenarios, revealing an overreliance on background features while neglecting primary subjects. This highlights the importance of enhancing the logical coherence and interpretability of classifications.

To tackle this challenge, we propose a novel strategy that uses language as a bridge between hyperspectral image features and land cover categories. By leveraging language models to analyze the detailed characteristics of land covers, we move beyond pre-defined class labels to achieve finer-grained interpretation and feature learning. For instance, when identifying specific land cover types, the model can generate relevant linguistic descriptions based on spectral information, spatial information, and other attributes, accurately recognizing and extracting key features, thus enhancing the model’s explanatory power.

However, manually defining such features is not only time-consuming but also lacks scalability. To overcome this obstacle, extensive world knowledge from large language models (LLMs) such as GPT-3 [42,43] can be utilized. These models can automatically obtain detailed visual descriptions with remote sensing knowledge based on regional categories, thereby enhancing the interpretability and accuracy while ensuring the model’s classification performance.

Faced with the limitations and interpretability demands of current remote sensing image classification tasks, an innovative network, the dual-sensor multimodal semantic collaborative classification network (DSMSC²N) has been developed in this paper. This network initially leverages LLMs to generate high-precision spectral category textual descriptions, thereby enriching the visual features with substantial semantic information. Subsequently, the ModaUnion encoder, a core component, extracts the shared features from the bimodal data of HSI and LiDAR, facilitating effective feature fusion and cooperative analysis of the high-dimensional heterogeneous data. Additionally, the mixture of experts encoder module (MoE-EN) is incorporated to deeply explore and utilize the complementary information between different modal data, effectively addressing the significant differences and high integration difficulty between modalities. Finally, a contrastive learning strategy for land cover classification is employed, which not only optimizes the model’s classification performance but also enhances the differentiation between categories, providing a more intuitive and interpretable foundation for model decision-making. The main contributions can be summarized as follows:

(1): Enhancing Land Cover Classification Accuracy with Instruction-driven Large Language Models: instruction-driven large language models guide the model to focus on and extract the critical features, thereby improving land cover classification accuracy.
(2): Improving Multisource Data Feature Extraction with the ModaUnion Encoder: the ModaUnion encoder enhances multisource data feature extraction quality by implementing parameter sharing.
(3): Addressing Multisource Heterogeneity with MoE-EN and Contrastive Learning: the MoE-EN structure and contrastive learning strategy enhance the expression of complementary information from each data source, effectively managing multisource heterogeneity.

2. Relate Work

2.1. Large Language Models

Large language models (LLMs), such as OpenAI’s GPT series [44,45], have fundamentally redefined the boundaries of human–computer interaction since their inception. They essentially represent and compress human knowledge and experience, mapping it into the language space. LLMs continuously demonstrate their powerful capabilities, excelling not only in natural language processing fields but also extending their influence to a wide range of applications, thereby empowering other domains. They can transcend limitations and solve complex problems once solvable only by specialized algorithms or domain experts, demonstrating exceptional adaptability and innovation in cross-disciplinary tasks.

For instance, in the field of mathematical reasoning, Yue et al. [46] fine-tuned LLMs through instruction-based training, enabling them to excel in mathematical tasks. In the medical domain [47], Elliot et al. [48] fine-tuned smaller, more targeted medical LLMs, achieving impressive results comparable to larger models. In the field of chemistry, Li et al. [49] introduced the concept of in-context molecule alignment (ICMA), enabling large language models to learn molecule-description alignment.

Our study in this paper explores the innovative application of LLMs in the domain of remote sensing image analysis, utilizing them to enhance the accuracy of remote sensing land cover classification.

2.2. Multi-Modal Contrastive Representation Learning

Multi-modal contrastive representation learning aims to create a unified representation space that integrates information from various sensory modalities. This framework includes multiple single-modality encoders, each tailored to a specific modality (such as image, text, or audio). These encoders are responsible for transforming the raw input of their respective modalities into a form that can be compared within a shared representation domain. These single-modal encoders are optimized through pre-training on large-scale paired data with a contrastive loss function, ensuring that semantically related inputs from different modalities are close to each other in the representation space, while unrelated inputs are far apart. This approach captures cross-modal semantic associations within the shared representation space, providing a robust foundation for subsequent multi-modal downstream tasks.

In recent years, visual-language contrastive pre-trained models have achieved remarkable success, with models such as CLIP [36] and UMG-CLIP [50] demonstrating outstanding zero-shot retrieval and classification performance, as well as significant generalization capabilities across numerous downstream task [51,52,53,54,55,56,57]. Arora et al. [58] proposed a theoretical framework to directly analyze the generalization of InfoNCE on downstream tasks. Wang et al. [59] introduced alignment and uniformity as empirical metrics to predict downstream performance. Haochen et al. [60] analyzed self-supervised contrastive learning from the perspective of matrix factorization. Huang et al. [61] primarily focused on theoretically analyzing the generalization of self-supervised contrastive learning, understanding and mathematically characterizing the role of data augmentation. Building on these theoretical foundations, this study also explores the role of contrastive learning in multi-modal remote sensing land cover classification.

3. Method

In this section, we present the technical details of DSMSC²N as illustrated in Figure 1, which comprises the following components.

(1): Automatic generation of category descriptions to create textual data corresponding to the categories.
(2): ModaUnion Encoder is used for extracting shared features from HSI and LiDAR.
(3): HSI and LiDAR Encoders are used for extracting visual embedding vectors. Text Encoder is used for extracting language embedding vectors.
(4): HSI-LiDAR bidirectional contrastive loss (HLBCL), HSI-Text bidirectional contrastive loss (HTBCL), and cross-entropy (CE) loss are used for training the entire model.

3.1. Building Descriptors

We adopted an automated approach to construct a set of hyperspectral category descriptors, as shown in Figure 2. Specifically, we achieved this by instructing LLMs (such as GPT-3) to describe the spectral features unique to each category in hyperspectral images. We guided the LLMs using input prompts like the following example:

Q: What are useful visual features for distinguishing these classes in a Hyperspectral image? To represent with the most significant description.

A: Healthy grass: High chlorophyll content and uniform canopy cover, reflecting strongly in the green and NIR regions.

LLMs possess vast and diverse global knowledge, including extensive geoscientific information in the hyperspectral image domain. By effectively utilizing these large models, researchers can accurately extract relevant knowledge about land cover categories and uncover key features hidden in hyperspectral images. This approach provides a more comprehensive and in-depth perspective for land cover classification, thereby further improving its accuracy and effectiveness.

3.2. Vision and Text Encoder

In multimodal land cover classification research, we followed the mainstream paradigm of vision-language pre-training and trained the model through targeted design and adjustments. Our goal was to fully exploit the information from HSI, LiDAR, and advanced abstract representations of land cover category descriptions provided by language models, achieving more accurate and interpretable land cover category recognition and analysis.

DSMSC²N comprises three different vision encoders and one text encoder. The vision encoders include the ModaUnion encoder, MoE-EN for HSI, and MoE-EN for LiDAR. The ModaUnion encoder extracts common features across multiple modalities, facilitating the deep collaborative processing of heterogeneous data. The MoE-EN module addresses the issue of source heterogeneity by compressing source-specific patterns into distinct encodings.

ModaUnion Encoder: The fusion encoder often struggles to fully utilize the unique information in each modality, while the dual encoder may analyze different modalities in isolation, leading to underutilization of the complementary low-frequency information from HSI and LiDAR. To address this, this study designed the ModaUnion encoder, based on the InceptionNeXt architecture, to effectively integrate information from both modalities, enhance feature learning, and better coordinate complementary information by leveraging HSI’s spatial–spectral coupling and LiDAR’s precise terrain data.

This encoder, through parameter sharing, facilitates the synergistic handling of HSI and LiDAR data. Specifically, the ModaUnion encoder consists of InceptionNeXt modules, as illustrated in Figure 3, inspired by the original Inception framework. These modules decompose large convolutional kernels into parallel smaller convolutional kernels and orthogonal strip convolutional kernels, promoting the effective extraction of information across various scales and orientations. This design not only allows for the efficient capture of low-frequency common features between the two modalities but also significantly boosts the computational efficiency and data processing capacity of the model through parallel processing. Moreover, the implementation of global response normalization (GRN) technology optimizes information transmission, achieving a profound integration of heterogeneous data.

The joint encoding

Z_{p} \in R^{P}

of the two modalities, HSI images and LiDAR data, is achieved through the ModaUnion encoder, represented as follows:

Z_{p} = E_{U n i o n} (X_{H S I}, X_{L i D A R})

(1)

MoE-EN of HSI: The MoE-EN of HSI consists of a spectral–spatial perception branch and a global spectral understanding branch, as shown in Figure 1. The spectral–spatial perception branch utilizes the spatial window multi-head self-attention (SW-MHSA) mechanism and the channel group multi-head self-attention (CG-MHSA) mechanism. Combined with the global spectral understanding branch, this configuration captures local spectral–spatial features and global spectral semantic information, comprehensively enhancing the model’s capability to understand and represent HSI features.

Specifically, the SW-MHSA mechanism finely partitions the input HSI into multiple local regions, defining a bounded spatial window for each region where attention weights are computed and allocated. Within the multi-head architecture, distinct heads specialize in learning various local spatial dependency patterns. This feature enables SW-MHSA to model local spatial information more efficiently, significantly enhancing the model’s capability to capture and learn spatial contextual information.

Meanwhile, the CG-MHSA mechanism subdivides the channel dimension of the input features into multiple independent groups, performing self-attention computations within each group to learn and comprehend the interdependencies among channels. This design allows the model to focus on exploring the intrinsic connections between spectral channels, thereby deepening the understanding of inter-spectral correlations within HSI. Moreover, through the global spectral understanding branch, the model learns and disentangles dependencies among spectral features, acquiring a contextual representation of the central pixel from a global spectral perspective. This enriches the model’s understanding of contextual semantic information between spectral features.

Through the HSI encoder, this study achieves the precise handling of spatial, global spectral, and local associative information in hyperspectral images. This strengthens the model’s ability to understand spatial characteristics and fosters cooperative learning of spectral relational information, ultimately enhancing the model’s overall representational power and classification accuracy.

The joint encoding, denoted as

Z

, undergoes HSI encoding to obtain HSI embeddings

v_{h} \in R^{N}

and is then mapped to a raw embedding

H_{d} \in R^{D}

through a projection head, represented as follows:

v_{h} = E_{H S I} (Z)

(2)

H_{d} = f_{h} (v_{h})

(3)

MoE-EN of LiDAR: The MoE-EN of LiDAR utilizes a frequency domain attention mechanism (spectrum former), as described in Figure 1. Because LiDAR data contain rich elevation information, such as building boundaries and vegetation shapes, this encoder fully learns LiDAR’s elevation features, significantly enhancing the collaborative classification performance.

Specifically, Spectrum-LiDAR innovatively integrates a transformer encoder architecture with a self-attention mechanism enhanced by Fourier transforms, achieving the efficient processing of LiDAR data. By mapping the data into the frequency domain and amplifying its high-frequency components, the system sharply focuses on critical topographical features essential for constructing an overall scene understanding. This process not only markedly enhances the model’s robustness, enabling it to handle data fluctuations caused by varying viewpoints and environmental changes with stability, but also ensures that even under noisy or challenging acquisition conditions, the model can generate more accurate and stable representations.

Furthermore, Spectrum-LiDAR employs the inverse Fourier transform to convert the frequency-optimized data back into the spatial domain. This step paves the way for subsequent multimodal feature integration, ensuring that information from various sensors is effectively combined.

The joint encoding, denoted as

Z

, undergoes LiDAR encoding to obtain LiDAR embeddings

v_{l} \in R^{N}

and is then mapped to a raw embedding

L_{d} \in R^{D}

through a projection head, represented as follows:

v_{l} = E_{L i D A R} (Z)

(4)

L_{d} = f_{l} (v_{l})

(5)

Text Encoder: We used a text encoder (a frozen pre-trained RemoteCLIP ViT-B-32 model) to generate meaningful embeddings

t \in R^{M}

for hyperspectral remote sensing data.

The RemoteCLIP model excels by combining remote sensing images with corresponding specialized textual information (such as land cover classifications and environmental descriptions on satellite images). This approach enables the model to grasp intricate connections between remote sensing visual content and professional narratives. Through this process, the model embeds profound remote sensing expertise, significantly enhancing its ability to recognize complex land cover and phenomena in remote sensing images. This includes accurately distinguishing between different vegetation communities, analyzing changes in land use patterns, and precisely identifying water body characteristics. Compared to the general-purpose CLIP model, RemoteCLIP demonstrates superior accuracy and efficiency in specialized classification and recognition tasks for remote sensing images, due to its tailored learning and understanding of the remote sensing context. We project them to

T_{d} \in R^{D}

as follows:

t = E_{t x t} (X_{T e x t})

(6)

T_{d} = f_{t} (t)

(7)

where

f_{t}

denotes the projection head, while

E_{t x t}

represents the text encoder. This arrangement ensures that the embedding dimension

D

aligns with that of the visual encoder, rendering it conducive for contrastive learning.

3.3. Loss Function

In previous discussions, the heterogeneity of multi-source data has been clearly identified as a key bottleneck limiting the performance improvement in hyperspectral multimodal land cover classification. To address this issue, we introduced an innovative strategy by constructing a specially designed joint training loss function system to guide the model in deeply exploring and effectively integrating shared features from multi-source data. The optimization framework for this training phase comprised three core components. Firstly, HSI-LiDAR bidirectional contrastive loss (HLBCL) is for HSI and LiDAR data. Secondly, HSI-Text bidirectional contrastive loss (HTBCL) is for HSI and text data; and finally, the cross-entropy (CE) loss is for land cover classification. In conclusion, the overall optimization objective can be formally expressed as follows:

L_{a l l} = λ (ω_{1} L_{B C L}^{H L} + ω_{2} L_{B C L}^{H T}) + (1 - λ) L_{C E}

(8)

L_{C E} = - \frac{1}{N} \sum_{i = 1}^{N} \sum_{k = 1}^{K} (y_{i k} l o g (p (x_{i k})))

(9)

where

λ

is the hyperparameter to balance the training of the CE Loss and multiple BCL losses;

ω

is a learnable weight used to aggregate the contributions of different modal sources to land classification;

N

represents the number of samples;

K

denotes the total number of classes;

y_{i k}

indicates whether the

i^{t h}

sample belongs to the

k^{t h}

land cover class; and

p (x_{i k})

denotes the probability that the model predicts the

i^{t h}

sample belongs to the

k^{t h}

land cover class.

To comprehensively elucidate the entire training process, we began by providing a detailed introduction to BCL. Given N paired instances from two different modalities, we mapped the

i^{t h}

pair to L2-normalized embeddings

x_{i}

and

t_{i}

via two encoders.

L_{i, j}^{x \to t} = l o g \frac{e^{(\frac{s i m (x_{i}, t_{i})}{τ})}}{\sum_{k = 1, k \neq i}^{N} e^{(\frac{s i m (x_{i}, t_{k})}{τ})}}

(10)

L_{i, j}^{t \to x} = l o g \frac{e^{(\frac{s i m (t_{i}, x_{i})}{τ})}}{\sum_{k = 1, k \neq i}^{N} e^{(\frac{s i m (t_{i}, x_{k})}{τ})}}

(11)

s i m (x, t) = \frac{x^{T} t}{‖x‖ ‖t‖}

(12)

L_{B C L} = - \frac{1}{2 N} \sum_{i = 1}^{N} [L_{i, j}^{x \to t} + L_{i, j}^{t \to x}]

(13)

where

s i m (\cdot)

represents the cosine similarity between the embeddings of

x_{i}

and

t_{i}

and

τ

represents a temperature parameter initialized at 0.07.

Next, we analyzed why multi-modal contrastive learning is useful for downstream land classification tasks, which can be decomposed into two parts.

L_{i, j}^{x \to t} = \frac{s i m (x_{i}, t_{i})}{τ} - l o g (\sum_{j = 1}^{N} e^{(\frac{s i m (x_{i}, t_{j})}{τ})})

(14)

Here,

f (u) = \frac{u}{‖u‖}

. we reformulated the equation as follows:

L_{i, j}^{x \to t} = \frac{f (x_{i})^{T} f (t_{i})}{τ} - l o g (\sum_{j = 1}^{N} e^{(\frac{f (x_{i})^{T} f (t_{j})}{τ})})

(15)

L_{i, j}^{x \to t} = L_{p o s} - L_{r e g}

(16)

The first part

L_{p o s}

ensures that the positive samples are as close as possible in the feature space, which is also one of the goals of contrastive learning. The second part

L_{r e g}

ensures that the centers of different categories are sufficiently far apart in the feature space, as simply bringing positive samples closer together might lead to the dimensional collapse.

As depicted in Figure 4, the samples from different modalities of the same category are drawn infinitely close to the same point, and only a minimal angular separation

\frac{⟨E_{x_{i}}, E_{t_{j}}⟩}{‖E_{x_{i}}‖ ‖E_{t_{j}}‖} < 1

is needed to accurately distinguish different categories. This approach prevents the learned representations from collapse.

The analysis indicated that multimodal contrastive learning enhanced the clustering of the learned feature space, maintaining sufficient distance between category centers and preventing dimensional collapse. This mechanism significantly improved the accuracy of the downstream classification tasks.

Utilizing HTBCL and HLBCL, our method enabled the model to acquire more accurate and semantically rich land cover feature representations. These loss mechanisms work synergistically, driving the model to deeply explore and understand the similarities and differences between different modal data. This comprehensive utilization of multi-source data significantly enhances the performance and accuracy of land cover classification tasks.

3.4. Final Classifier

The overall training process of DSMSC²N is elaborated in Algorithm 1. Feature embeddings

H_{d}

and

L_{d}

extracted from HSI and LiDAR data by the MoE-EN are fed into a carefully designed multilayer perceptron (MLP) to perform the final classification task. This MLP is built on multiple hidden layers, each integrating linear layer and GReLU activation functions, aimed at enhancing the network’s nonlinear expressive power. The size of the output layer is customized according to the specific number of categories in the classification task. The Softmax function is used to normalize the outputs, ensuring that the sum of category probabilities for each sample is 1, thus constructing the posterior probability distribution of category. By identifying the maximum value in the probability distribution, precise land cover classification

Y_{p}

is achieved as follows:

p (x_{i}) = M L P (C o n c a t (H_{d}^{i}, L_{d}^{i}))

(17)

Y_{p} = a r g m a x (p (x_{i}))

(18)

p (x_{i})

denotes the probability that the model predicts the

i^{t h}

sample.

Algorithm 1 gives the overall network training process of DSMSC²N.

Algorithm 1 Training DSMSC²N model.

Input: HSI images:

D_{H S I}

, LiDAR data:

D_{L i D A R}

, Text data:

X_{T e x t}

, Training labels:

y

Output: land cover result:

Y_{p}

1. Initialize: batch size = 64, epochs = 200, the initial learning rate set to 5 × 10⁻⁴ of AdamW;
2. Patches: Divide

D_{H S I}

and D_{L i D A R}

into contextual cubes X_{H S I}

and patches X_{L i D A R}

respectively;
  3. for i = 1 to epochs do
  4. \\ Extact Featre Embedding-ModaUnion Encoder
  5.

Z_{P}^{i}

\leftarrow ModaUnion Encoder (X_{H S I}^{i}

, X_{L i D A R}^{i})

;
6.

H_{d}^{i}

\leftarrow MoE - EN of HSI (Z_{P}^{i})

;
7.

L_{d}^{i}

\leftarrow MoE - EN of LiDAR (Z_{P}^{i})

;
8.

T_{d}^{i}

\leftarrow Text Encoder (X_{T e x t}^{i})

;
9. Optimize Feature Representation and Update the discriminators by optimizing Equation (8);
10. Obtain the land cover result:

Y_{p}

by computing Equations (17) and (18).
11. end for

4. Results

4.1. Datasets

Houston 2013 dataset: This dataset involves two datasets—hyperspectral images and LiDAR-derived DSM, both consisting of 349 × 1905 pixels with the same spatial resolution (2.5 m). The data were acquired by the NSF-funded Center for Airborne Laser Mapping (NCALM) over the University of Houston’s 2013 campus and the neighboring urban area. The HSI has 144 spectral bands in the 380 nm to 1050 nm region, including 15 classes. Figure 5 gives the visualization results of the Houston 2013 dataset. These data and reference classes can be obtained online from the IEEE GRSS website (http://dase.grss-ieee.org/ (accessed on 10 September 2022)). The land class details of Houston 2013 are listed in Table 1.

Trento dataset: This dataset involves two datasets—hyperspectral images and LiDAR-derived DSM, both consisting of 600 × 166 pixels with the same spatial resolution (1 m). The data were acquired by the AISA Eagle sensor, and the LiDAR DSM was produced using first- and last-point cloud pulses obtained by the Optech ALTM 3100EAsensor. The HSI has 63 spectral bands covering the 402.89 to 989.09 nm region and includes six classes. Figure 6 gives the visualization results of the Trento dataset, and Table 2 lists the number of samples of different classes and the color of each class.

MUUFL Gulfport dataset: This dataset was captured over the University of Southern Mississippi campus in November 2010 using the reflective optics system imaging spectrometer (ROSIS) sensor. It comprises a hyperspectral image (HSI) with dimensions of 325 × 220 pixels and 72 spectral bands. Additionally, the LiDAR image includes elevation data provided in two raster layers, with a total of 64 bands after the removal of 8 initial and final bands affected by noise. This dataset encompasses 11 urban land-cover classes, with 53,687 ground truth pixels annotated. There are a total of 53,687 ground truth pixels spanning 11 classes. For training, 100 pixels per class are selected, leaving 52,587 pixels for testing. Figure 7 provides a visualization of the MUUFL Gulfport dataset, and the number of samples in each class and the corresponding colors for each class is listed in Table 3.

4.2. Implementation Details

Configuration: This experiment was conducted on a computational server equipped with a Tesla P100 GPU and an Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz processor running Ubuntu 18.04. The experimental environment was built using Python 3.8 and PyTorch 2.0.1 deep learning framework. Our network utilized a fixed patch size of 11 × 11 for all datasets. These patches are centered around pixels with known ground truth labels. The model was trained with a batch size of 64 for 200 epochs. The training and validation datasets were randomly split at a ratio of 8:2, ensuring no overlap between training and test samples. The AdamW optimizer and cosine learning rate scheduling strategy were employed, with an initial learning rate set to 5 × 10⁻⁴ and a weight decay coefficient of 1 × 10⁻¹. For the CG-MHSA module, the number of groups

g

was set to 1, and for the contrastive learning loss, the hyperparameter

τ

was set to 0.07. Additionally, the weighting hyperparameter

λ

in the final joint loss was set to 0.6.

Evaluation Matrix: We employed several widely used quantitative metrics to evaluate the performance of the proposed network and compared it with other methods. These metrics include overall accuracy (OA), average accuracy (AA), and the statistical Kappa coefficient (K). OA indicates the proportion of correctly classified test samples to all test samples, while AA represents the average accuracy across classes. The statistical Kappa coefficient (K) reflects the level of agreement between the classification maps generated by the model under consideration and the provided ground truth.

4.3. Ablation Study

To assess the impact of the proposed network architecture and the integration of LiDAR and text modal data on land classification performance, ablation experiments were conducted on the Trento dataset. Transformer was employed as the baseline network for comparative analysis. Furthermore, the impact of the joint loss function on the proposed network was examined.

Effect of Incorporating LiDAR and Text Modal: The experimental results in Table 4 demonstrate a significant improvement in hyperspectral land cover classification performance with the inclusion of LiDAR and textual modalities.

We compared the classification performance of single HSI images, single LiDAR data, the combination of HSI and LiDAR, and the tri-modal combination of HSI, LiDAR, and text. The LiDAR data provide elevation information, helping to distinguish different land cover classes, while the textual modality contributes semantic descriptions of land covers (such as names, functions, and attributes), enriching the semantic context of the classification task. The tri-modal combination exhibited improvements over the baseline models using only HSI or LiDAR, with increases in OA, AA, and Kappa of 2.79%, 3.59%, and 3.73%, respectively, compared to the HSI baseline model, and of 9.94%, 2.47%, and 13.16%, respectively, compared to the LiDAR baseline model. Additionally, compared to the combined HSI-LiDAR baseline model, the tri-modal setup achieved increases in OA, AA, and Kappa of 0.03%, 0.01%, and 0.04%, respectively, providing evidence of the value of integrating LiDAR and text modalities for enhancing the hyperspectral land classification performance.

Effect of Fusion Strategies: We investigated the collaborative methods of auxiliary encoders based on LiDAR and text data and a primary encoder based on HSI in classification models, implementing three fusion strategies: early fusion, late fusion, and our proposed fusion method. Early fusion integrates the feature information extracted at various levels by the LiDAR and text auxiliary encoders with the hyperspectral primary encoder. Late fusion combines the features of the LiDAR and text auxiliary encoders with those of the HSI primary encoder only at the final layer of the model. In contrast, our proposed fusion method first uses the ModaUnion module for early fusion to extract the common features from the HSI and LiDAR modalities, then employs MoE-EN to obtain the characteristic features of the HSI and LiDAR modalities, and finally performs late fusion on the features of HSI, LiDAR, and text. The experimental results are shown in Table 5.

Compared with the early and late fusion strategies, our proposed fusion method leverages the advantages of the ModaUnion module and MoE-EN, demonstrating superior classification performance and effectively utilizing the complementary information among LiDAR, HSI, and text modalities.

Effect of MoE-EN: We analyzed the experimental results of the expert hybrid network structure, revealing its outstanding performance and advantages in remote sensing classification tasks. Regarding the spectral–spatial perception branch in the HSI encoder, the roles of the SW-MHSA and CG-MHSA mechanisms were discussed (see Table 6). The experimental results indicate that using CG-MHSA alone, compared to using SW-MHSA alone, achieved increases of 0.37%, 0.63%, and 0.51% in OA, AA, and Kappa metrics, respectively. This comparison indirectly reveals the importance of the rich spectral information contained in HSI images for improving the classification performance. Specifically, in analyses dominated by spectral features, CG-MHSA effectively utilizes this information to enhance the feature learning capabilities. Moreover, the combined use of SW-MHSA and CG-MHSA, compared to using no mechanism, improved OA, AA, and Kappa metrics by 1.37%, 2.16%, and 1.83%, respectively.

When compared to using SW-MHSA alone, the combined approach increased OA, AA, and Kappa by 0.82%, 1.35%, and 1.1%, respectively. Additionally, compared to using CG-MHSA alone, the combined approach resulted in enhancements of 0.45%, 0.72%, and 0.59% in OA, AA, and Kappa, respectively. This series of experimental results strongly indicates the critical importance of simultaneously considering local spatial structural features and overall spectral information in HSI classification tasks. Particularly in complex scenarios, a single type of attention mechanism is insufficient to fully exploit all potentially beneficial information. The fusion of spatial features and spectral information becomes an effective strategy for improving recognition accuracy. This finding further validates the significant enhancement of HSI image classification performance achieved through the combined application of SW-MHSA and CG-MHSA, highlighting the value of multidimensional information integration.

Furthermore, we discussed the roles of the spectral–spatial perception branch, global spectral understanding branch, and the LiDAR encoder (see Table 7 for details). By incorporating global spectral understanding, the model’s performance in handling land cover classes with complex spectral characteristics, such as roads, showed improvements of 0.25%, 0.41%, and 0.35% in OA, AA, and Kappa, respectively, compared to the model without global spectral understanding. However, if the spectral–spatial perception branch is absent, leaving only the global spectral understanding branch, there is a decrease in OA, AA, and Kappa of 6.63%, 9.89%, and 8.8%, respectively. The spectrum former in the LiDAR encoder efficiently extracts the elevation features, such as building boundaries and vegetation shapes, improving the recognition of land cover categories dependent on elevation information, such as ground and roads.

We also observed an interesting phenomenon: when the global spectral understanding branch is combined with the spectrum former, the performance unexpectedly decreases compared to using only the global spectral understanding branch. Our analysis suggests that the elevation frequency domain information extracted by spectrum former interferes with the spectral feature information captured by the global spectral understanding branch, resulting in performance degradation. However, the introduction of the spectral–spatial perception branch, through operations in the spatial dimension, helps the model better understand and utilize the relationships among the spectral, spatial, and frequency domain features, thus enhancing performance.

In the MoE-EN, HSI and LiDAR demonstrate significant advantages in capturing HSI features and extracting elevation features, respectively, collectively contributing to the precise and efficient execution of remote sensing classification tasks.

Effect of Loss Function: We also discussed the ablation study on the effect of loss functions on the Trento dataset, as illustrated in Figure 8 with the t-SNE plots of feature embeddings. Compared to using the BCE loss function alone, the combined application of HTBCL and HLBCL significantly improves the feature embedding performance. This combined approach demonstrates a more pronounced intra-class cohesion and inter-class separation, resulting in tighter clustering and clearer boundaries between categories. These findings strongly confirm that the combination of HTBCL and HLBCL can optimize the model’s feature representation capabilities, which is particularly critical for addressing the complex challenges in land cover classification.

4.4. Comparison with Other Methods

To validate the effectiveness of our proposed model for joint land cover classification, we compared it with seven state-of-the-art methods. Specifically, these methods include: two-branch [30], which utilizes a dual-branch convolutional network for land cover classification; EndNet [31], which integrates multi-modal features using a deep encoding-decoding network for land cover classification; MDL-Middle [62], which employs an intermediate layer fusion network with multi-source features for land cover classification; MAHiDFNet [63], which employs a multi-attention hierarchical dense fusion network for land cover classification; FusAtNet [64], which combines self-attention and cross-attention mechanisms to extract spatial–spectral features from HSI and LiDAR data for land cover classification; CALC [65], a classification method based on coupled adversarial learning for land cover classification; and SepG-ResNet50 [66], which introduces the SepDGConv module to automatically learn the group convolution structure within a single-stream neural network, enabling deep single-stream CNN models for land cover classification based on this module.

We meticulously followed the original documentation and open-source code default configurations for all comparative methods, optimizing their parameters accordingly. To ensure consistency, the training and testing samples used by all comparative methods were identical to those used by the proposed model. Performance metrics were derived from the average results of 10 experiments for each method, enhancing statistical reliability. Although data augmentation is a common strategy to mitigate overfitting and improve classification performance, simple image transformation-based augmentations (e.g., flipping and rotating) can be quickly learned by the model, potentially compromising the fairness of comparisons. Therefore, DSMSC²N did not employ any data augmentation strategies in its comparative evaluations.

Results on Houston 2013 dataset:

Table 8 presents the classification results of different algorithms on the Houston 2013 dataset. The results demonstrate that the proposed DSMSC²N outperforms other methods in terms of OA (91.49%), AA (92.69%), and Kappa (90.76%) coefficients. It achieves the highest accuracy in seven categories, namely “Healthy grass”, “Soil”, “Commercial”, “Road”, “Railway”, “Parking Lot2”, and “Tennis court”. In particular, the performance of “Soil” (100%) stands out significantly compared to other methods. However, the effectiveness of SepDGConv, which employs a deep single-stream CNN model, falls short in comparison to other methods. Although SepDGConv exhibits a clever design with good global modeling capability, the feature representation of a single-stream model is insufficient for the joint classification of the HSI and LiDAR data.

Figure 9 presents the classification result maps of all the methods included in Table 8. In comparison to other models, DSMSC²N demonstrates a reduction in classification errors, particularly for ground objects with high similarity in complex scenes, such as “Road”. This highlights the significant advantage of DSMSC²N, which leverages a cleverly designed network structure guided by text information, in effectively expressing and organizing heterogeneous information.

Results on Trento dataset:

Table 9 presents the classification results of all methods on the Trento dataset, further demonstrating the effectiveness of our proposed DSMSC²N, which outperforms other methods in terms of OA with a score of 98.93%, AA with a score of 98.16%, and Kappa with a score of 98.57%. Considering the differences in size and data type, the classification accuracy on the Trento dataset is generally higher than that of the Houston 2013 and MUUFL Gulfport datasets. From Table 9, it can be observed that DSMSC²N achieves satisfactory results in each category, which can be attributed to the carefully designed ModaUnion encoder and MoE-EN modules in DSMSC²N. In particular, the performance of the “Vineyard” category (99.70%) stands out compared to other methods.

We visualize the classification results in Figure 10. It can be observed that DSMSC²N accurately delineates the edges of categories such as “Apples”, presenting clearer and smoother contours. In contrast, the boundaries obtained by other methods exhibit noticeable jagged edges and lack smoothness. This indicates that the DSMSC²N model is more powerful than other classification methods in terms of fine-grained feature representation and extraction capabilities. It can capture subtle details of complex scenes and achieve more precise and coherent semantic understanding, particularly excelling in depicting complex boundaries.

Results on MUUFL Gulfport dataset:

On the challenging MUUFL Gulfport dataset, Table 10 presents the classification results of various methods. It is encouraging to note that our approach achieved the highest OA of 98.93%, AA of 98.16%, and Kappa of 98.57%. These results demonstrate that DSMSC²N, by effectively utilizing multimodal information and leveraging language models, achieved the highest accuracy rates in categories such as “Trees”, “Soil”, “Mostly grass”, “Mixed ground surface”, and “Sidewalk”.

We visually depict the classification results in Figure 11. Upon observation, DSMSC²N exhibited a superior visual performance compared to other comparative algorithms, resulting in a closer proximity to the ground truth. Notably, DSMSC²N successfully captured subtle details in complex scenes, such as “Sidewalk”, and demonstrated enhanced coherence in the classification of “Trees”, accompanied by clear boundaries. These findings further validate the model’s exceptional performance in capturing details and accurately delineating boundaries.

4.5. Computational Complexity Analysis

In this paper, we evaluate the computational complexity of different models using two metrics, floating-point operations (FLOPs) and the number of parameters (#param), as shown in Table 11. FLOPs represent the number of floating-point operations required for processing a single image and completing one forward pass, reflecting the time complexity of the model. #param represents the total number of parameters in the model, determining the size of the model itself and directly impacting the memory consumption during inference, reflecting the spatial complexity of the model.

Compared to the two-branch, EndNet, MDL-Middle, MAHiDFNet, FusAtNet, CALC, and SepG-ResNet50 models, our proposed model in this paper presents a more compact and efficient architecture design, with a relatively small number of trainable parameters, ranking fourth among the eight models. It achieves good performance with a lower model space complexity. Additionally, DSMSC²N employs multiple transformer-based encoder branches, enabling more comprehensive feature extraction. However, the multi-head self-attention mechanism in transformers incurs high memory usage and frequent high-bandwidth memory (HBM) read/write operations, imposing significant pressure on computational efficiency and memory consumption. To address this challenge and improve DSMSC²N’s computational efficiency, we accelerate it using Flash Attention [67], resulting in DSMSC²N ranking fifth in terms of FLOPs among the eight models. Considering the trade-off between effectiveness and complexity, although our model requires more FLOPs, it occupies less memory space. This trade-off leads to a significant improvement in classification accuracy, achieving an optimal balance between performance and complexity.

5. Conclusions

This paper introduces a novel remote sensing multimodal semantic cooperative classification Network designed to fully leverage the semantic information from large language models and synergistically utilize multi-source remote sensing data for the precise joint classification of HSI and LiDAR data. Using an instruction-driven approach, DSMSC²N employs LLMs to generate informative descriptions of land cover categories enriched with remote sensing domain knowledge, guiding the model to focus on and exploit key features accurately. Furthermore, DSMSC²N effectively fuses multi-source remote sensing data using the ModaUnion and MoE-EN modules, mining complementary features across different modalities and mitigating the adverse effects of data heterogeneity. Comprehensive experiments on three benchmark datasets validated the superior performance of DSMSC²N compared to various baseline methods. This work demonstrates the immense potential of integrating LLMs with multimodal remote sensing learning, offering a novel and effective solution for enhancing the accuracy and robustness of remote sensing scene understanding.

In the future, we plan to explore advanced fusion strategies and further refine the design of the textual modality to achieve greater performance improvements in remote sensing tasks. Concurrently, we will also continue to refine the model architecture, focusing on in-depth adjustments to the ModaUnion and MoE-EN modules. Our plan includes implementing adaptive mechanisms that allow the network to adjust automatically according to data characteristics and task requirements, enhancing its adaptability across diverse environments and complex scenarios.

Author Contributions

Conceptualization, A.W., S.D., H.W. and Y.I.; methodology, S.D.; software S.D.; validation S.D.; writing—review and editing A.W., S.D., H.W. and Y.I.; All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Key Research and Development Plan Project of Heilongjiang (JD2023SJ19), the Natural Science Foundation of Heilongjiang Province (LH2023F034), the high-end foreign expert introduction program (G2022012010L), the Science and Technology Project of Heilongjiang Provincial Department of Transportation (HJK2024B002) and the Key Research and Development Program Guidance Project of Heilongjiang (GZ20220123). Iwahori’s research was supported by the Japan Society for the Promotion of Science (JSPS) Grant-in-Aid Scientific Research(C) (24K15019) and the Chubu University Grant.

Data Availability Statement

Houston2013: https://hyperspectral.ee.uh.edu/?page_id=459 (accessed on 10 September 2022); Trento: https://drive.google.com/drive/folers/1HK3eL3loI4WdRFr1psLLmVLTVDLctGd (accessed on 17 July 2012), MUFFL: https://github.com/GatorSense/MUUFLGulfport/ (accessed on 17 April 2017).

Conflicts of Interest

The authors declare no conflicts of interest.

Correction Statement

This article has been republished with a minor correction to the Funding statement. This change does not affect the scientific content of the article.

References

Sishodia, R.P.; Ray, R.L.; Singh, S.K. Applications of remote sensing in precision agriculture: A review. Remote Sens. 2020, 12, 3136. [Google Scholar] [CrossRef]
Tan, K.; Jia, X.; Plazae, A. Special Section Guest Editorial: Satellite Hyperspectral Remote Sensing: Algorithms and Applications. J. Appl. Remote Sens. 2021, 42601, 1. [Google Scholar]
Ahmad, M.; Shabbir, S.; Roy, S.K.; Hong, D.; Wu, X.; Yao, J.; Khan, A.M.; Mazzara, M.; Distefano, S.; Chanussot, J. Hyperspectral image classification—Traditional to deep models: A survey for future prospects. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 15, 968–999. [Google Scholar]
Li, S.; Song, W.; Fang, L.; Chen, Y.; Ghamisi, P.; Benediktsson, J.A. Deep learning for hyperspectral image classification: An overview. IEEE Trans. Geosci. Remote Sens. 2019, 57, 6690–6709. [Google Scholar]
Liu, Q.; Xiao, L.; Yang, J.; Chan, J.C.-W. Content-guided convolutional neural network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2020, 58, 6124–6137. [Google Scholar]
Zakaria, Z.B.; Islam, M.R. Hybrid 3DNet: Hyperspectral Image Classification with Spectral-spatial Dimension Reduction using 3D CNN. Int. J. Comput. Appl. 2022, 975, 8887. [Google Scholar]
Ma, A.; Filippi, A.M.; Wang, Z.; Yin, Z.; Huo, D.; Li, X.; Güneralp, B. Fast sequential feature extraction for recurrent neural network-based hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2020, 59, 5920–5937. [Google Scholar]
Yang, X.; Cao, W.; Lu, Y.; Zhou, Y. Hyperspectral image transformer classification networks. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5528715. [Google Scholar]
He, L.; Li, J.; Liu, C.; Li, S. Recent advances on spectral–spatial hyperspectral image classification: An overview and new guidelines. IEEE Trans. Geosci. Remote Sens. 2017, 56, 1579–1597. [Google Scholar]
Zhang, Y.; Lan, C.; Zhang, H.; Ma, G.; Li, H. Multimodal remote sensing image matching via learning features and attention mechanism. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5603620. [Google Scholar]
Ma, X.; Zhang, X.; Pun, M.-O.; Liu, M. A multilevel multimodal fusion transformer for remote sensing semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5403215. [Google Scholar]
Wang, Q.; Chen, W.; Huang, Z.; Tang, H.; Yang, L. MultiSenseSeg: A cost-effective unified multimodal semantic segmentation model for remote sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4703724. [Google Scholar]
Li, J.; Hong, D.; Gao, L.; Yao, J.; Zheng, K.; Zhang, B.; Chanussot, J. Deep learning in multimodal remote sensing data fusion: A comprehensive review. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102926. [Google Scholar]
Gómez-Chova, L.; Tuia, D.; Moser, G.; Camps-Valls, G. Multimodal classification of remote sensing images: A review and future directions. Proc. IEEE 2015, 103, 1560–1584. [Google Scholar]
Ma, M.; Ma, W.; Jiao, L.; Liu, X.; Li, L.; Feng, Z.; Yang, S. A multimodal hyper-fusion transformer for remote sensing image classification. Inf. Fusion 2023, 96, 66–79. [Google Scholar]
Ghamisi, P.; Rasti, B.; Yokoya, N.; Wang, Q.; Hofle, B.; Bruzzone, L.; Bovolo, F.; Chi, M.; Anders, K.; Gloaguen, R. Multisource and multitemporal data fusion in remote sensing: A comprehensive review of the state of the art. IEEE Geosci. Remote Sens. Mag. 2019, 7, 6–39. [Google Scholar]
Dong, P.; Chen, Q. LiDAR Remote Sensing and Applications; CRC Press: Boca Raton, FL, USA, 2017. [Google Scholar]
Dalponte, M.; Bruzzone, L.; Gianelle, D. Fusion of hyperspectral and LIDAR remote sensing data for classification of complex forest areas. IEEE Trans. Geosci. Remote Sens. 2008, 46, 1416–1427. [Google Scholar]
Ghamisi, P.; Benediktsson, J.A.; Phinn, S. Land-cover classification using both hyperspectral and LiDAR data. Int. J. Image Data Fusion 2015, 6, 189–215. [Google Scholar]
Dong, W.; Yang, T.; Qu, J.; Zhang, T.; Xiao, S.; Li, Y. Joint contextual representation model-informed interpretable network with dictionary aligning for hyperspectral and LiDAR classification. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 6804–6818. [Google Scholar]
Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. Convnext v2: Co-designing and scaling convnets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 16133–16142. [Google Scholar]
Yu, W.; Zhou, P.; Yan, S.; Wang, X. Inceptionnext: When inception meets convnext. arXiv 2023, arXiv:2303.16900. [Google Scholar]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Ding, M.; Xiao, B.; Codella, N.; Luo, P.; Wang, J.; Yuan, L. Davit: Dual attention vision transformers. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 74–92. [Google Scholar]
Patro, B.N.; Namboodiri, V.P.; Agneeswaran, V.S. SpectFormer: Frequency and Attention is what you need in a Vision Transformer. arXiv 2023, arXiv:2304.06446. [Google Scholar]
Zhang, M.; Li, W.; Tao, R.; Li, H.; Du, Q. Information fusion for classification of hyperspectral and LiDAR data using IP-CNN. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5506812. [Google Scholar]
Zhao, G.; Ye, Q.; Sun, L.; Wu, Z.; Pan, C.; Jeon, B. Joint classification of hyperspectral and LiDAR data using a hierarchical CNN and transformer. IEEE Trans. Geosci. Remote Sens. 2022, 61, 5500716. [Google Scholar]
Xue, Z.; Tan, X.; Yu, X.; Liu, B.; Yu, A.; Zhang, P. Deep hierarchical vision transformer for hyperspectral and LiDAR data classification. IEEE Trans. Image Process. 2022, 31, 3095–3110. [Google Scholar]
Xu, X.; Li, W.; Ran, Q.; Du, Q.; Gao, L.; Zhang, B. Multisource remote sensing data classification based on convolutional neural network. IEEE Trans. Geosci. Remote Sens. 2017, 56, 937–949. [Google Scholar]
Hong, D.; Gao, L.; Hang, R.; Zhang, B.; Chanussot, J. Deep encoder–decoder networks for classification of hyperspectral and LiDAR data. IEEE Geosci. Remote Sens. Lett. 2020, 19, 5500205. [Google Scholar]
Li, J.; Liu, Y.; Song, R.; Liu, W.; Li, Y.; Du, Q. HyperMLP: Superpixel Prior and Feature Aggregated Perceptron Networks for Hyperspectral and Lidar Hybrid Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5505614. [Google Scholar]
Wang, X.; Zhu, J.; Feng, Y.; Wang, L. MS2CANet: Multi-scale Spatial-Spectral Cross-modal Attention Network for Hyperspectral image and LiDAR Classification. IEEE Geosci. Remote Sens. Lett. 2024, 21, 5501505. [Google Scholar]
Song, T.; Zeng, Z.; Gao, C.; Chen, H.; Li, J. Joint Classification of Hyperspectral and LiDAR Data Using Height Information Guided Hierarchical Fusion-and-Separation Network. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5505315. [Google Scholar]
Chen, F.-L.; Zhang, D.-Z.; Han, M.-L.; Chen, X.-Y.; Shi, J.; Xu, S.; Xu, B. Vlp: A survey on vision-language pre-training. Mach. Intell. Res. 2023, 20, 38–56. [Google Scholar]
Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning, Baltimore, MA, USA, 17–23 July 2022; pp. 12888–12900. [Google Scholar]
Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 19730–19742. [Google Scholar]
Dai, W.; Li, J.; Li, D.; Tiong, A.M.H.; Zhao, J.; Wang, W.; Li, B.; Fung, P.N.; Hoi, S. Instructblip: Towards general-purpose vision-language models with instruction tuning. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2024; p. 2142. [Google Scholar]
Liu, F.; Chen, D.; Guan, Z.; Zhou, X.; Zhu, J.; Ye, Q.; Fu, L.; Zhou, J. Remoteclip: A vision language foundation model for remote sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5622216. [Google Scholar]
Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.-T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.-H.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, Online, 18–24 July 2021; pp. 4904–4916. [Google Scholar]
Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
Floridi, L.; Chiriatti, M. GPT-3: Its nature, scope, limits, and consequences. Minds Mach. 2020, 30, 681–694. [Google Scholar]
Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z. A survey of large language models. arXiv 2023, arXiv:2303.18223. [Google Scholar]
Roumeliotis, K.I.; Tselikas, N.D. Chatgpt and open-ai models: A preliminary review. Future Internet 2023, 15, 192. [Google Scholar] [CrossRef]
Yue, X.; Qu, X.; Zhang, G.; Fu, Y.; Huang, W.; Sun, H.; Su, Y.; Chen, W. Mammoth: Building math generalist models through hybrid instruction tuning. arXiv 2023, arXiv:2309.05653. [Google Scholar]
Zhou, H.; Gu, B.; Zou, X.; Li, Y.; Chen, S.S.; Zhou, P.; Liu, J.; Hua, Y.; Mao, C.; Wu, X. A survey of large language models in medicine: Progress, application, and challenge. arXiv 2023, arXiv:2311.05112. [Google Scholar]
Bolton, E.; Venigalla, A.; Yasunaga, M.; Hall, D.; Xiong, B.; Lee, T.; Daneshjou, R.; Frankle, J.; Liang, P.; Carbin, M. Biomedlm: A 2.7 b parameter language model trained on biomedical text. arXiv 2024, arXiv:2403.18421. [Google Scholar]
Li, J.; Liu, W.; Ding, Z.; Fan, W.; Li, Y.; Li, Q. Large Language Models are in-Context Molecule Learners. arXiv 2024, arXiv:2403.04197. [Google Scholar]
Shi, B.; Zhao, P.; Wang, Z.; Zhang, Y.; Wang, Y.; Li, J.; Dai, W.; Zou, J.; Xiong, H.; Tian, Q. UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding. arXiv 2024, arXiv:2401.06397. [Google Scholar]
Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv 2022, arXiv:2204.06125. [Google Scholar]
Luo, H.; Ji, L.; Zhong, M.; Chen, Y.; Lei, W.; Duan, N.; Li, T. Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning. Neurocomputing 2022, 508, 293–304. [Google Scholar]
Rao, Y.; Zhao, W.; Chen, G.; Tang, Y.; Zhu, Z.; Huang, G.; Zhou, J.; Lu, J. Denseclip: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18082–18091. [Google Scholar]
Narasimhan, M.; Rohrbach, A.; Darrell, T. Clip-it! language-guided video summarization. Adv. Neural Inf. Process. Syst. 2021, 34, 13988–14000. [Google Scholar]
Zhang, R.; Guo, Z.; Zhang, W.; Li, K.; Miao, X.; Cui, B.; Qiao, Y.; Gao, P.; Li, H. Pointclip: Point cloud understanding by clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8552–8562. [Google Scholar]
Elizalde, B.; Deshmukh, S.; Al Ismail, M.; Wang, H. Clap learning audio concepts from natural language supervision. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
Xue, L.; Yu, N.; Zhang, S.; Panagopoulou, A.; Li, J.; Martín-Martín, R.; Wu, J.; Xiong, C.; Xu, R.; Niebles, J.C. Ulip-2: Towards scalable multimodal pre-training for 3d understanding. arXiv 2023, arXiv:2305.08275. [Google Scholar]
Arora, S.; Khandeparkar, H.; Khodak, M.; Plevrakis, O.; Saunshi, N. A theoretical analysis of contrastive unsupervised representation learning. arXiv 2019, arXiv:1902.09229. [Google Scholar]
Wang, T.; Isola, P. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, Online, 13–18 July 2020; pp. 9929–9939. [Google Scholar]
HaoChen, J.Z.; Wei, C.; Gaidon, A.; Ma, T. Provable guarantees for self-supervised deep learning with spectral contrastive loss. Adv. Neural Inf. Process. Syst. 2021, 34, 5000–5011. [Google Scholar]
Huang, W.; Yi, M.; Zhao, X.; Jiang, Z. Towards the generalization of contrastive self-supervised learning. arXiv 2021, arXiv:2111.00743. [Google Scholar]
Hong, D.; Gao, L.; Yokoya, N.; Yao, J.; Chanussot, J.; Du, Q.; Zhang, B. More diverse means better: Multimodal deep learning meets remote-sensing imagery classification. IEEE Trans. Geosci. Remote Sens. 2020, 59, 4340–4354. [Google Scholar]
Wang, X.; Feng, Y.; Song, R.; Mu, Z.; Song, C. Multi-attentive hierarchical dense fusion net for fusion classification of hyperspectral and LiDAR data. Inf. Fusion 2022, 82, 1–18. [Google Scholar]
Mohla, S.; Pande, S.; Banerjee, B.; Chaudhuri, S. Fusatnet: Dual attention based spectrospatial multimodal fusion network for hyperspectral and lidar classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 13–19 June 2020; pp. 92–93. [Google Scholar]
Lu, T.; Ding, K.; Fu, W.; Li, S.; Guo, A. Coupled adversarial learning for fusion classification of hyperspectral and LiDAR data. Inf. Fusion 2023, 93, 118–131. [Google Scholar]
Yang, Y.; Zhu, D.; Qu, T.; Wang, Q.; Ren, F.; Cheng, C. Single-stream CNN with learnable architecture for multisource remote sensing data. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5409218. [Google Scholar]
Dao, T.; Fu, D.; Ermon, S.; Rudra, A.; Ré, C. Flashattention: Fast and memory-efficient exact attention with io-awareness. Adv. Neural Inf. Process. Syst. 2022, 35, 16344–16359. [Google Scholar]

Figure 1. An overview of the proposed DSMSC²N.

Figure 2. Workflow for automated construction of a high-dimensional spectral class descriptor collection.

Figure 3. Graphical representation of ModaUnion encoder.

Figure 4. The mechanism of clustering.

Figure 5. The visualization of the Houston 2013 dataset. (a) Pseudo color map of an HSI. (b) DSM of LiDAR. (c) Training sample map. (d) Testing sample map.

Figure 6. The visualization of the Trento dataset. (a) Pseudo color map of an HSI. (b) DSM of LiDAR. (c) Training sample map. (d) Testing sample map.

Figure 7. The visualization of the MUUFL Gulfport dataset. (a) Pseudo color map of an HSI. (b) DSM of LiDAR. (c) Training sample map. (d) Testing sample map.

Figure 8. T-SNE visualization of loss functions on Trento. (a) CE; (b) without HTBCL; (c) all.

Figure 9. Classification maps of Houston 2013. (a) Ground-truth map; (b) two-branch. (c) EndNet; (d) MDL-Middle; (e) MAHiDFNet; (f) FusAtNet; (g) CALC; (h) SepG-ResNet50; (i) DSMSC²N.

Figure 10. Classification maps of Trento. (a) Ground-truth map; (b) two-branch; (c) EndNet; (d) MDL-Middle; (e) MAHiDFNet; (f) FusAtNet; (g) CALC; (h) SepG-ResNet50; (i) DSMSC²N.

Figure 11. Classification maps of MUUFL Gulfport. (a) Ground-truth map; (b) two-branch; (c) EndNet; (d) MDL-Middle; (e) MAHiDFNet; (f) FusAtNet; (g) CALC; (h) SepG-ResNet50; (i) DSMSC²N.

Table 1. Houston 2013 dataset: the numbers of training and testing samples for each class.

Class	Class Name	Train Num	Test Num
C1	Healthy Grass	198	1053
C2	Stressed Grass	190	1064
C3	Synthetic Grass	192	505
C4	Trees	188	1056
C5	Soil	186	1056
C6	Water	182	143
C7	Residential	196	1072
C8	Commercial	191	1053
C9	Road	193	1059
C10	Highway	191	1036
C11	Railway	181	1054
C12	Parking Lot1	192	1041
C13	Parking Lot2	184	285
C14	Tennis Court	181	247
C15	Running Track	187	473
-	Total	2832	12,197

Table 2. Trento dataset: the numbers of training and testing samples for each class.

Class	Class Name	Train Number	Test Number
C1	Apples	129	3905
C2	Buildings	15	2778
C3	Ground	105	374
C4	Woods	154	8969
C5	Vineyard	184	10,317
C6	Roads	122	3252
-	Total	819	29,395

Table 3. MUUFL Gulfport dataset: the numbers of training and testing samples for each class.

Class	Class Name	Train Number	Test Number
C1	Trees	100	23,146
C2	Mostly grass	100	4170
C3	Mixed ground surface	100	6782
C4	Dirt and sand	100	1726
C5	Road	100	6587
C6	Water	100	366
C7	Buildings shadow	100	2133
C8	Buildings	100	6140
C9	Sidewalk	100	1285
C10	Yellow curb	100	83
C11	Cloth panels	100	169
-	Total	1100	52,587

Table 4. Ablation experiments of incorporating multi-modal on Trento.

Data Source			Evaluation Index
HSI	LiDAR	Text	OA (%)	AA (%)	K × 100
✓	✗	✗	95.99	94.55	94.64
✗	✓	✗	88.84	85.67	85.21
✓	✓	✗	98.75	98.13	98.33
✓	✓	✓	98.78	98.14	98.37

Table 5. Ablation experiments of fusion strategies on Trento.

Network Structure	Evaluation Index
Network Structure	OA (%)	AA (%)	K × 100
Transformer (EF)	96.96	94.43	95.28
MoE-EN(LF)	97.01	94.46	96.01
ModaUnion + MoE-EN (MF)	98.78	98.14	98.37

Table 6. Ablation experiments of MoE-EN on Trento.

Network Structure	Evaluation Index
Network Structure	OA (%)	AA (%)	K × 100
MHSA	97.41	95.98	96.54
SW-MHSA	97.96	96.79	97.27
CG-MHSA	98.33	97.42	97.78
SW-MHSA + CG-MHSA	98.78	98.14	98.37

Table 7. Ablation experiments of MoE-EN on Trento.

MoE-EN			ACC (%)						Evaluation Index
Spatial–Channel Sub-Branch	Spectral Context Sub-Branch	Spectrum Former	Apples	Buildings	Ground	Woods	Vineyard	Roads	OA (%)	AA (%)	K × 100
✓	✗	✗	97.17	98.03	99.37	99.92	99.99	87.55	98.10	97.01	97.46
✗	✓	✗	88.37	99.08	77.99	95.72	95.03	69.04	91.72	87.53	89.01
✓	✓	✗	98.35	99.59	98.74	99.99	99.97	90.47	98.35	97.42	97.81
✓	✗	✓	96.78	98.76	95.81	99.58	99.93	92.27	98.42	97.19	97.90
✗	✓	✓	86.59	95.59	77.99	97.27	91.40	75.20	91.00	87.34	88.04
✓	✓	✓	97.76	99.48	98.95	99.62	99.97	93.01	98.78	98.13	98.37

Table 8. Comparison of classification results on Houston 2013.

Class	Two- Branch	EndNet	MDL- Middle	MAHiDFNet	FusAtNet	CALC	SepG-ResNet50	DSMSC²N
Healthy grass	82.90	81.58	83.10	82.91	80.72	86.51	72.36	90.12
Stressed grass	84.31	83.65	85.06	84.68	97.46	84.59	77.35	84.59
Synthetic grass	96.44	100.00	99.60	100.00	90.69	90.50	34.85	98.81
Trees	96.59	93.09	91.57	93.37	99.72	91.86	86.84	90.91
Soil	99.62	99.91	98.86	99.43	97.92	100.00	91.38	100.00
Water	82.52	95.10	100.00	99.30	93.71	99.30	95.10	95.80
Residential	85.54	82.65	97.64	83.58	91.98	88.71	81.62	91.12
Commercial	76.64	81.29	88.13	81.96	85.19	83.19	61.73	95.38
Road	87.35	88.29	85.93	83.76	85.93	91.60	86.31	95.04
Highway	60.71	89.00	74.42	66.41	69.50	65.44	46.26	67.66
Railway	90.61	83.78	84.54	74.57	85.48	95.92	69.35	97.22
Parking Lot1	90.78	90.39	95.39	88.38	89.15	90.78	86.94	93.66
Parking Lot2	86.67	82.46	87.37	88.42	77.19	91.93	78.25	92.63
Tennis court	92.31	100.00	95.14	100.00	84.21	94.74	87.04	100.00
Running track	99.79	98.10	100.00	100.00	87.53	100.00	18.82	97.46
OA (%)	86.68	88.52	89.55	85.87	88.14	88.84	72.67	91.49
AA (%)	87.52	89.95	91.05	88.55	87.76	90.34	71.63	92.69
K × 100	85.56	87.59	87.59	84.76	87.12	87.92	70.40	90.76

Table 9. Comparison of classification results on Trento.

Class	Two- Branch	EndNet	MDL- Middle	MAHiDFNet	FusAtNet	CALC	SepG-ResNet50	DSMSC²N
Apples	98.61	93.95	99.93	100.00	99.45	94.55	93.28	99.33
Buildings	98.93	96.54	98.14	99.80	89.87	99.55	99.38	97.42
Ground	75.16	96.24	97.08	96.03	91.23	92.69	74.35	96.66
Woods	98.72	99.36	99.93	100.00	93.86	100.00	99.88	99.29
Vineyard	97.43	80.72	98.54	95.30	92.92	99.53	95.91	99.70
Roads	96.83	90.14	89.51	86.74	90.71	93.82	68.05	96.60
OA (%)	96.53	90.86	98.14	96.89	93.53	98.30	93.82	98.93
AA (%)	92.30	92.81	97.19	96.31	93.01	96.69	88.47	98.16
K × 100	95.38	88.01	97.52	95.87	91.51	97.74	91.79	98.57

Table 10. Comparison of classification results on MUUFL Gulfport.

Class	Two- Branch	EndNet	MDL- Middle	MAHiDFNet	FusAtNet	CALC	SepG- ResNet50	DSMSC²N
Trees	90.29	83.55	87.31	89.87	80.36	91.97	86.78	94.23
Mostly grass	75.68	79.38	76.38	63.19	73.57	81.77	78.47	85.81
Mixed ground surface	69.71	76.28	68.33	75.85	68.24	77.56	71.20	81.97
Dirt and sand	93.97	87.08	78.74	96.18	70.74	95.19	89.98	86.65
Road	91.79	89.59	83.76	88.52	80.95	89.19	76.38	89.72
Water	99.73	95.90	88.52	85.25	81.15	100.00	99.73	99.65
Buildings shadow	91.84	88.28	92.12	9..72	89.40	95.17	91.70	94.18
Buildings	94.79	92.07	89.69	95.44	87.92	96.91	87.31	90.87
Sidewalk	72.30	76.96	77.04	75.80	77.98	69.81	69.88	79.75
Yellow curb	96.39	95.18	86.75	91.57	78.31	95.18	90.36	92.00
Cloth panels	97.63	97.63	99.41	99.41	99.41	100.00	99.41	98.82
OA (%)	87.03	84.33	83.54	86.45	79.24	89.31	82.90	91.19
AA (%)	88.56	87.45	84.37	86.80	80.73	90.25	85.56	90.87
K × 100	83.10	79.85	78.84	82.37	73.81	86.05	77.94	88.33

Table 11. Comparison of FLOPs and parameters of different classification models on Houston 2013.

Methods	#Param. (M)	FLOPs (M)
Two-Branch	5.6	120.81
EndNet	0.09	0.09
MDL-Middle	0.1	5.28
MAHiDFNet	77.0	155.00
FusAtNet	36.9	3460.31
CALC	0.3	28.79
SepG-ResNet50	14.7	48.28
DSMSC²N	0.7	104.92

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wang, A.; Dai, S.; Wu, H.; Iwahori, Y. Multimodal Semantic Collaborative Classification for Hyperspectral Images and LiDAR Data. Remote Sens. 2024, 16, 3082. https://doi.org/10.3390/rs16163082

AMA Style

Wang A, Dai S, Wu H, Iwahori Y. Multimodal Semantic Collaborative Classification for Hyperspectral Images and LiDAR Data. Remote Sensing. 2024; 16(16):3082. https://doi.org/10.3390/rs16163082

Chicago/Turabian Style

Wang, Aili, Shiyu Dai, Haibin Wu, and Yuji Iwahori. 2024. "Multimodal Semantic Collaborative Classification for Hyperspectral Images and LiDAR Data" Remote Sensing 16, no. 16: 3082. https://doi.org/10.3390/rs16163082

APA Style

Wang, A., Dai, S., Wu, H., & Iwahori, Y. (2024). Multimodal Semantic Collaborative Classification for Hyperspectral Images and LiDAR Data. Remote Sensing, 16(16), 3082. https://doi.org/10.3390/rs16163082

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Multimodal Semantic Collaborative Classification for Hyperspectral Images and LiDAR Data

Abstract

1. Introduction

2. Relate Work

2.1. Large Language Models

2.2. Multi-Modal Contrastive Representation Learning

3. Method

3.1. Building Descriptors

3.2. Vision and Text Encoder

3.3. Loss Function

3.4. Final Classifier

4. Results

4.1. Datasets

4.2. Implementation Details

4.3. Ablation Study

4.4. Comparison with Other Methods

4.5. Computational Complexity Analysis

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Correction Statement

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI