Article

HyperKon: A Self-Supervised Contrastive Network for Hyperspectral Image Analysis

by Daniel La’ah Ayuba 1,*, Jean-Yves Guillemaut 1, Belen Marti-Cardona 2 and Oscar Mendez 1
1 Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, Surrey GU2 7XH, UK
2 Centre for Environmental Health and Engineering, University of Surrey, Guildford, Surrey GU2 7XH, UK
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(18), 3399; https://doi.org/10.3390/rs16183399
Submission received: 26 July 2024 / Revised: 4 September 2024 / Accepted: 10 September 2024 / Published: 12 September 2024
(This article belongs to the Special Issue Advances in Hyperspectral Remote Sensing Image Processing)

Abstract

The use of a pretrained image classification model (trained on cats and dogs, for example) as a perceptual loss function for hyperspectral super-resolution and pansharpening tasks is surprisingly effective. However, RGB-based networks do not take full advantage of the spectral information in hyperspectral data. This motivated the creation of HyperKon, a dedicated hyperspectral Convolutional Neural Network backbone built with self-supervised contrastive representation learning. HyperKon uniquely leverages the high spectral continuity, range, and resolution of hyperspectral data through a spectral attention mechanism. We also perform a thorough ablation study of different layer types, evaluating how well each learns representations from hyperspectral data. Notably, HyperKon achieves a remarkable 98% Top-1 retrieval accuracy and surpasses traditional RGB-trained backbones in both pansharpening and image classification tasks. These results highlight the potential of hyperspectral-native backbones and herald a paradigm shift in hyperspectral image analysis.

1. Introduction

Hyperspectral images (HSIs), with their ability to capture detailed spectral information across hundreds of contiguous bands, have rapidly advanced the capabilities of remote sensing analysis in various domains, including agriculture, mineralogy, and environmental monitoring [1,2]. These high-dimensional data offer a rich representation of scenes, enabling finer material distinctions than traditional RGB and multispectral images [3]. However, the exploitation of HSIs, particularly using deep learning techniques initially designed for RGB images, presents considerable challenges [4].
Convolutional Neural Networks (CNNs), which dominate the computer vision landscape, are predominantly trained and evaluated on RGB data [5]. These models often struggle to generalize effectively to hyperspectral data due to the vast difference in spectral resolution and the unique characteristics of HSIs [5,6,7]. Moreover, the scarcity of hyperspectral-native backbones means that RGB-pretrained models must undergo extensive fine-tuning before they can be applied in this domain.
Hyperspectral-native CNN architectures, including 3D-CNNs [8], spatial–spectral residual CNNs [9,10], and hybrid dilated convolution networks [11], were developed to effectively process spectral–spatial information in hyperspectral data. These models have made significant progress by utilizing attention-based mechanisms that focus on the most relevant spectral and spatial features, thereby enhancing interpretability and boosting performance in various downstream tasks [12]. Hybrid CNN–Transformer architectures, which combine the strengths of local and global feature analysis through spectral–spatial and self-attention layers [13], have also emerged as powerful tools in this domain. However, despite their promising performance, these hybrid architectures still face challenges when dealing with the high-dimensional nature of hyperspectral data. In contrast, existing CNN layers such as the Squeeze and Excitation Block (SEB) [14] and Convolutional Block Attention Module (CBAM) [15] offer efficient solutions for processing high-dimensional data but have been largely unexplored in the hyperspectral domain. Our work presents an ablation study that demonstrates how these layers not only excel at handling high-dimensional hyperspectral data but also bring the benefits of attention mechanisms into traditional CNN architectures.
Recent advancements in the field have led to the development of Remote Sensing Foundation Models (RSFMs), which represent significant progress in the processing and analysis of remote sensing data, including HSIs. These models, drawing inspiration from the achievements of large language models in natural language processing, aim to provide a flexible and reliable foundation for a range of remote sensing applications. One notable example is SpectralGPT [16], which, although trained on multispectral images with 12 channels, adapts the generative pretrained transformer (GPT) architecture for spectral data analysis. SpectralGPT’s training on only 12 bands likely reflects the challenges of scaling transformer architectures to the hundreds of channels typically found in HSIs. Scaling transformers to such a large number of channels leads to a quadratic increase in the model’s number of parameters, making them computationally expensive and challenging to implement effectively in hyperspectral contexts. In contrast, modern attention mechanisms that operate within CNN architectures, such as the SEB, offer similar advantages of attention found in transformers but with a significantly smaller number of parameters.
The emergence of self-supervised learning in remote sensing, exemplified by models like SeCo [17], has opened new avenues for leveraging large amounts of unlabeled satellite imagery. These approaches enable the learning of rich, transferable representations from diverse remote sensing data sources, potentially improving the performance of downstream tasks in HSI analysis. Furthermore, the development of multimodal foundation models for remote sensing, such as the works by [18,19,20], demonstrates the potential for integrating various data sources to create more comprehensive and robust models for Earth observation tasks.
Self-supervised contrastive learning has emerged as a powerful paradigm in remote sensing, addressing the challenge of limited labeled data. Recent works have adapted contrastive learning frameworks for remote sensing applications [21], demonstrating improved performance in land cover change detection tasks. In the hyperspectral domain, approaches like [22,23] have shown promising results in learning spectral–spatial features without relying on labeled data. The integration of contrastive learning with attention mechanisms [24] has further improved the ability to capture long-range dependencies in hyperspectral data, leading to more robust models.
Despite these advances, several challenges persist in HSI analysis. The high dimensionality and distinct properties of hyperspectral data continue to pose computational challenges, particularly in resource-constrained environments like onboard satellite systems. The increasing demand for real-time remote sensing data analysis, as exemplified by missions like Intuition-1 [25], necessitates the development of efficient, lightweight models capable of processing hyperspectral data in orbit. To address these constraints and the scarcity of hyperspectral-native solutions, we introduce HyperKon, a self-supervised contrastive network trained solely on HSIs from the Environmental Mapping and Analysis Program (EnMAP) [26]. Unlike generic CNN backbones and newer RSFMs, HyperKon is trained on HSIs with 224 channels while maintaining a compact architecture.
The main contributions of this paper are:
  • HyperKon: a hyperspectral-native CNN backbone that can learn useful representations from large amounts of unlabeled data.
  • EnHyperSet-1: an EnMAP dataset curated for use in precision agriculture and other deep learning projects.
  • Hyperspectral perceptual loss: a novel perceptual loss function minimizing errors in the spectral domain.
  • Demonstration that the representations learned by HyperKon improve performance in hyperspectral downstream tasks.
The remainder of this paper is organized as follows. Section 2 describes the materials and methodology, including the HyperKon architecture and contrastive learning approach. Section 3 presents experimental results and evaluations. Section 4 offers discussions on the findings, and Section 5 concludes the paper with future research directions.

2. Materials and Methods

2.1. Dataset

This study introduces EnHyperSet-1, a hyperspectral dataset curated from the EnMAP mission [26] for deep learning applications. The dataset comprises 800 scenes (200 Level 1B, 200 Level 1C, 400 Level 2A), each with an average size of 1300 × 1200 pixels and 224 spectral bands ranging from 420 nm to 2450 nm. Each 30 m × 30 m pixel captures spectral information at a spectral resolution of 6.5 nm to 12 nm. EnHyperSet-1 features diverse global urban, forest, and agricultural scenes, as summarized in Table 1. For analysis, we extracted 160 × 160 pixel patches using a sliding window with a 5% overlap buffer, with edge patches zero-padded when necessary. The choice of 160 × 160 pixel patches was driven by the need to maintain consistency with existing networks and backbones used in hyperspectral image processing, such as hyperspectral pansharpening [27], while also balancing spatial detail and computational efficiency. Although the dataset consisted of data from different processing levels, these levels were not conflated or used as one. Instead, they were strategically leveraged for contrastive learning, enabling the network to learn more robust representations. A detailed explanation of how these different processing levels were utilized in the contrastive learning process is provided in Section 2.3.2.
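For illustration, the following is a minimal sketch of the sliding-window patch extraction described above, assuming a scene already loaded as a NumPy array of shape (H, W, 224); the exact EnMAP I/O and overlap bookkeeping used to build EnHyperSet-1 may differ.

```python
# A minimal sketch of 160 x 160 patch extraction with ~5% overlap and zero-padded
# edge patches; the scene shape below is a small synthetic stand-in (real EnMAP
# scenes in EnHyperSet-1 average about 1300 x 1200 pixels with 224 bands).
import numpy as np

def extract_patches(scene: np.ndarray, patch: int = 160, overlap: float = 0.05) -> np.ndarray:
    """Slide a window over an (H, W, C) hyperspectral scene, zero-padding edge patches."""
    stride = int(patch * (1.0 - overlap))          # 5% overlap -> stride of 152 px
    h, w, c = scene.shape
    patches = []
    for top in range(0, h, stride):
        for left in range(0, w, stride):
            tile = scene[top:top + patch, left:left + patch, :]
            if tile.shape[:2] != (patch, patch):   # zero-pad patches at the scene edges
                padded = np.zeros((patch, patch, c), dtype=scene.dtype)
                padded[:tile.shape[0], :tile.shape[1], :] = tile
                tile = padded
            patches.append(tile)
    return np.stack(patches)

dummy_scene = np.zeros((400, 380, 224), dtype=np.float32)
print(extract_patches(dummy_scene).shape)          # (n_patches, 160, 160, 224)
```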
HyperKon was pretrained on EnHyperSet-1 using self-supervised contrastive learning with the Normalized Temperature-Scaled Cross Entropy (NT-Xent) loss [28]. The network was trained for 1000 epochs with a batch size of 32, using the Adam optimizer (initial learning rate: 1 × 10−4) and a StepLR scheduler. This methodology enabled effective representation learning from unlabeled hyperspectral data while addressing the unique challenges of different processing levels.
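The optimization setup quoted above (Adam with an initial learning rate of 1 × 10⁻⁴, a StepLR schedule, 1000 epochs, batch size 32) could be configured roughly as follows; the StepLR step size and decay factor are assumptions, since they are not specified in the text, and the linear layer is only a placeholder for the HyperKon encoder.

```python
# A minimal sketch of the stated pretraining optimization setup; the loss shown is a
# dummy stand-in for the NT-Xent objective computed on EnHyperSet-1 mini-batches.
import torch

model = torch.nn.Linear(224, 128)  # placeholder for the HyperKon encoder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)  # assumed values

for epoch in range(1000):
    # one epoch over EnHyperSet-1 mini-batches (size 32) with the NT-Xent loss would go here
    loss = model(torch.randn(32, 224)).pow(2).mean()   # dummy loss for illustration only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```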

2.2. Hyperspectral Backbone Architecture

The HyperKon architecture (Figure 1) is based on ResNeXt’s multibranch cardinality [29] and employs parallel paths to improve representational and computational efficiency. The ResNeXt architecture has fewer parameters, making it less sensitive to learning rates and other hyperparameters, especially when compared to its predecessor, ResNet [30].
Developing an effective feature extractor for this network required overcoming two main challenges: capturing the complex, high-dimensional spectral and spatial features of hyperspectral data and addressing the computational constraints associated with such high-dimensional data. In our model, we fine-tuned the feature maps to better reflect channel interdependencies using a specialized architecture based on the SEB [14]. This approach processed an input feature map $X \in \mathbb{R}^{H \times W \times C}$ to calculate channelwise statistics, allowing the model to more effectively highlight relevant spectral and spatial details within the hyperspectral data. This targeted recalibration ensured the network’s computations were directly aligned with the critical features of the data, leading to more efficient and accurate analysis.
The excitation operation then used a gating mechanism:
$$ s_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_{ijc}, \qquad e_c = \sigma(\beta \cdot s_c + \gamma), $$
where $H$ and $W$ represent the height and width of the input feature map, respectively. The term $x_{ijc}$ represents the value of the feature map $X$ at spatial position $(i, j)$ in channel $c$. Here, $\sigma$ is the sigmoid activation function, and $\beta$ and $\gamma$ are trainable parameters. The recalibrated feature map $X' \in \mathbb{R}^{H \times W \times C}$ was then given by:
$$ x'_{ijc} = e_c \cdot x_{ijc} $$
This recalibration emphasized channels with high interdependencies and suppressed others. The feature extractor provided preliminary feature representations which, while powerful, might not be fine-tuned to the specific challenges presented by hyperspectral data. To address this, we introduced a projection head. Given the deep feature representation $F \in \mathbb{R}^{D}$ from the feature extractor, the projection head transformed it into a new space $Z \in \mathbb{R}^{D}$:
$$ Z = W_2 \, \sigma(W_1 F + b_1) + b_2 $$
where $W_1 \in \mathbb{R}^{D \times D}$, $W_2 \in \mathbb{R}^{D \times D}$, $b_1 \in \mathbb{R}^{D}$, and $b_2 \in \mathbb{R}^{D}$ are trainable parameters, and $\sigma$ is a non-linear activation function. This transformation aided in emphasizing discriminative features for hyperspectral data.
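A minimal PyTorch sketch of the channelwise recalibration and projection head defined above is given below; the global-average-pooling statistic, sigmoid gate, and layer widths are illustrative assumptions rather than the exact HyperKon configuration.

```python
# A minimal sketch of the channel recalibration (s_c, e_c, x'_ijc) and the projection
# head Z = W2 * sigma(W1 F + b1) + b2; sizes are placeholders, not the paper's exact values.
import torch
import torch.nn as nn

class SpectralSqueezeExcite(nn.Module):
    """Recalibrates each of the C spectral channels with a learned gate e_c."""
    def __init__(self, channels: int):
        super().__init__()
        self.beta = nn.Parameter(torch.ones(channels))    # trainable scale
        self.gamma = nn.Parameter(torch.zeros(channels))  # trainable shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, H, W)
        s = x.mean(dim=(2, 3))                             # channelwise statistics s_c
        e = torch.sigmoid(self.beta * s + self.gamma)      # excitation gate e_c
        return x * e[:, :, None, None]                     # x'_ijc = e_c * x_ijc

class ProjectionHead(nn.Module):
    """Maps backbone features F to the contrastive space Z via two linear layers."""
    def __init__(self, dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        return self.net(f)

# Example on a batch of 224-band feature maps
x = torch.randn(2, 224, 160, 160)
print(SpectralSqueezeExcite(224)(x).shape)  # torch.Size([2, 224, 160, 160])
```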

2.3. Contrastive Learning

2.3.1. Self-Supervised Contrastive Loss

While Triplet loss [31] and InfoNCE [32] can be effective, a careful selection of samples is especially challenging for hyperspectral data. The NT-Xent loss [28], which builds upon the InfoNCE concept with a softmax output layer, incorporates an L2 normalization of embeddings and typically functions as a symmetric loss. It encourages the network to discriminate between augmented versions (positive pairs) of the same hyperspectral sample and dissimilar samples (negative pairs) within a mini-batch. The primary motivation for selecting NT-Xent loss is its efficiency in contexts where generating explicit negative samples for HSIs is challenging. Positive pairs are typically generated through data augmentation techniques like random spectral scaling, spatial cropping, and random channel permutations [33]. While methods like MoCo [34] and SwAV [35] introduce momentum queues and memory banks to maintain a large set of negatives, NT-Xent relies solely on the current mini-batch for negative sampling. This computational efficiency is particularly beneficial for training on hyperspectral data, which often have high dimensionality due to the large number of spectral bands (≥200 bands). Furthermore, it is adaptable to varying batch sizes, uses cosine similarity for feature vector orientation, and incorporates a temperature parameter that controls the influence of the discrimination task on the learning process [36].
The NT-Xent loss is defined as:
$$ \ell(z_i, z_j) = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(\mathrm{sim}(z_i, z_i')/\tau)}{\sum_{j=1}^{2N} \exp(\mathrm{sim}(z_i, z_j)/\tau)} $$
where $N$ is the number of samples, $z_i$ is the feature representation of sample $i$, $\mathrm{sim}(z_i, z_j) = z_i^{\top} z_j$ is the cosine similarity between $z_i$ and $z_j$, $\tau$ is a temperature parameter that controls the smoothness of the distribution, and $z_i'$ refers to the positive feature representation for the query sample $i$.
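The loss can be implemented compactly; the sketch below follows the formulation above, assuming L2-normalized embeddings and treating all other samples in the 2N-sized batch as negatives. The temperature value and batch layout are illustrative assumptions.

```python
# A minimal sketch of NT-Xent for a batch of positive pairs (z1[i], z2[i]).
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)    # (2N, D); dot products = cosine sim
    sim = z @ z.T / tau                                    # pairwise similarities
    n = z1.shape[0]
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(mask, float("-inf"))             # exclude self-similarity
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])  # positive of i is i +/- n
    return F.cross_entropy(sim, targets)                   # symmetric NT-Xent over the batch

# Example: embeddings of 32 augmented pairs
loss = nt_xent(torch.randn(32, 128), torch.randn(32, 128))
print(loss.item())
```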

2.3.2. HSI Contrastive Sampling

Self-supervised contrastive learning relies heavily on data augmentation to generate informative positive and negative sample pairs for training [37]. In our approach, we leveraged the unique characteristics of hyperspectral data by utilizing different processing levels (e.g., 1B, 1C, 2A) to create naturally augmented samples before applying traditional transformations.
Let $x_i^l \in I_k^h$ denote a patch taken from a series of HSIs $\{I_k^h\}_{k=0}^{S}$, where $S$ represents the dataset size and $l$ represents the processing level. For each anchor patch $x_i^{l_1}$, we selected its positive pair $x_i^{l_2}$ from the same spatial region but at a different processing level. This approach ensured that positive pairs maintained spatial content while introducing processing-level variations.
These patches then underwent a series of spectral and spatial transformations, denoted by $A = \{a_1, a_2, \ldots, a_n\}$, which were applied to the patches $x_i^l$ as follows:
$$ \tilde{x}_i^l = a_n(\cdots a_2(a_1(x_i^l)) \cdots) $$
To further enhance the learning process, we leveraged the concept of hard negative mining during triplet selection [38]. This strategy focused on choosing negative pairs within the same batch that were most similar to the anchor patch. Selecting such “hard negatives” forced the model to learn more discriminative features that distinguish between even subtle spectral variations, which is crucial for HSI data due to their high dimensionality and potential spectral redundancy.
Mathematically, we used an encoder function $f$ to map each patch $x$ to its corresponding embedding $z = f(x)$. The cosine similarity function $s(z_i, z_j)$ then measured the similarity between embeddings $z_i$ and $z_j$ for a batch of patches. During hard negative mining, for a given anchor patch $x_a$ and its positive pair $x^{+}$, the hardest negative patch $x^{-}$ was selected as:
$$ x^{-} = \underset{x_k \in \{x_1, x_2, \ldots, x_B\} \setminus \{x_a, x^{+}\}}{\operatorname{argmax}} \; s(f(x_a), f(x_k)) $$
This means that the patch $x_k$ with the highest similarity score to the anchor patch $x_a$ was chosen from all patches in the batch, with the exception of the positive pair $x^{+}$ and the anchor patch $x_a$ itself (as depicted in Figure 2). By combining the use of different processing levels with hard negative mining and traditional augmentations, our approach emphasized the importance of targeted data augmentation in HSIs. This method leveraged the advantages of self-supervised learning, improving task performance by learning detailed representations from large, unlabeled datasets while preserving the unique spectral characteristics of hyperspectral data.
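The in-batch hard negative selection can be sketched as follows; the embeddings, batch layout, and shapes are placeholder assumptions used only to illustrate the argmax over all non-anchor, non-positive patches.

```python
# A minimal sketch of hard negative mining: for each anchor embedding, pick the most
# similar embedding in the batch that is neither the anchor itself nor its positive pair.
import torch
import torch.nn.functional as F

def hardest_negatives(z_anchor: torch.Tensor, z_pos: torch.Tensor) -> torch.Tensor:
    """z_anchor, z_pos: (B, D) embeddings of anchors and their positives (same index = pair)."""
    za = F.normalize(z_anchor, dim=1)
    zp = F.normalize(z_pos, dim=1)
    bank = torch.cat([za, zp], dim=0)                  # all candidate patches in the batch
    sim = za @ bank.T                                  # cosine similarity s(f(x_a), f(x_k))
    b = za.shape[0]
    idx = torch.arange(b)
    sim[idx, idx] = float("-inf")                      # exclude the anchor itself
    sim[idx, idx + b] = float("-inf")                  # exclude the positive pair
    return sim.argmax(dim=1)                           # index of the hardest negative per anchor

print(hardest_negatives(torch.randn(8, 128), torch.randn(8, 128)))
```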

2.4. Hyperspectral Perceptual Loss

The HyperSpectral Perceptual Loss (HSPL) function is designed to quantify the spectral differences between a predicted hyperspectral image ($\hat{I}$) and its reference ($I_{ref}$) in the feature embedding space. Unlike traditional loss functions that operate directly on pixel values, the HSPL leveraged the rich spectral information captured by our pretrained HyperKon network. This approach allowed for a more nuanced comparison of hyperspectral data, capturing complex spectral relationships that may not have been evident in simple pixelwise comparisons.
The HSPL is computed as follows:
$$ \mathcal{L}_{hspl} = \sum_{l=1}^{N} \frac{1}{C_l H_l W_l} \left\| f_h(I_{ref}, l) - f_h(\hat{I}, l) \right\|_F $$
where $f_h(\cdot, l)$ represents the feature maps extracted by the $l$th layer of the pretrained network, $N$ is the total number of layers considered, and $C_l$, $H_l$, and $W_l$ are the number of channels, height, and width of the feature maps at layer $l$, respectively. The Frobenius norm $\|\cdot\|_F$ quantifies the difference between the feature maps, effectively capturing the perceptual difference between the images in the hyperspectral domain.
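A minimal sketch of this computation is given below; the two-layer feature extractor is only a stand-in for the frozen, pretrained HyperKon backbone, and the layer widths are illustrative assumptions.

```python
# A minimal sketch of the HSPL: Frobenius-norm differences between feature maps at
# several layers of a (frozen) feature extractor, normalized by C_l * H_l * W_l.
import torch
import torch.nn as nn

class TinyExtractor(nn.Module):
    """Placeholder feature extractor standing in for the pretrained HyperKon backbone."""
    def __init__(self, bands: int = 224):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Conv2d(bands, 64, 3, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU()),
        ])

    def features(self, x):
        feats = []
        for layer in self.layers:
            x = layer(x)
            feats.append(x)
        return feats

def hspl(net: TinyExtractor, pred: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    loss = 0.0
    for f_pred, f_ref in zip(net.features(pred), net.features(ref)):
        _, c, h, w = f_pred.shape
        loss = loss + torch.norm(f_ref - f_pred, p="fro") / (c * h * w)
    return loss

net = TinyExtractor()
print(hspl(net, torch.randn(1, 224, 64, 64), torch.randn(1, 224, 64, 64)).item())
```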
Figure 3 illustrates the conceptual difference between the HSPL and traditional RGB-based perceptual losses. While RGB losses are limited to three broad spectral bands, the HSPL leverages the full spectral range of hyperspectral data, providing a more comprehensive measure of spectral similarity.
In terms of computational complexity, the HSPL does incur additional overhead compared to simpler pixelwise losses due to the need to compute feature maps across multiple layers. However, this increased complexity is offset by the improved performance and more meaningful loss calculations, especially for tasks that require accurate preservation of spectral characteristics.

3. Results

3.1. Ablation Study

We evaluated multiple versions of the model, each integrating different architectural components, to see how well each component of our HyperKon architecture performed and how it impacted the network’s capacity to learn meaningful representations from hyperspectral data.
Figure 4 presents the results of this ablation study, showing the Top-1 HSI retrieval accuracy for each model variant during the pretraining phase. We experimented with several key components: 3D convolutions to capture spatial–spectral relationships, Depthwise Separable Convolutions (DSC) for efficient feature extraction, the Convolutional Block Attention Module (CBAM) to enhance feature refinement, and the Squeeze and Excitation Block (SEB) for adaptive feature recalibration. The results clearly demonstrated the impact of each component on the model’s performance. Notably, the version incorporating the SEB achieved the highest accuracy, surpassing other configurations. This suggested that the adaptive feature recalibration provided by the SEB was particularly effective in capturing the complex spectral–spatial relationships in hyperspectral data.

Band Attention

We investigated various dimensionality reduction techniques commonly applied to hyperspectral data [39,40,41], including Principal Component Analysis (PCA), manual, and full band selection. These techniques are crucial in HSI processing for streamlining computation and refining feature extraction by discarding redundant information.
As shown in Figure 5, the 3D convolution model outperformed both the SEB and CBAM models when dimensionality was reduced using PCA. While the SEB and CBAM generally excelled at selecting and prioritizing information across the entire spectrum of hyperspectral bands, their advantage diminished when PCA was applied, potentially rendering their roles redundant. In contrast, 3D convolutions demonstrated superior performance when band selection was conducted a priori using PCA with 112 components, likely because 3D convolutions can effectively learn from both contiguous and non-contiguous spectral channels. This flexibility allowed 3D convolutions to adapt to the structure of the data, regardless of the continuity of the spectral bands.
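As a point of reference, spectral dimensionality reduction with PCA to 112 components, as used in this comparison, can be sketched as follows; this assumes scikit-learn is available and is not the exact preprocessing pipeline used in the experiments.

```python
# A minimal sketch of PCA-based spectral dimensionality reduction of an HSI cube.
import numpy as np
from sklearn.decomposition import PCA

def reduce_bands(cube: np.ndarray, n_components: int = 112) -> np.ndarray:
    """Project an (H, W, C) hyperspectral cube onto its first n_components spectral PCs."""
    h, w, c = cube.shape
    flat = cube.reshape(-1, c)                       # treat each pixel as a C-dimensional spectrum
    reduced = PCA(n_components=n_components).fit_transform(flat)
    return reduced.reshape(h, w, n_components)

cube = np.random.rand(160, 160, 224).astype(np.float32)
print(reduce_bands(cube).shape)                      # (160, 160, 112)
```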
As suggested in Figure 6, manual band selection gave 3D convolutions a distinct advantage. Most notably, however, the SEB and CBAM clearly surpassed 3D convolution when no band selection was carried out. This observation supported our hypothesis that when band selection is not performed a priori, attention mechanisms such as the SEB and CBAM are able to focus on the most relevant spectral bands across the entire channel spectrum. This clearly demonstrated their robustness and ability to manage the high dimensionality common to hyperspectral data.

3.2. Super-Resolution

For the super-resolution evaluation, we utilized the Pavia Center [42], Botswana [43], Chikusei [44], and EnMAP [26] datasets, each featuring different spectral bands. Specific bands from the EnHyperSet-1 dataset were selected to match the wavelengths of each dataset, optimizing the training of HyperKon. The data preparation adhered to methodologies outlined in [27,45]. The results from HyperKon were benchmarked against traditional RGB-trained backbones using metrics [46,47,48] such as the Correlation Coefficient (CC), Spectral Angle Mapper (SAM), Root-Mean-Square Error (RMSE), Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS), and Peak Signal-to-Noise Ratio (PSNR).
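For reference, two of these metrics, SAM and PSNR, can be computed as sketched below on (H, W, C) arrays; CC, RMSE, and ERGAS follow the same per-pixel and per-band pattern. These are generic formulations, not the exact evaluation code used here.

```python
# A minimal sketch of SAM (in degrees) and PSNR for hyperspectral image pairs.
import numpy as np

def sam(pred: np.ndarray, ref: np.ndarray, eps: float = 1e-8) -> float:
    """Mean spectral angle (degrees) between per-pixel spectra."""
    dot = (pred * ref).sum(axis=-1)
    norms = np.linalg.norm(pred, axis=-1) * np.linalg.norm(ref, axis=-1)
    angles = np.arccos(np.clip(dot / (norms + eps), -1.0, 1.0))
    return float(np.degrees(angles).mean())

def psnr(pred: np.ndarray, ref: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio for images scaled to [0, max_val]."""
    mse = float(np.mean((pred - ref) ** 2))
    return float(10.0 * np.log10(max_val ** 2 / mse))

ref = np.random.rand(64, 64, 102)
pred = np.clip(ref + 0.01 * np.random.randn(*ref.shape), 0, 1)
print(sam(pred, ref), psnr(pred, ref))
```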
Table 2 shows the average quantitative results for the Pavia Center dataset. HyperKon achieved superior performance across multiple metrics compared to other methods, demonstrating its effectiveness in hyperspectral super-resolution. A broader comparison of the average quantitative pansharpening results for RGB-native vs. HSI-native perceptual loss across multiple datasets is presented in Table 3. HyperKon consistently achieved the best results, indicating the advantages of a hyperspectral-native approach. Figure 7 provides visual results generated by different pansharpening algorithms, including HyperKon, for the Pavia Center, Botswana, Chikusei, and EnMAP datasets. Mean Absolute Error (MAE) heatmaps show that HyperKon had the lowest error across all spectral bands, further highlighting its superior performance.
Figure 7. Visual results generated by different pansharpening algorithms (HyperPNN [49], DARN [45], GPPNN [50], HyperTransformer [51], HyperKon (ours), and ground truth) for Pavia Center [42], Botswana [43], Chikusei [44], and EnMAP [26] datasets. MAE denotes the (normalized) Mean Absolute Error across all spectral bands.
Table 2. Average quantitative results: Pavia Center [42] *.
Pavia Center Dataset
Method | CC ↑ | SAM ↓ | RMSE ↓ | ERGAS ↓ | PSNR ↑
HySure [52] | 0.966 | 6.13 | 1.8 | 3.77 | 35.91
HyperPNN [49] | 0.967 | 6.09 | 1.67 | 3.82 | 36.7
PanNet [53] | 0.968 | 6.36 | 1.83 | 3.89 | 35.61
DARN [45] | 0.969 | 6.43 | 1.56 | 3.95 | 37.3
HyperKite [27] | 0.98 | 5.61 | 1.29 | 2.85 | 38.65
SIPSA [54] | 0.948 | 5.27 | 2.38 | 4.52 | 33.65
GPPNN [50] | 0.963 | 6.52 | 1.91 | 4.05 | 35.36
HyperTransformer [51] | 0.9881 | 4.1494 | 0.9862 | 0.5346 | 40.9525
HyperKon | 0.9883 | 3.9551 | 0.9369 | 0.5152 | 41.9808
* Best values are in bold, 2nd-best values are underlined. RMSE values are ×10⁻². ↑ means a higher value is better; ↓ means a smaller value is better.
Table 3. Comparison of the average quantitative pansharpening results for RGB-native vs. HSI-native perceptual loss *.
Metric | Botswana (RGB-Native) | Botswana (HS-Native) | Chikusei (RGB-Native) | Chikusei (HS-Native) | Pavia (RGB-Native) | Pavia (HS-Native)
CC ↑ | 0.9104 | 0.9411 | 0.9801 | 0.9777 | 0.9881 | 0.9883
SAM ↓ | 3.1459 | 2.5798 | 2.2547 | 2.4192 | 4.1494 | 3.9551
RMSE ↓ | 0.0233 | 0.0193 | 0.0123 | 0.0131 | 0.0098 | 0.0093
ERGAS ↓ | 0.6753 | 0.5249 | 0.8662 | 0.9193 | 0.5346 | 0.5152
PSNR ↑ | 27.3925 | 29.4128 | 36.8861 | 36.2889 | 40.9525 | 41.9808
* Best values are in bold. ↑ means a higher value is better; ↓ means a smaller value is better.

3.3. Transfer Learning Capability of HyperKon

In addition to the super-resolution task, the performance of the HyperKon network was evaluated on the HSI classification task using the Indian Pines, Pavia University, and Salinas Scene datasets [55]. These datasets, covering a wide range of crops, serve as an excellent benchmark for assessing the accuracy and proficiency of the HyperKon network in classifying different crops. The frozen backbone of the HyperKon network, which had been pretrained on a variety of hyperspectral data, was utilized for this task. For these transfer learning experiments, we employed a patch size of 25 × 25 pixels, which provided a suitable balance between spatial context and computational efficiency for the downstream classification task.
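The frozen-backbone setup can be sketched as follows; the small convolutional backbone, class count, and learning rate are placeholder assumptions, with the frozen module standing in for the pretrained HyperKon network.

```python
# A minimal sketch of transfer learning with a frozen backbone and a trainable
# classification head on 25 x 25 hyperspectral patches.
import torch
import torch.nn as nn

backbone = nn.Sequential(                      # stand-in for the pretrained HyperKon backbone
    nn.Conv2d(200, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten()
)
for p in backbone.parameters():
    p.requires_grad = False                    # freeze the pretrained weights

head = nn.Linear(64, 16)                       # e.g., 16 classes for Indian Pines
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

x = torch.randn(8, 200, 25, 25)                # batch of 25 x 25 patches with 200 bands
y = torch.randint(0, 16, (8,))
logits = head(backbone(x))
loss = nn.functional.cross_entropy(logits, y)
loss.backward()                                # gradients flow only into the head
optimizer.step()
print(loss.item())
```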
The results were juxtaposed with those of established methods such as SSAN [56], SSRN [57], RvT [58], HiT [59], SSFTT [60], and QSSPN [61]. The assessment utilized overall accuracy (OA), average accuracy (AA), and Kappa coefficient (Kappa) metrics. As shown in Table 4, HyperKon consistently matched or surpassed other networks, highlighting its robustness and adaptability to new data. The high accuracy demonstrated how the model could leverage discriminative features from its self-supervised learning phase to provide reliable predictions for HSI classification. The qualitative outcomes of the HSI classification can be viewed in Figure 8, and Figure 9 shows a zoomed-out prediction accuracy map for Indian Pines.

3.4. Model Efficiency Analysis

As shown in Table 4, HyperKon demonstrated higher classification accuracy across the three datasets, consistently outperforming other methods in overall accuracy, average accuracy, and Kappa coefficient. However, this performance came at a computational cost. HyperKon had a relatively high parameter count (5.54 M for Indian Pines, 4.08 M for Pavia University, 5.62 M for Salinas Scene) compared to more lightweight models like SSAN [56] and SSFTT [60]. Its FLOPs were also higher, ranging from 370.59 M to 1.32 G. Despite this, as evident from Table 5, HyperKon achieved near-perfect classification accuracy for most classes across all three datasets, with only a few classes in Indian Pines falling slightly below 100%. The model’s inference times were consistently low (around 0.0009 s per sample) across datasets, resulting in high throughput (1038–1088 samples/second). All experiments were carried out in a PyTorch [62] environment using an NVIDIA RTX 3090 GPU with 24 GB of memory.
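Throughput and per-sample inference time of the kind reported above can be measured roughly as follows; the model, batch size, and input shape are placeholders, and CUDA synchronization is included because asynchronous kernel launches would otherwise distort GPU timings.

```python
# A minimal sketch of inference-time and throughput measurement on GPU or CPU.
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(200, 64, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 16))
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()
batch = torch.randn(256, 200, 25, 25, device=device)

with torch.no_grad():
    model(batch)                                     # warm-up pass
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    model(batch)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"avg inference time: {elapsed / batch.shape[0]:.6f} s/sample, "
      f"throughput: {batch.shape[0] / elapsed:.1f} samples/s")
```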

4. Discussion

4.1. Architectural Considerations

The ablation study results (Figure 4) highlight the importance of carefully selecting architectural components for hyperspectral data processing. The superior performance of the SEB over other configurations suggests that adaptive feature recalibration is particularly effective for capturing the complex spectral–spatial relationships in HSIs.

4.2. Self-Supervised Learning for Hyperspectral Data

The success of HyperKon in learning meaningful representations from unlabeled hyperspectral data emphasizes the potential of self-supervised learning approaches in remote sensing. By achieving high Top-1 retrieval accuracy during pretraining (Figure 4), our model demonstrated its ability to capture important spectral–spatial features without the need for extensive labeled datasets. This is particularly valuable in the hyperspectral domain, where labeled data are often scarce and expensive to obtain [63,64].

4.3. Transfer Learning and Downstream Task Performance

The strong performance of HyperKon in downstream tasks, particularly in hyperspectral super-resolution (Table 2 and Table 3) and image classification (Table 4), demonstrates the transferability of the learned representations. This is a crucial finding, as it suggests that self-supervised pretraining on diverse hyperspectral data can lead to robust features that generalize well to specific tasks and datasets. The superior performance of HyperKon compared to RGB-native approaches in pansharpening tasks (Table 3) highlights the benefits of developing hyperspectral-native models. This aligns with previous hypotheses suggesting that models designed for RGB imagery may not fully exploit the rich spectral information available in hyperspectral data.

4.4. Hyperspectral Perceptual Loss

The introduction of the HSPL represents a novel contribution to the field of HSI processing. By focusing on spectral differences across all bands, rather than just pixelwise differences, the HSPL provides a more comprehensive measure of similarity for HSIs. The improved performance observed when using the HSPL (Figure 3) suggests that this approach could be valuable for a range of HSI processing tasks beyond super-resolution, such as image fusion or denoising [65].

4.5. Limitations and Future Work

While HyperKon demonstrates excellent performance across various tasks, there are several areas for potential improvement and future research. Future work could explore techniques such as network pruning or quantization to reduce computational requirements without sacrificing performance [66]. While HyperKon focuses on hyperspectral data, future research could explore integrating information from other sensor modalities, such as LiDAR or SAR, to create more comprehensive representations of Earth observation data. The current model does not explicitly account for temporal changes in hyperspectral imagery. Incorporating temporal information could enhance the model’s ability to capture dynamic processes such as vegetation phenology.

5. Conclusions

This study presented HyperKon, a self-supervised contrastive network developed for HSI analysis. HyperKon uses a unique hyperspectral-native convolutional architecture and the novel HSPL function to enhance performance in hyperspectral super-resolution and classification applications. The experimental results suggested that HyperKon outperformed traditional RGB-trained models and other state-of-the-art approaches, demonstrating its ability to preserve spectral integrity and capture complex spectral–spatial relationships. The successful application of self-supervised contrastive learning allowed for robust feature extraction from large volumes of unlabeled hyperspectral data.
The creation of the EnHyperSet-1 dataset, a comprehensive collection of high-resolution HSIs, further contributes to advancing research in this field. The dataset supports the development of models like HyperKon, which are capable of handling the high dimensionality and unique characteristics of hyperspectral data. Future research directions will involve integrating multi-modal data, improving model interpretability, and optimizing training and inference pipelines for resource-constrained environments. These efforts will ensure that HyperKon remains a valuable tool in the constantly evolving field of remote sensing.

Author Contributions

Conceptualization, D.L.A., B.M.-C. and O.M.; methodology, D.L.A.; software, D.L.A.; validation, D.L.A., B.M.-C. and O.M.; formal analysis, D.L.A.; investigation, D.L.A.; resources, O.M. and B.M.-C.; data curation, D.L.A. and B.M.-C.; writing—original draft preparation, D.L.A.; writing—review and editing, D.L.A., B.M.-C., J.-Y.G. and O.M.; visualization, D.L.A.; supervision, B.M.-C., J.-Y.G. and O.M.; project administration, O.M.; funding acquisition, O.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All EnMAP data are freely available through the EnMAP data access portal at the following link: https://www.enmap.org/data_access/. The EnMAP data are licensed products of the German Aerospace Center (DLR), all rights reserved. The utility tool to create EnHyperSet-1 is available here: https://github.com/kleffy/enhyperset. The Indian Pines, Pavia University, and Salinas datasets are available here: https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes.

Acknowledgments

This research is part of a PhD study funded by Sixteen Sands Ltd. (https://sixteensands.com/). The authors express their gratitude to Abayomi Awobokun for his generous support and insightful discussions.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MAE	Mean Absolute Error
RMSE	Root-Mean-Square Error
PSNR	Peak Signal-to-Noise Ratio
CC	Correlation Coefficient
SAM	Spectral Angle Mapper
ERGAS	Erreur Relative Globale Adimensionnelle de Synthèse
CNN	Convolutional Neural Network
DSC	Depthwise Separable Convolutions
SEB	Squeeze and Excitation Block
CBAM	Convolutional Block Attention Module
PCA	Principal Component Analysis
HSI	Hyperspectral image
NT-Xent	Normalized Temperature-Scaled Cross Entropy
HSPL	HyperSpectral Perceptual Loss
RSFM	Remote Sensing Foundation Model
EnMAP	Environmental Mapping and Analysis Program

References

  1. Cheng, C.; Zhao, B. Prospect of application of hyperspectral imaging technology in public security. In Proceedings of the International Conference on Applications and Techniques in Cyber Security and Intelligence ATCI 2018: Applications and Techniques in Cyber Security and Intelligence; Springer: Berlin/Heidelberg, Germany, 2019; pp. 299–304. [Google Scholar]
  2. Brisco, B.; Brown, R.; Hirose, T.; McNairn, H.; Staenz, K. Precision agriculture and the role of remote sensing: A review. Can. J. Remote. Sens. 1998, 24, 315–327. [Google Scholar] [CrossRef]
  3. da Lomba Magalhães, M.J. Hyperspectral Image Fusion—A Comprehensive Review. Master’s Thesis, University of Eastern Finland, Kuopio, Finland, 2022. [Google Scholar]
  4. Yu, S.; Jia, S.; Xu, C. Convolutional neural networks for hyperspectral image classification. Neurocomputing 2017, 219, 88–98. [Google Scholar] [CrossRef]
  5. Signoroni, A.; Savardi, M.; Baronio, A.; Benini, S. Deep learning meets hyperspectral image analysis: A multidisciplinary review. J. Imaging 2019, 5, 52. [Google Scholar] [CrossRef]
  6. Shi, C.; Sun, J.; Wang, L. Hyperspectral image classification based on spectral multiscale convolutional neural network. Remote. Sens. 2022, 14, 1951. [Google Scholar] [CrossRef]
  7. Bouchoucha, R.; Braiek, H.B.; Khomh, F.; Bouzidi, S.; Zaatour, R. Robustness assessment of hyperspectral image CNNs using metamorphic testing. Inf. Softw. Technol. 2023, 162, 107281. [Google Scholar] [CrossRef]
  8. Li, Y.; Zhang, H.; Shen, Q. Spectral–spatial classification of hyperspectral imagery with 3D convolutional neural network. Remote. Sens. 2017, 9, 67. [Google Scholar] [CrossRef]
  9. Feng, F.; Wang, S.; Wang, C.; Zhang, J. Learning deep hierarchical spatial–spectral features for hyperspectral image classification based on residual 3D-2D CNN. Sensors 2019, 19, 5276. [Google Scholar] [CrossRef]
  10. Lu, Z.; Xu, B.; Sun, L.; Zhan, T.; Tang, S. 3-D channel and spatial attention based multiscale spatial–spectral residual network for hyperspectral image classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2020, 13, 4311–4324. [Google Scholar] [CrossRef]
  11. Li, C.; Qiu, Z.; Cao, X.; Chen, Z.; Gao, H.; Hua, Z. Hybrid dilated convolution with multi-scale residual fusion network for hyperspectral image classification. Micromachines 2021, 12, 545. [Google Scholar] [CrossRef]
  12. Gbodjo, Y.J.E.; Ienco, D.; Leroux, L.; Interdonato, R.; Gaetano, R.; Ndao, B. Object-based multi-temporal and multi-source land cover mapping leveraging hierarchical class relationships. Remote. Sens. 2020, 12, 2814. [Google Scholar] [CrossRef]
  13. Li, Q.; Zhong, R.; Du, X.; Du, Y. TransUNetCD: A hybrid transformer network for change detection in optical remote-sensing images. IEEE Trans. Geosci. Remote. Sens. 2022, 60, 5622519. [Google Scholar] [CrossRef]
  14. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  15. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–28 October 2018; pp. 3–19. [Google Scholar]
  16. Hong, D.; Zhang, B.; Li, X.; Li, Y.; Li, C.; Yao, J.; Yokoya, N.; Li, H.; Ghamisi, P.; Jia, X.; et al. SpectralGPT: Spectral remote sensing foundation model. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5227–5244. [Google Scholar] [CrossRef]
  17. Manas, O.; Lacoste, A.; Giró-i Nieto, X.; Vazquez, D.; Rodriguez, P. Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 9414–9423. [Google Scholar]
  18. He, X.; Chen, Y.; Huang, L.; Hong, D.; Du, Q. Foundation model-based multimodal remote sensing data classification. IEEE Trans. Geosci. Remote. Sens. 2023, 62, 5502117. [Google Scholar] [CrossRef]
  19. Guo, X.; Lao, J.; Dang, B.; Zhang, Y.; Yu, L.; Ru, L.; Zhong, L.; Huang, Z.; Wu, K.; Hu, D.; et al. Skysense: A multi-modal remote sensing foundation model towards universal interpretation for earth observation imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2024; pp. 27672–27683. [Google Scholar]
  20. Yan, Z.; Li, J.; Li, X.; Zhou, R.; Zhang, W.; Feng, Y.; Diao, W.; Fu, K.; Sun, X. RingMo-SAM: A foundation model for segment anything in multimodal remote-sensing images. IEEE Trans. Geosci. Remote. Sens. 2023, 61, 1–16. [Google Scholar] [CrossRef]
  21. Dong, H.; Ma, W.; Wu, Y.; Zhang, J.; Jiao, L. Self-supervised representation learning for remote sensing image change detection based on temporal prediction. Remote. Sens. 2020, 12, 1868. [Google Scholar] [CrossRef]
  22. Hou, S.; Shi, H.; Cao, X.; Zhang, X.; Jiao, L. Hyperspectral imagery classification based on contrastive learning. IEEE Trans. Geosci. Remote. Sens. 2021, 60, 1–13. [Google Scholar] [CrossRef]
  23. Huang, L.; Chen, Y.; He, X. Spectral–spatial masked transformer with supervised and contrastive learning for hyperspectral image classification. IEEE Trans. Geosci. Remote. Sens. 2023, 61, 1–18. [Google Scholar] [CrossRef]
  24. Hu, X.; Li, T.; Zhou, T.; Liu, Y.; Peng, Y. Contrastive learning based on transformer for hyperspectral image classification. Appl. Sci. 2021, 11, 8670. [Google Scholar] [CrossRef]
  25. Nalepa, J.; Myller, M.; Cwiek, M.; Zak, L.; Lakota, T.; Tulczyjew, L.; Kawulok, M. Towards on-board hyperspectral satellite image segmentation: Understanding robustness of deep learning through simulating acquisition conditions. Remote. Sens. 2021, 13, 1532. [Google Scholar] [CrossRef]
  26. Storch, T.; Honold, H.P.; Chabrillat, S.; Habermeyer, M.; Tucker, P.; Brell, M.; Ohndorf, A.; Wirth, K.; Betz, M.; Kuchler, M.; et al. The EnMAP imaging spectroscopy mission towards operations. Remote. Sens. Environ. 2023, 294, 113632. [Google Scholar] [CrossRef]
  27. Bandara, W.G.C.; Valanarasu, J.M.J.; Patel, V.M. Hyperspectral pansharpening based on improved deep image prior and residual reconstruction. IEEE Trans. Geosci. Remote. Sens. 2021, 60, 1–16. [Google Scholar] [CrossRef]
  28. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2020; pp. 1597–1607. [Google Scholar]
  29. Wu, P.; Cui, Z.; Gan, Z.; Liu, F. Three-dimensional resnext network using feature fusion and label smoothing for hyperspectral image classification. Sensors 2020, 20, 1652. [Google Scholar] [CrossRef] [PubMed]
  30. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26–30 June 2016; pp. 770–778. [Google Scholar]
  31. Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
  32. Oord, A.v.d.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748. [Google Scholar]
  33. Ohri, K.; Kumar, M. Review on self-supervised image recognition using deep neural networks. Knowl.-Based Syst. 2021, 224, 107090. [Google Scholar] [CrossRef]
  34. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738. [Google Scholar]
  35. Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments. Adv. Neural Inf. Process. Syst. 2020, 33, 9912–9924. [Google Scholar]
  36. Le-Khac, P.H.; Healy, G.; Smeaton, A.F. Contrastive representation learning: A framework and review. IEEE Access 2020, 8, 193907–193934. [Google Scholar] [CrossRef]
  37. Purushwalkam, S.; Gupta, A. Demystifying contrastive self-supervised learning: Invariances, augmentations and dataset biases. Adv. Neural Inf. Process. Syst. 2020, 33, 3407–3418. [Google Scholar]
  38. Robinson, J.; Chuang, C.Y.; Sra, S.; Jegelka, S. Contrastive learning with hard negative samples. arXiv 2020, arXiv:2010.04592. [Google Scholar]
  39. Li, W.; Feng, F.; Li, H.; Du, Q. Discriminant analysis-based dimension reduction for hyperspectral image classification: A survey of the most recent advances and an experimental comparison of different techniques. IEEE Geosci. Remote. Sens. Mag. 2018, 6, 15–34. [Google Scholar] [CrossRef]
  40. Kumar, B.; Dikshit, O.; Gupta, A.; Singh, M.K. Feature extraction for hyperspectral image classification: A review. Int. J. Remote. Sens. 2020, 41, 6248–6287. [Google Scholar] [CrossRef]
  41. Zhang, L.; Luo, F. Review on graph learning for dimensionality reduction of hyperspectral image. Geo-Spat. Inf. Sci. 2020, 23, 98–106. [Google Scholar] [CrossRef]
  42. Plaza, A.; Benediktsson, J.A.; Boardman, J.W.; Brazile, J.; Bruzzone, L.; Camps-Valls, G.; Chanussot, J.; Fauvel, M.; Gamba, P.; Gualtieri, A.; et al. Recent advances in techniques for hyperspectral image processing. Remote. Sens. Environ. 2009, 113, S110–S122. [Google Scholar] [CrossRef]
  43. Ungar, S.G.; Pearlman, J.S.; Mendenhall, J.A.; Reuter, D. Overview of the earth observing one (EO-1) mission. IEEE Trans. Geosci. Remote. Sens. 2003, 41, 1149–1159. [Google Scholar] [CrossRef]
  44. Yokoya, N.; Iwasaki, A. Airborne Hyperspectral Data over Chikusei; Technical Report SAL-2016-05-27; University of Tokyo: Tokyo, Japan, 2016; Volume 5. [Google Scholar]
  45. Zheng, Y.; Li, J.; Li, Y.; Guo, J.; Wu, X.; Chanussot, J. Hyperspectral pansharpening using deep prior and dual attention residual network. IEEE Trans. Geosci. Remote. Sens. 2020, 58, 8059–8076. [Google Scholar] [CrossRef]
  46. Singh, A.K.; Kumar, H.; Kadambi, G.R.; Kishore, J.; Shuttleworth, J.; Manikandan, J. Quality metrics evaluation of hyperspectral images. Int. Arch. Photogramm. Remote. Sens. Spat. Inf. Sci. 2014, 40, 1221–1226. [Google Scholar] [CrossRef]
  47. Deborah, H.; Richard, N.; Hardeberg, J.Y. A comprehensive evaluation of spectral distance functions and metrics for hyperspectral image processing. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2015, 8, 3224–3234. [Google Scholar] [CrossRef]
  48. Chaithra, C.; Taranath, N.; Darshan, L.; Subbaraya, C. A Survey on Image Fusion Techniques and Performance Metrics. In Proceedings of the 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA), IEEE, Coimbatore, India, 29–31 March 2018; pp. 995–999. [Google Scholar]
  49. He, L.; Zhu, J.; Li, J.; Plaza, A.; Chanussot, J.; Li, B. HyperPNN: Hyperspectral pansharpening via spectrally predictive convolutional neural networks. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2019, 12, 3092–3100. [Google Scholar] [CrossRef]
  50. Xu, S.; Zhang, J.; Zhao, Z.; Sun, K.; Liu, J.; Zhang, C. Deep gradient projection networks for pan-sharpening. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 21–25 June 2021; pp. 1366–1375. [Google Scholar]
  51. Bandara, W.G.C.; Patel, V.M. HyperTransformer: A textural and spectral feature fusion transformer for pansharpening. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1767–1777. [Google Scholar]
  52. Simoes, M.; Bioucas-Dias, J.; Almeida, L.B.; Chanussot, J. A convex formulation for hyperspectral image superresolution via subspace-based regularization. IEEE Trans. Geosci. Remote. Sens. 2014, 53, 3373–3388. [Google Scholar] [CrossRef]
  53. Yang, J.; Fu, X.; Hu, Y.; Huang, Y.; Ding, X.; Paisley, J. PanNet: A deep network architecture for pan-sharpening. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5449–5457. [Google Scholar]
  54. Lee, J.; Seo, S.; Kim, M. Sipsa-net: Shift-invariant pan sharpening with moving object alignment for satellite imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 21–25 June 2021; pp. 10166–10174. [Google Scholar]
  55. Green, R.O.; Eastwood, M.L.; Sarture, C.M.; Chrien, T.G.; Aronsson, M.; Chippendale, B.J.; Faust, J.A.; Pavri, B.E.; Chovit, C.J.; Solis, M.; et al. Imaging Spectroscopy and the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS). Remote. Sens. Environ. 1998, 65, 227–248. [Google Scholar] [CrossRef]
  56. Sun, H.; Zheng, X.; Lu, X.; Wu, S. Spectral–spatial attention network for hyperspectral image classification. IEEE Trans. Geosci. Remote. Sens. 2019, 58, 3232–3245. [Google Scholar] [CrossRef]
  57. Zhong, Z.; Li, J.; Luo, Z.; Chapman, M. Spectral–spatial residual network for hyperspectral image classification: A 3-D deep learning framework. IEEE Trans. Geosci. Remote. Sens. 2017, 56, 847–858. [Google Scholar] [CrossRef]
  58. Heo, B.; Yun, S.; Han, D.; Chun, S.; Choe, J.; Oh, S.J. Rethinking spatial dimensions of vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 11936–11945. [Google Scholar]
  59. Yang, X.; Cao, W.; Lu, Y.; Zhou, Y. Hyperspectral image transformer classification networks. IEEE Trans. Geosci. Remote. Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
  60. Sun, L.; Zhao, G.; Zheng, Y.; Wu, Z. Spectral–spatial feature tokenization transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote. Sens. 2022, 60, 1–14. [Google Scholar] [CrossRef]
  61. Zhang, J.; Zhang, Y.; Zhou, Y. Quantum-Inspired Spectral-Spatial Pyramid Network for Hyperspectral Image Classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 9925–9934. [Google Scholar]
  62. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  63. Ayush, K.; Uzkent, B.; Meng, C.; Tanmay, K.; Burke, M.; Lobell, D.; Ermon, S. Geography-aware self-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10181–10190. [Google Scholar]
  64. Ou, X.; Liu, L.; Tan, S.; Zhang, G.; Li, W.; Tu, B. A hyperspectral image change detection framework with self-supervised contrastive learning pretrained model. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2022, 15, 7724–7740. [Google Scholar] [CrossRef]
  65. Loncan, L.; de Almeida, L.B.; Bioucas-Dias, J.M.; Briottet, X.; Chanussot, J.; Dobigeon, N.; Fabre, S.; Liao, W.; Licciardi, G.A.; Simoes, M.; et al. Hyperspectral pansharpening: A review. IEEE Geosci. Remote. Sens. Mag. 2015, 3, 27–46. [Google Scholar] [CrossRef]
  66. Liang, T.; Glossner, J.; Wang, L.; Shi, S.; Zhang, X. Pruning and quantization for deep neural network acceleration: A survey. Neurocomputing 2021, 461, 370–403. [Google Scholar] [CrossRef]
Figure 1. General Overview of HyperKon System Architecture.
Figure 2. Illustration of HSI contrastive sampling.
Figure 3. Conceptual comparison of HSI vs. RGB perceptual loss.
Figure 4. Top-1 HSI retrieval accuracy achieved by various versions of the HyperKon model during the pretraining phase. The performance of each version is presented as a bar in the chart, illustrating how the integration of different components, such as 3D convolutions, DSC, the CBAM, and the SEB, affected the model’s accuracy. The chart underscores the superior performance of the SEB.
Figure 5. Top-1 HSI retrieval accuracy for the 3D Conv, SEB, and CBAM models following dimensionality reduction using PCA. The graph indicates a superior performance by the 3D convolution model.
Figure 6. Top-1 HSI retrieval accuracy for the 3D Conv, SEB, and CBAM models when manual band selection was employed. It shows an initial advantage for 3D Conv, but over time, the SEB and CBAM models catch up to similar levels of performance.
Figure 8. HyperKon image classification visualization for Indian Pines, Pavia University, and Salinas datasets. (a) Predicted classification map generated by HyperKon, (b) Predicted classification map with masked regions (showing only labeled areas), (c) predicted accuracy map: green for correct predictions, red for incorrect predictions, and black for unlabeled areas, (d) ground-truth classification map, and (e) original RGB image.
Figure 9. Zoom-out predicted accuracy map for Indian Pines: green for correct predictions, red for incorrect predictions, and black for unlabeled areas.
Figure 9. Zoom-out predicted accuracy map for Indian Pines: green for correct predictions, red for incorrect predictions, and black for unlabeled areas.
Remotesensing 16 03399 g009
Table 1. A comparison to popular hyperspectral datasets *.
Dataset | Number of Bands | Size | Spectral Range | Number of Images | Spatial Resolution | Imaging Location | Platform Type
Indian Pines | 200 | 145 × 145 | 400–2500 nm | 1 | 30 m | Indiana, USA | Airborne
Pavia Centre | 102 | 1096 × 1096 | 430–860 nm | 1 | 1.3 m | Pavia, Italy | Airborne
Salinas | 204 | 512 × 217 | 360–2500 nm | 1 | 3.7 m | Salinas Valley, CA, USA | Airborne
Harvard | 31 | 1392 × 1040 | 420–720 nm | 50 | - | Harvard, USA | Airborne
Botswana | 145 | 1476 × 256 | 400–2500 nm | 1 | 30 m | Botswana | Airborne
Chikusei | 100 | 2517 × 2335 | 263–1018 nm | 1 | 2.5 m | Chikusei, Japan | Airborne
EnHyperSet-1 | 224 | 1300 × 1200 | 420–2450 nm | 800 | 30 m | Global, on demand | Spaceborne
* Best values are in bold.
Table 4. Comparative analysis of performance metrics and computational requirements for various hyperspectral image classification methods across three datasets *.
Dataset | Metric | SSAN [56] | SSRN [57] | RvT [58] | HiT [59] | SSFTT [60] | QSSPN-3 [61] | HyperKon
IP | OA (%) | 89.46 | 91.85 | 83.85 | 90.59 | 96.35 | 95.87 | 98.77
IP | AA (%) | 85.99 | 81.51 | 79.67 | 86.71 | 89.99 | 96.40 | 97.82
IP | Kappa (%) | 88.04 | 90.73 | 81.68 | 89.27 | 95.82 | 95.34 | 98.60
IP | Params. | 148.83 K | 735.88 K | 10.78 M | 49.60 M | 148.50 K | 910.50 K | 5.54 M
IP | FLOPs | 7.88 M | 212.48 M | 17.83 M | 345.88 M | 3.66 M | 34.54 M | 1.27 G
PU | OA (%) | 99.15 | 99.63 | 97.37 | 99.43 | 99.52 | 99.71 | 99.89
PU | AA (%) | 98.70 | 99.29 | 95.86 | 99.09 | 99.20 | 99.43 | 99.76
PU | Kappa (%) | 98.87 | 99.51 | 96.52 | 99.24 | 99.36 | 99.61 | 99.86
PU | Params. | 94.63 K | 396.99 K | 9.77 M | 42.41 M | 148.03 K | 609.16 K | 4.08 M
PU | FLOPs | 5.57 M | 108.04 M | 16.83 M | 190.85 M | 3.66 M | 10.25 M | 370.59 M
SA | OA (%) | 98.92 | 99.31 | 98.11 | 99.38 | 99.53 | 99.66 | 100.00
SA | AA (%) | 99.33 | 99.70 | 98.83 | 99.70 | 99.72 | 99.81 | 100.00
SA | Kappa (%) | 98.80 | 99.23 | 97.90 | 99.31 | 99.47 | 99.63 | 100.00
SA | Params. | 149.71 K | 750 K | 10.82 M | 50 M | 148.50 K | 926.90 K | 5.62 M
SA | FLOPs | 7.97 M | 216.84 M | 17.80 M | 354.42 M | 3.66 M | 35.87 M | 1.32 G
* Best values are in bold, 2nd best values are underlined.
Table 5. Classwise accuracies and performance metrics for the HyperKon model on Indian Pines, Pavia University, and Salinas datasets.
Indian Pines | | Pavia University | | Salinas |
Class | Acc. (%) | Class | Acc. (%) | Class | Acc. (%)
Alfalfa | 100.00 | Asphalt | 100.00 | Broccoli_green_weeds_1 | 100.00
Corn-notill | 99.72 | Meadows | 100.00 | Broccoli_green_weeds_2 | 100.00
Corn-mintill | 99.16 | Gravel | 100.00 | Fallow | 100.00
Corn | 100.00 | Trees | 99.87 | Fallow_rough_plow | 100.00
Grass-pasture | 99.59 | Painted metal sheets | 100.00 | Fallow_smooth | 100.00
Grass-trees | 100.00 | Bare Soil | 100.00 | Stubble | 100.00
Grass-pasture-mowed | 100.00 | Bitumen | 99.85 | Celery | 100.00
Hay-windrowed | 100.00 | Self-blocking bricks | 99.95 | Grapes_untrained | 100.00
Oats | 95.00 | Shadows | 99.58 | Soil_vinyard_develop | 100.00
Soybean-notill | 99.69 | | | Corn_senesced_green | 100.00
Soybean-mintill | 99.80 | | | Lettuce_romaine_4wk | 100.00
Soybean-clean | 99.49 | | | Lettuce_romaine_5wk | 100.00
Wheat | 100.00 | | | Lettuce_romaine_6wk | 100.00
Woods | 99.29 | | | Lettuce_romaine_7wk | 100.00
Buildings-grass-trees-drives | 100.00 | | | Vinyard_untrained | 100.00
Stone-steel-towers | 100.00 | | | Vinyard_vertical | 100.00
Performance Metrics
Total training time (s) | 2346.18 | Total training time (s) | 9251.00 | Total training time (s) | 12,172.18
Total test time (s) | 52.07 | Total test time (s) | 513.87 | Total test time (s) | 294.16
Avg. inference time (s) | 0.0009 | Avg. inference time (s) | 0.0009 | Avg. inference time (s) | 0.0010
Throughput (samples/s) | 1088.39 | Throughput (samples/s) | 1078.45 | Throughput (samples/s) | 1038.11