1. Introduction
Hyperspectral images (HSIs) typically consist of tens or even hundreds of spectral bands, providing a richer source of spectral information. This abundance of information enables hyperspectral images to effectively distinguish different objects, materials, and land cover types. Thanks to these characteristics, hyperspectral images have been widely applied in numerous computer vision tasks, including medical image processing [1,2], object tracking [3], mineral exploration [4], hyperspectral anomaly detection [5], plant detection [6], etc. Hyperspectral imaging typically requires high spectral resolution, resulting in a large amount of data that needs to be collected. However, due to the hardware limitations of sensors and acquisition equipment, spatial resolution is often sacrificed to obtain higher spectral resolution. To better serve downstream visual tasks, enhancing the spatial resolution of hyperspectral images has become a crucial research topic.
Hyperspectral super-resolution is an effective technique for enhancing the spatial resolution of hyperspectral images, allowing for the reconstruction of low-resolution hyperspectral images into higher-resolution hyperspectral images. Hyperspectral super-resolution methods can be categorized into two types based on their use of auxiliary information: (1) fusion-based hyperspectral image super-resolution and (2) single hyperspectral image super-resolution. The former requires additional auxiliary information, such as RGB, panchromatic images (PAN) [7,8], multispectral images (MSI) [9,10], etc., to enhance spatial resolution, while the latter relies solely on a single low-resolution hyperspectral image to restore its corresponding high-resolution counterpart. While fusion-based hyperspectral image super-resolution can achieve good performance, it is limited by the requirement of acquiring auxiliary images of the same scene as the low-resolution image, which increases the complexity of the task. Therefore, research on single-image super-resolution is more aligned with the practical needs of real-world scenarios.
Over the past few decades, a multitude of remarkable single hyperspectral image super-resolution methods have been proposed. Akgun et al. [11] improved the spatial resolution of hyperspectral images by modeling the image acquisition process. Li et al. [12] considered the sparsity of spectral decomposition and the repetitiveness of spatial-spectral blocks while proposing a hyperspectral super-resolution architecture based on spectral mixture and spatial-spectral group sparsity. Wang et al. [13] proposed a tensor-based super-resolution method that models the intrinsic characteristics of hyperspectral images. However, these methods rely on handcrafted priors, which are difficult to optimize. Recently, with the rapid advancement of deep learning, deep neural networks have showcased remarkable non-linear fitting capabilities. Consequently, natural image super-resolution methods [14,15,16,17,18,19,20,21] based on convolutional neural networks (CNNs) have garnered significant attention. Unlike natural images, hyperspectral images exhibit spectral continuity, where neighboring spectral bands are often correlated. Therefore, 3D convolution is frequently employed to extract features from hyperspectral images. Mei et al. [22] constructed a fully 3D convolutional neural network to perform hyperspectral super-resolution. Yang et al. [23] proposed a multi-scale wavelet 3D CNN by modulating wavelets with 3D CNN to improve the restoration of details. While 3D CNN has powerful representational capabilities, it is often accompanied by a substantial computational burden and a large number of parameters. Thus, Li et al. [24] proposed MCNet, which stacks hybrid modules modulating 2D and 3D convolutions to extract both spatial and spectral features. Furthermore, Li et al. proposed ERCSR to address the issue of parallel structure redundancy in MCNet [25]. Zhang et al. [26] proposed a multi-scale network that utilizes wavelet transform and multi-scale feature fusion to learn features across different spectral bands. In ref. [27], researchers designed a multi-domain feature learning strategy based on 2D/3D units to integrate information from different layers. Tang et al. [28] proposed FRLGN, which incorporates a feedback structure to propagate high-level information for guiding the generation of low-level features. Zhang et al. [29] explored the coupling relationship between the spectral and spatial domains, then utilized spectral high-frequency information to improve channel and spatial attention.
While CNN-based methods have achieved impressive results, they still face the following issue: the widespread use of convolution kernels with a size of 3 in these CNN-based methods severely limits the receptive field, hindering the model from considering a wider range of contextual information. While the receptive field is theoretically expected to increase as the network deepens, in practice the effective receptive field is often much smaller than the theoretical one [30]. Directly increasing the kernel size could enlarge the receptive field. However, it also leads to a significant increase in parameters and computational complexity. This is especially impractical for 3D convolutions.
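To make the cost argument concrete, the following minimal sketch counts the weights of a dense 3D convolution with a cubic kernel. The channel width of 64 is a hypothetical value chosen for illustration and is not taken from the paper; the function name is likewise illustrative.

```python
def conv3d_params(kernel: int, c_in: int, c_out: int, bias: bool = False) -> int:
    """Weight count of a dense 3D convolution with a cubic kernel."""
    weights = kernel ** 3 * c_in * c_out
    return weights + (c_out if bias else 0)


if __name__ == "__main__":
    channels = 64  # hypothetical channel width, not a value from the paper
    for k in (3, 5, 7, 9):
        print(f"kernel {k}: {conv3d_params(k, channels, channels)} weights")
```

Because the count grows with the cube of the kernel size, moving from a kernel of size 3 to a kernel of size 9 multiplies the number of weights by 27 at a fixed channel width.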
Recently, transformers [31] in the field of natural language processing have gained increasing popularity due to their powerful long-range modeling capabilities. Moreover, ViT [32] has successfully extended the application of transformers to the field of computer vision, leading to the emergence of numerous outstanding vision transformer models [33,34,35,36,37,38,39,40,41]. Vision transformers are primarily designed to capture long-range dependencies in the spatial domain. However, as shown in Figure 1, each spectral band of a hyperspectral image exhibits significant sparsity in the spatial domain, with many regions lacking meaningful information. Performing vanilla self-attention calculations on sparse spatial domains would result in significant wastage of computational resources. Consequently, vision transformers originally designed for natural images need to be further improved to adapt to hyperspectral images.
In this paper, we design a hybrid architecture named HyFormer to integrate the advantages of CNN and transformer for hyperspectral super-resolution tasks. CNN and transformer extract features from different perspectives. Specifically, the transformer branch could achieve intra-spectra interaction for fine-grained contextual details on each specific wavelength. The CNN branch helps to conduct efficient inter-spectra feature extraction among different wavelengths while maintaining a large receptive field. By effectively modulating the two types of features, their advantages can complement each other, thereby enhancing the modeling capability of the model.
In the transformer branch, we present a novel Grouping-Aggregation transformer (GAT) that aims to capture intra-spectra interactions of object-specific contextual information from a spectral perspective. By decomposing the spectra of each wavelength, diverse contextual details of objects are implicitly expressed in different channels. The proposed GAT considers each channel as an individual token and focuses on modeling long-range dependencies among different details within the spectra of the same wavelength. Moreover, self-attention can consider all positions in the spatial domain to further capture global dependencies. During the computation of self-attention, the GAT employs a novel grouping-aggregation self-attention, which consists of grouping self-attention and aggregation self-attention. Specifically, grouping self-attention is employed to extract features from rich details coming from different channels, while the grouping mechanism is utilized to maintain lower computational complexity. On the other hand, our aggregation self-attention aims to fuse features from different channels to facilitate the exchange of information across channels. The GAT modulates dual self-attention to adaptively model fine-grained contextual details.
In the CNN branch, we propose a Wide-Spanning Separable 3D Attention (WSSA) to enhance the receptive field of the model while keeping the parameter count low. This method stacks a set of specifically designed small-kernel 3D convolutions to simulate the same receptive field as a large-kernel 3D convolution, and it proceeds in three steps. Firstly, we simulate a large-kernel 3D convolution using the concept of depthwise separable convolution to reduce the parameter number, cascading a pointwise convolution and a large-kernel depthwise 3D convolution. Subsequently, the large-kernel depthwise 3D convolution is decomposed into a small-kernel depthwise 3D convolution and a small-kernel dilated depthwise 3D convolution. Finally, these two small-kernel depthwise 3D convolutions are further separated in the spatial and spectral dimensions to reduce computational complexity. Unlike 2D pixel attention, which solely considers spatial dimensions, WSSA preserves the inherent spatial-spectral consistency information. Based on WSSA, we construct a wide-spanning CNN module to extract inter-spectra features among different wavelengths while achieving a large receptive field to consider a wider range of contextual information.
Our hybrid architecture enables adaptive feature interactions between the CNN and transformer modules at each layer to facilitate the fusion of various features.
In summary, the contributions of this paper are as follows:
- We propose a novel Grouping-Aggregation transformer (GAT) to capture intra-spectra interactions of object-specific contextual information from a spectral perspective. By modulating grouping self-attention and aggregation self-attention, GAT can adaptively model fine-grained contextual details. 
- We introduce Wide-Spanning Separable 3D Attention (WSSA) to explore inter-spectra feature extraction. It significantly enhances the receptive field while only increasing minimal additional parameters. 
- We design a hybrid architecture named HyFormer that combines the strengths of both CNN and transformer structures. HyFormer adaptively fuses features extracted by the CNN and transformer components, resulting in improved reconstruction outcomes. Extensive experimental results demonstrate the superiority of HyFormer over state-of-the-art methods.
2. Proposed Methods
In this section, we provide a detailed description of our method. The goal of hyperspectral image super-resolution is to restore a high-resolution hyperspectral image from a low-resolution one. Let $I_{LR}$ and $I_{HR}$ denote the low-resolution and high-resolution hyperspectral images, respectively, each with $S$ spectral bands. We use $I_{SR}$ to denote the restored hyperspectral image. Therefore, the process of super-resolution can be represented as follows:

$$I_{SR} = \mathcal{H}_{Net}(I_{LR}),$$

where $\mathcal{H}_{Net}(\cdot)$ denotes the proposed network.
2.1. Overall Network
The proposed HyFormer is illustrated in Figure 2. The top part represents the transformer branch, while the bottom part represents the CNN branch. There are interactions between the two branches at each layer to exchange information with different characteristics. At the ends of branches, the features extracted from both branches are adaptively fused to enhance the representation.
Initially, we employ a separable 3D convolution [42], denoted $f_{s3D}$, to extract shallow-level features from the input low-resolution (LR) image $I_{LR}$. Separable 3D convolutions have been demonstrated to perform similarly to conventional 3D convolutions while offering reduced computational complexity [42]. This process can be represented as follows:

$$F_0 = f_{s3D}(R_{3D}(I_{LR})),$$

where $F_0$ denotes the extracted shallow-level features with $C$ channels, and $R_{3D}(\cdot)$ reshapes $I_{LR}$ into the layout required by the 3D convolution. Next, the feature $F_0$ is simultaneously fed into the transformer and CNN branches to extract different types of features. For the transformer branch, $F_0$ first needs to be reshaped into the token layout by $R_T(\cdot)$. The feature extraction process for both branches in the first layer can be represented as follows:

$$F_T^1 = \mathcal{T}_1(R_T(F_0)), \qquad F_C^1 = \mathcal{C}_1(F_0),$$

where $\mathcal{T}_1$ and $\mathcal{C}_1$ respectively denote the first transformer and CNN modules, and $F_T^1$ and $F_C^1$ are the extracted features. In the subsequent feature extraction layers, we perform adaptive fusion of the two types of features so that they complement each other. Learnable coefficients, initialized to 1, adaptively adjust the feature fusion ratio. The fused features are then fed into convolution layers to reduce the channel number, and a reshape operation converts the features of one branch into the layout used by the other.
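The role of the learnable fusion coefficients can be illustrated with a small sketch. It assumes an element-wise weighted sum over features that already share one layout; the source may instead concatenate the features and reduce channels with a convolution, so the class name `WeightedFusion` and its fields are purely hypothetical. In a real model the two weights would be trainable parameters.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WeightedFusion:
    """Weighted sum of two feature vectors (illustrative stand-in only)."""

    own_weight: float = 1.0    # initial value 1, as stated in the text
    other_weight: float = 1.0  # initial value 1, as stated in the text

    def __call__(self, own: List[float], other: List[float]) -> List[float]:
        if len(own) != len(other):
            raise ValueError("features must share the same layout")
        return [self.own_weight * a + self.other_weight * b
                for a, b in zip(own, other)]


if __name__ == "__main__":
    fuse = WeightedFusion()
    print(fuse([1.0, 2.0], [0.5, 0.5]))
```

With both weights at their initial value of 1, the two branches contribute equally; training can then shift the ratio toward whichever branch is more useful at a given layer.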
To enable the network to learn more informative representations, we perform feature fusion at the end of both the transformer and CNN branches. During this fusion process, learnable coefficients, initialized to 1, are employed to control the fusion ratio.
The transformer branch focuses on capturing intra-spectra interactions at each specific wavelength, whereas the CNN branch facilitates efficient inter-spectra feature extraction among different wavelengths. Consequently, we fuse these two features to enhance the representation. Lastly, we employ deconvolution layers to perform upsampling, thereby increasing the spatial resolution. Due to the significant similarity between the LR input image and the SR output image, we add the bicubic-interpolated LR image to the output to guide the model to focus on high-frequency residual features. In this final reconstruction, learnable coefficients, initialized to 1, adjust the feature fusion ratio, a separable 3D convolution decreases the channel number, and a deconvolution increases the spatial resolution.
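The global residual idea, in which an interpolated copy of the LR input is added to the network output, can be sketched for a single band as follows. Nearest-neighbour upsampling is used here only as a simple stand-in for the bicubic interpolation named in the text, and the function names are hypothetical.

```python
from typing import List

Band = List[List[float]]


def upsample_nearest(band: Band, scale: int) -> Band:
    """Nearest-neighbour upsampling (stand-in for bicubic interpolation)."""
    rows, cols = len(band), len(band[0])
    return [[band[r // scale][c // scale] for c in range(cols * scale)]
            for r in range(rows * scale)]


def reconstruct(lr_band: Band, residual: Band, scale: int) -> Band:
    """Add a predicted high-frequency residual to the upsampled LR band."""
    base = upsample_nearest(lr_band, scale)
    if len(residual) != len(base) or len(residual[0]) != len(base[0]):
        raise ValueError("residual must match the upsampled size")
    return [[b + r for b, r in zip(base_row, res_row)]
            for base_row, res_row in zip(base, residual)]


if __name__ == "__main__":
    lr = [[1.0, 2.0], [3.0, 4.0]]
    zero_residual = [[0.0] * 4 for _ in range(4)]
    for row in reconstruct(lr, zero_residual, scale=2):
        print(row)
```

With a zero residual the output is just the interpolated input, which is why the network only has to learn the missing high-frequency detail.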
2.2. Grouping-Aggregation Transformer Module
The transformer models have shown remarkable capability in capturing long-range dependencies in the spatial domain. However, these transformer models designed for natural images are not suitable for hyperspectral images due to their specific characteristics. Hyperspectral images exhibit sparsity in the spatial domain. Therefore, dense self-attention in the vanilla transformer would lead to numerous inefficient computations.
Based on the analysis mentioned above, we propose a grouping-aggregation transformer (GAT) to capture intra-spectra interactions of object-specific contextual information from a spectral perspective. During the extraction of shallow features, 3D convolution is applied to decompose each specific wavelength of the spectral data, thereby implicitly expressing diverse texture information across different channels. Our GAT treats each channel as a token and performs self-attention in the channel dimension to capture intra-spectra interactions for fine-grained contextual details on each specific wavelength. Meanwhile, self-attention allows for the consideration of all positions within the spatial domain, enabling the model to capture global dependencies more effectively.
We improve standard self-attention by introducing grouping-aggregation self-attention (GASA). The GASA is composed of grouping self-attention (GSA) and aggregation self-attention (ASA). To be more specific, we assign half of the channels to grouping self-attention, which is used to extract rich details from different channels. Meanwhile, the grouping mechanism ensures that the computation of self-attention is only performed within each group, reducing the computational cost of self-attention. The remaining half of the channels are dedicated to aggregation self-attention, which utilizes aggregation to merge features from different channels, allowing interaction among features of different textures. Subsequently, the two extracted features are concatenated to align the channel dimensions while achieving modulation of features from both self-attentions to adaptively model long-range channel dependencies from a spectral perspective.
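The saving from restricting attention to groups can be estimated by counting pairwise attention scores. The sketch below is a back-of-the-envelope count only; the image size, channel count, and group count are hypothetical values, not settings from the paper.

```python
def attention_map_entries(tokens: int, groups: int = 1) -> int:
    """Pairwise scores when tokens are split into equal groups and
    attention is computed only inside each group."""
    if groups < 1 or tokens % groups != 0:
        raise ValueError("tokens must be divisible by groups")
    per_group = tokens // groups
    return groups * per_group * per_group


if __name__ == "__main__":
    height = width = 64  # hypothetical spatial size
    channels = 64        # hypothetical channel count
    k = 4                # hypothetical number of groups
    print("spatial tokens:", attention_map_entries(height * width))
    print("channel tokens:", attention_map_entries(channels))
    print("grouped channel tokens:", attention_map_entries(channels, k))
```

Treating channels as tokens makes the attention map independent of the spatial size, and splitting the channels into k groups shrinks the map by a further factor of k.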
Figure 3 illustrates the workflow of grouping-aggregation self-attention. Let $X$ denote the input features of GAT. For grouping self-attention, we linearly map $X$ to obtain the query $Q_g$, key $K_g$, and value $V_g$:

$$Q_g = X W_g^Q, \quad K_g = X W_g^K, \quad V_g = X W_g^V,$$

where $W_g^Q$, $W_g^K$, and $W_g^V$ are learnable parameters. Subsequently, we equally divide the $Q_g$, $K_g$, and $V_g$ tensors by channels into $k$ different groups, ensuring that the self-attention calculation is performed only within each group.
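The grouping step can be sketched in plain Python. This is a minimal illustration that assumes the standard scaled dot-product form of attention with each channel treated as one token; the scaling factor, the function names, and the use of precomputed query, key, and value rows are assumptions rather than details confirmed by the source.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def channel_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Scaled dot-product attention; every row (one channel) is a token."""
    scale = math.sqrt(len(q[0]))  # assumed scaling, as in vanilla attention
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
        weights = softmax(scores)
        out.append([sum(w * v_row[n] for w, v_row in zip(weights, v))
                    for n in range(len(v[0]))])
    return out


def grouping_self_attention(q: Matrix, k: Matrix, v: Matrix, groups: int) -> Matrix:
    """Split channels into equal groups and attend only within each group."""
    channels = len(q)
    if groups < 1 or channels % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    size = channels // groups
    out: Matrix = []
    for start in range(0, channels, size):
        block = slice(start, start + size)
        out.extend(channel_attention(q[block], k[block], v[block]))
    return out


if __name__ == "__main__":
    x = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]]  # 4 channels, 2 positions
    for row in grouping_self_attention(x, x, x, groups=2):
        print(row)
```

Each output row is a convex combination of the value rows in its own group, so channels in different groups never exchange information at this stage; that exchange is left to the aggregation self-attention.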
For aggregation self-attention, we map $X$ to get the corresponding query, key, and value using learnable parameters together with a convolution. This convolution aggregates diverse features from different channels while decreasing the channel number from $C$ to a smaller value set by a hyperparameter that controls the aggregation ratio. Self-attention is then computed on the aggregated features.
Next, we concatenate the outputs of grouping self-attention and aggregation self-attention, denoted $Z_g$ and $Z_a$, to incorporate the features extracted by the dual self-attention:

$$Z = \mathrm{Concat}(Z_g, Z_a),$$

where $Z$ is the feature representation extracted by grouping-aggregation self-attention. As shown in Figure 4, after grouping-aggregation self-attention, we utilize a multi-layer perceptron (MLP) with the non-linear activation function GELU [43] to enhance the representations. Before both the grouping-aggregation self-attention and the MLP, we apply a LayerNorm layer [44] for normalization. The transformer module therefore applies, in order, LayerNorm, grouping-aggregation self-attention, LayerNorm, and the MLP.
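One compact way to write this module order is the usual pre-norm residual layout shown below. The residual connections and the symbols $X$, $X'$, and $Y$ are assumptions made for illustration and are not confirmed by the source; $\mathrm{LN}$, $\mathrm{GASA}$, and $\mathrm{MLP}$ stand for LayerNorm, grouping-aggregation self-attention, and the multi-layer perceptron.

```latex
\begin{aligned}
X' &= X + \mathrm{GASA}\big(\mathrm{LN}(X)\big), \\
Y  &= X' + \mathrm{MLP}\big(\mathrm{LN}(X')\big).
\end{aligned}
```

Here $X$ is the module input, $X'$ the intermediate feature after self-attention, and $Y$ the module output.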
2.3. Wide-Spanning CNN Module
The 3D CNN has shown promising performance in hyperspectral image super-resolution, as it can extract features while preserving spatial-spectral consistency. It is common practice in previous works to utilize a $3 \times 3 \times 3$ convolution kernel in 3D convolution, which inherently restricts the receptive field of the model. This limitation becomes particularly problematic for hyperspectral images, as the sparse nature of the spatial domain necessitates a larger receptive field. A small receptive field hampers the model's ability to comprehend context, which can limit performance. Directly increasing the size of the convolution kernel can expand the receptive field, but it comes with a significant increase in parameters and computational complexity, making it an impractical approach. In this paper, we propose a novel approach called wide-spanning separable 3D attention to address the aforementioned issue.
The wide-spanning separable 3D attention is shown in the bottom right corner of Figure 5. We stack a set of specifically designed small-kernel 3D convolutions to simulate the same receptive field as a large-kernel 3D convolution.
In this paper, we simulate the large-kernel 3D convolution through three steps. First, because depthwise separable convolution can significantly reduce the parameters and computational complexity, we simulate a $K \times K \times K$ 3D convolution by cascading a $1 \times 1 \times 1$ pointwise convolution and a $K \times K \times K$ depthwise 3D convolution. Second, we further decompose the $K \times K \times K$ depthwise 3D convolution into a $k_1 \times k_1 \times k_1$ depthwise 3D convolution and a $k_2 \times k_2 \times k_2$ dilated depthwise 3D convolution with a dilation factor of $d$, which achieves the same receptive field as a $K \times K \times K$ depthwise 3D convolution when $K = k_1 + (k_2 - 1)d$. Figure 6 illustrates this process of simulating the large receptive field.
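The receptive field obtained by stacking a small depthwise kernel and a small dilated depthwise kernel can be checked along one axis with a few lines of Python. The kernel sizes and dilation below are hypothetical examples, not the values used in the paper, and the function names are illustrative.

```python
from typing import Set


def stacked_receptive_field(k_dense: int, k_dilated: int, dilation: int) -> int:
    """Closed form for a dense kernel followed by a dilated kernel (per axis)."""
    return k_dense + (k_dilated - 1) * dilation


def reached_offsets(k_dense: int, k_dilated: int, dilation: int) -> Set[int]:
    """Input offsets that influence one output when the kernels are composed."""
    dense_taps = range(k_dense)
    dilated_taps = [i * dilation for i in range(k_dilated)]
    return {a + b for a in dense_taps for b in dilated_taps}


if __name__ == "__main__":
    k1, k2, d = 3, 3, 3  # hypothetical sizes for illustration
    offsets = reached_offsets(k1, k2, d)
    span = max(offsets) - min(offsets) + 1
    print("closed form:", stacked_receptive_field(k1, k2, d))
    print("measured span:", span, "gap-free:", len(offsets) == span)
```

The span is covered without gaps only when the dense kernel is at least as wide as the dilation factor, which is one reason to pair the dilated convolution with an undilated one.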
However, 3D convolutions inherently still have a relatively large number of parameters. Separable 3D convolution has been proven to have similar effects to vanilla 3D convolution [42]. Therefore, we adopt a similar concept to further decompose the depthwise 3D convolution and the dilated depthwise 3D convolution mentioned earlier along the spatial and spectral dimensions. Specifically, the depthwise 3D convolution is simulated by a pair of depthwise 3D convolutions, one acting only on the spatial dimensions and one acting only on the spectral dimension. Likewise, the dilated depthwise 3D convolution is simulated by a spatial and a spectral dilated depthwise 3D convolution, each with its own dilation factor. As a result, the parameter number is significantly reduced, making the approach more practical.
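The parameter savings of the three decomposition steps can be compared with a simple weight count. The sketch below ignores biases and assumes equal input and output channel counts; the channel width and kernel sizes are hypothetical values chosen so that the stacked kernels span the same extent as the large kernel.

```python
def dense_params(big_k: int, channels: int) -> int:
    """Dense 3D convolution with a cubic kernel of size big_k."""
    return channels * channels * big_k ** 3


def depthwise_separable_params(big_k: int, channels: int) -> int:
    """Step 1: pointwise conv plus one large depthwise 3D conv."""
    return channels * channels + channels * big_k ** 3


def decomposed_params(k1: int, k2: int, channels: int) -> int:
    """Step 2: pointwise conv plus small depthwise and dilated depthwise convs."""
    return channels * channels + channels * (k1 ** 3 + k2 ** 3)


def separated_params(k1: int, k2: int, channels: int) -> int:
    """Step 3: each small depthwise conv split into spatial and spectral parts."""
    per_channel = (k1 * k1 + k1) + (k2 * k2 + k2)
    return channels * channels + channels * per_channel


if __name__ == "__main__":
    c, k1, k2, d = 64, 3, 3, 3   # hypothetical settings
    big_k = k1 + (k2 - 1) * d    # extent spanned by the stacked kernels
    print("dense:", dense_params(big_k, c))
    print("depthwise separable:", depthwise_separable_params(big_k, c))
    print("decomposed:", decomposed_params(k1, k2, c))
    print("separated:", separated_params(k1, k2, c))
```

Under these assumptions the pointwise convolution dominates the final count, so enlarging the receptive field adds only a small number of weights on top of it.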
To effectively extract inter-spectra features among different wavelengths while achieving a large receptive field to consider a wider range of contextual information, we design a wide-spanning CNN module that is applied to the CNN branch. The structure of our CNN module is illustrated in Figure 5. We employ both 3D convolution and 2D convolution to extract features and apply the proposed wide-spanning separable 3D attention at the end of the module to expand the receptive field. Skip connections are employed between each component to enhance the flow of information.
As mentioned earlier, separable 3D convolution provides similar performance to regular 3D convolution. Thus, in our CNN module, we employ separable 3D convolution to extract local spatial and spectral features simultaneously. It preserves spatial-spectral consistency, which is beneficial for restoring physically meaningful spectral curves. Furthermore, to enhance the extraction of spatial information, we incorporate 2D CNN units following the 3D convolution to explore spatial features. Thanks to the inclusion of our wide-spanning separable 3D attention, the model has a wider receptive field, allowing it to extract spatial and spectral features within a larger spatial-spectral context. In summary, the CNN module applies the separable 3D convolution, then the 2D convolution unit, and finally the proposed wide-spanning separable 3D attention.