Difference Curvature Multidimensional Network for Hyperspectral Image Super-Resolution

Abstract: In recent years, convolutional-neural-network-based methods have been introduced to the field of hyperspectral image super-resolution following their great success in the field of RGB image super-resolution. However, hyperspectral images differ from RGB images in that they have high dimensionality, which implies redundancy in the high-dimensional space. Existing approaches struggle to learn the spectral correlation and spatial priors, leading to inferior performance. In this paper, we present a difference curvature multidimensional network for hyperspectral image super-resolution that exploits the spectral correlation to help improve the spatial resolution. Specifically, we introduce a multidimensional enhanced convolution (MEC) unit into the network to learn the spectral correlation through a self-attention mechanism. Meanwhile, it reduces the redundancy in the spectral dimension via a bottleneck projection to condense useful spectral features and reduce computations. To remove the unrelated information in high-dimensional space and extract the delicate texture features of a hyperspectral image, we design an additional difference curvature branch (DCB), which works as an edge indicator to preserve the texture information and eliminate unwanted noise. Experiments on three publicly available datasets demonstrate that the proposed method recovers sharper images with less spectral distortion than state-of-the-art methods: PSNR and SAM are 0.3–0.5 dB and 0.2–0.4 better, respectively, than those of the second-best methods.


Introduction
Obtained from hyperspectral sensors, a hyperspectral image (HSI) is a collection of tens to hundreds of images of the same area at different wavelengths. It contains three-dimensional hyperspectral (x, y, λ) data, where x and y represent the horizontal and vertical spatial dimensions, respectively, and λ represents the spectral dimension. Compared to earlier imaging techniques such as multispectral imaging, hyperspectral imaging has much narrower bands, resulting in a higher spectral resolution. Hyperspectral remote sensing imagery has been studied widely for target detection, classification, and feature analysis, and has many practical applications in mineralogy, agriculture, medicine, and other fields [1][2][3][4][5][6][7]. Consequently, a higher spatial-spectral resolution of hyperspectral images allows surface features to be explored and classified more efficiently.
To ensure the reception of high-quality signals with a sufficiently high signal-to-noise ratio, there is a trade-off between the spatial and spectral resolution of the imaging process [8][9][10]. Accordingly, HSIs are often acquired at a relatively low spatial resolution, which impedes the perception of details and the learning of discriminative structural features, as well as further analysis in related applications. Because they recover spatial features economically, post-processing techniques such as super-resolution are an ideal way to restore details from a low-resolution hyperspectral image.
Generally, there exist two prevailing approaches to enhancing the spatial details, i.e., fusion-based HSI super-resolution and single HSI super-resolution. For the former category, Palsson et al. [11] proposed a 3D convolutional network for HSI super-resolution by incorporating an HSI and a multispectral image. Han et al. [12] first combined a bicubic up-sampled low-resolution HSI with a high-resolution RGB image in a CNN. Dian et al. [13] proposed a CNN-denoiser-based method for hyperspectral and multispectral image fusion, which overcomes the difficulty of insufficient training data and has achieved outstanding performance. Nevertheless, auxiliary multispectral images have fewer spectral bands than HSIs, which causes spectral distortion in the reconstructed images. To address these drawbacks, some improvements have been made in subsequent works [14][15][16][17][18]. However, the premise is that the two images are aligned; otherwise, the performance is significantly degraded [19][20][21]. In contrast, single HSI super-resolution methods require no auxiliary images, which makes them more convenient to apply in real scenarios. Many approaches such as sparse regularization [22] and low-rank approximation [21,23] have been proposed in this direction. However, such hand-crafted priors are time-consuming and have limited generalization ability.
Currently, the convolutional neural network (CNN) has achieved great success in RGB image super-resolution tasks and has been introduced to restore HSIs. Compared with RGB image super-resolution, HSI super-resolution is more challenging. On the one hand, HSIs have far more bands than RGB images and most of the bands are useful for the actual analysis of surface features, but unfortunately several public HSI datasets have much smaller training sets than RGB datasets. Hence, the network needs to preserve the spectral information and avoid distortion while increasing the spatial resolution of the HSI, and the design needs to be delicate enough to avoid overfitting caused by insufficient data. There have been many attempts to handle this problem in recent years. For instance, Li et al. [24] proposed a spatial constraint method to increase the spatial resolution as well as preserve the spectral information. Furthermore, Li et al. [25] presented a grouped deep recursive residual network (GDRRN) to find a mapping function between the low-resolution HSI and the high-resolution HSI. They first combined the spectral angle mapper (SAM) loss with the mean square error (MSE) loss for network optimization, reducing the spectral distortion. Nonetheless, the spatial resolution of the results is relatively low. To better learn spectral information, Mei et al. [26] proposed a novel three-dimensional fully convolutional neural network (3D CNN), which can better learn the spectral context and alleviate distortion. In addition, the works in [27,28] applied 3D convolution in their networks. However, the use of 3D convolution requires a huge amount of computation due to the high-dimensional nature of HSI. On the other hand, there exists a large amount of unrelated redundancy in the spatial dimension, which hinders the effective processing of images. Although existing approaches try to extract texture features, it is still difficult to recover the delicate texture in the reconstructed high-resolution HSI [29,30]. For example, Jiang et al. [30] introduced a deep residual network with a channel attention module (SSPSR) and applied a skip-connection mechanism to help promote attention to high-frequency information.
To deal with the hyperspectral image super-resolution (HSI SR) problem, we propose a difference curvature multidimensional network (DCM-Net) in this paper. First, we group the input image in a band-wise manner and feed the groups into several parallel branches. In this way, the number of parameters is reduced while the performance is also improved, as evidenced by the experimental results. Then, in each branch, we devise a novel multidimensional enhanced block (MEB), consisting of several cascaded multidimensional enhanced convolution (MEC) units. MEC can exploit long-range intra- and inter-channel correlations through bottleneck projection and spatial and spectral attention. In addition, we design a difference curvature branch (DCB) to facilitate learning edge information and removing unwanted noise. It consists of five convolutional layers with different filters and can easily be applied to the network to recalibrate features. Extensive evaluation on three public datasets demonstrates that the proposed DCM-Net increases the resolution of HSIs with sharper edges and preserves the spectral information better than state-of-the-art (SOTA) methods.
In summary, the contributions of this paper are threefold.

1. We propose a novel difference curvature multidimensional network (DCM-Net) for hyperspectral image super-resolution, which outperforms existing methods in both quantitative and qualitative comparisons.

2. We devise a multidimensional enhanced convolution (MEC), which leverages a bottleneck projection to reduce the high dimensionality and encourage inter-channel feature fusion, as well as an attention mechanism to exploit spatial-spectral features.

3. We propose an auxiliary difference curvature branch (DCB) to guide the network to focus on high-frequency components and improve the SR performance on fine texture details.

The rest of the paper is organized as follows. In Section 2 we present the proposed method. The experimental results and analysis are presented in Section 3. Some ablation experiments and a discussion are presented in Section 4. Finally, we conclude the paper in Section 5.

Materials and Methods
In this section, we present the proposed DCM-Net in detail, including the network structure, the multidimensional enhanced block (MEB), the difference curvature-based branch (DCB), and the loss function. The overview network structure of the proposed DCM-Net is illustrated in Figure 1.

Network Architecture
The network of DCM-Net mainly consists of two parts: a two-step network for deep feature extraction and a reconstruction layer. Given the input low-resolution HSI $I_{LR} \in \mathbb{R}^{h \times w \times c}$, we want to reconstruct the corresponding high-resolution HSI $I_{SR} \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ ($h$ and $w$) denote the height and width of the high-resolution (low-resolution) image, and $C$ represents the number of spectral bands.
First, we feed the input $I_{LR} \in \mathbb{R}^{h \times w \times c}$ to two branches: a difference curvature-based branch (DCB), which is designed to further exploit the texture information, and a structure-preserving branch (SPB). This can be formulated as follows:

$$F_{DCB} = H_{DCB}(I_{LR}), \qquad F_{SPB} = H_{SPB}(I_{LR}),$$

where $F_{DCB}$ and $F_{SPB}$ stand for the output feature maps of DCB and SPB, and $H_{DCB}$ and $H_{SPB}$ stand for the corresponding functions. Both branches use the MEB as the basic unit, while the SPB additionally adopts a group strategy inspired by [24]: we split the input image channel-wise into several groups, which reduces the number of parameters needed by the network and thus lowers the burden on the device. More importantly, given the strong correlation between adjacent spectral bands, this group strategy promotes, to a certain extent, the interaction between channels with strong spatial-spectral correlation.
The input $I_{LR}$ can be divided into multiple groups, $I_{LR} = [I_{LR}^{(1)}, I_{LR}^{(2)}, \ldots, I_{LR}^{(S)}]$, where $S$ is the number of groups. We feed these groups into multiple MEBs to obtain the deep spatial-spectral features, using a novel convolutional operation that promotes channel-wise interaction and exploits long-range spatial dependencies:

$$F_{SPB}^{(s)} = H_{MEB}(I_{LR}^{(s)}), \quad s = 1, \ldots, S,$$

where $H_{MEB}(\cdot)$ denotes the function of the MEBs, which is described in detail in the following part.
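The band-wise grouping can be illustrated with a small index computation. This is a minimal sketch rather than the authors' implementation: the function name `group_bands` and the rule of anchoring a final group at the last band are assumptions, while the example settings (128 bands in groups of 8; 31 bands in groups of 4 with an overlap of one band) are the ones reported in the implementation details.

```python
def group_bands(num_bands, group_size, overlap=0):
    """Return a list of band-index groups (hypothetical helper).

    Consecutive groups share `overlap` bands. If the regular stride does
    not reach the last band, one extra group anchored at the end is added
    (an assumption made for this sketch).
    """
    if not 0 <= overlap < group_size:
        raise ValueError("overlap must satisfy 0 <= overlap < group_size")
    if group_size > num_bands:
        raise ValueError("group_size cannot exceed num_bands")
    stride = group_size - overlap
    starts = list(range(0, num_bands - group_size + 1, stride))
    if starts[-1] + group_size < num_bands:
        starts.append(num_bands - group_size)
    return [list(range(s, s + group_size)) for s in starts]


chikusei_groups = group_bands(128, 8)           # 16 groups of 8 bands
cave_groups = group_bands(31, 4, overlap=1)     # 10 groups of 4 bands
```

Each index group would then select the corresponding channels of the low-resolution cube before it is passed to its own stack of MEBs.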
After obtaining the outputs of both branches, we concatenate them for further global feature extraction; the concatenated feature is denoted $F_{concate}$. To lower the number of parameters and the computational complexity, a convolutional layer is applied to reduce the dimension. It is worth noting that pre-upsampling not only increases the number of parameters but also causes problems such as noise amplification and blurring, whereas post-upsampling makes it difficult to learn the mapping function directly when the scaling factor is large. We therefore adopt a progressive upsampling method as a compromise that avoids the problems of both [31][32][33]. The up-sampling settings are noted in the implementation details.
Finally, after obtaining the output of the global branch, here written $F_{global}$, we use a convolution layer for reconstruction:

$$I_{SR} = f_{rec}(F_{global}),$$

where $f_{rec}(\cdot)$ denotes the reconstruction layer and $I_{SR}$ denotes the final output of the network.

Overview
The structure of the MEB, which is designed to better learn the spectral correlation and the spatial details, is shown in Figure 2. Denoting $F^{n-1}_{MEB}$ and $F^{n}_{MEB}$ as the input and output of the block, and $f_{s1}$ and $f_{s2}$ as the stacked multidimensional enhanced convolution (MEC) layers, we have:

$$F^{n}_{MEB} = f_{s2}(f_{s1}(F^{n-1}_{MEB})),$$

where $f_{s1}$ and $f_{s2}$ stand for the two steps of the block. The details of MEC are discussed below.

Multidimensional Enhanced Convolution (MEC)
The residual network structure proposed by He et al. [34] has been widely used in many image restoration tasks and has achieved impressive performance. However, as mentioned before, dealing with HSI is trickier, since standard 2D convolution is inadequate to explicitly extract discriminative feature maps from the spectral dimension, while 3D convolution is computationally more costly. To address this issue, we introduce an effective convolution block to better exploit spectral correlation and reduce the redundancy in the spectral dimension while preserving more useful information.
Specifically, given the input $X$, we first split it channel-wise into two groups to reduce the computational burden. Furthermore, we add a branched path in one of the groups; in this path, a 1 × 1 convolution is applied as cross-channel pooling [35] to reduce the spectral dimensionality instead of the spatial dimensionality, which in effect performs a linear recombination of the input feature maps and allows information interaction between channels. Besides, the structure builds long-range spatial-spectral dependencies, which can further improve the network's performance. In addition, the number of parameters is reduced, which allows us to apply 5 × 5 convolution and enlarge the field of view. In the formulation of the unit for the input $X$, $f_{1 \times 1}$ and $f_{3 \times 3}$ denote 1 × 1 convolution and 3 × 3 convolution, respectively, $F_1$ and $F_2$ are the outputs of the upper and lower branches in Figure 2, and $\sigma$ is the sigmoid function; the setting of $r$ is given in the implementation details.
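The parameter saving from the channel split can be checked with simple arithmetic. The sketch below is only an illustration under stated assumptions: it counts bias-free weights, assumes each half of the channels is processed by its own k × k convolution, and ignores the additional 1 × 1 path of the MEC; the helper names `conv_params` and `split_conv_params` are hypothetical.

```python
def conv_params(c_in, c_out, k):
    """Weights of a bias-free k x k convolution from c_in to c_out channels."""
    return c_in * c_out * k * k


def split_conv_params(channels, k, groups=2):
    """Weights when the channels are split into `groups` equal parts and each
    part gets its own k x k convolution (assumption made for this sketch)."""
    per_group = channels // groups
    return groups * conv_params(per_group, per_group, k)


full = conv_params(64, 64, 3)        # single 3 x 3 convolution on 64 channels
split = split_conv_params(64, 3)     # two 3 x 3 convolutions on 32 channels each
ratio = full / split
```

Under these assumptions the split layer needs half the weights of the full layer, which is the kind of budget that can be spent on a larger kernel or an extra branch.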

Attention-Based Guidance
The attention mechanism is a prevalent practice in CNNs nowadays. It allows the network to attend to specific regions in the feature maps to emphasize important features. To further improve the ability of spectral correlation learning, we apply the channel attention module proposed by Zhang et al. [36] in the final part of the MEB. Specifically, with the input $F^{n}_{MEB}$, a spatial global pooling operation is used to aggregate the spatial information:

$$z = H_{GP}(F^{n}_{MEB}),$$

where $H_{GP}$ denotes the spatial global pooling and $z$ is the resulting channel descriptor. Then, a simple gating mechanism with a sigmoid function is applied:

$$s = f_{CA}(z),$$

where $f_{CA}$ denotes the gating mechanism. $s$ is the attention map, which is used to re-scale the input $F^{n}_{MEB}$ via an element-wise multiplication operation:

$$\hat{F}^{n}_{MEB} = s \cdot F^{n}_{MEB}.$$
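The pooling, gating, and re-scaling steps of channel attention can be sketched in plain Python. This is a toy illustration, not the module used in the paper: the gating is assumed to be a channel reduction followed by ReLU and a channel expansion followed by a sigmoid, and the weight matrices `w_down` and `w_up` are hypothetical bias-free parameters.

```python
import math


def channel_attention(features, w_down, w_up):
    """features: list of C channels, each an H x W list of lists.
    w_down: (C/r) x C weights, w_up: C x (C/r) weights (hypothetical)."""
    # Spatial global average pooling: one descriptor per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features]
    # Gating: channel reduction + ReLU, then channel expansion + sigmoid.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w_down]
    s = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, hidden))))
         for row in w_up]
    # Re-scale every channel by its attention weight.
    out = [[[s_c * v for v in row] for row in ch] for s_c, ch in zip(s, features)]
    return out, s


feats = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
out, attn = channel_attention(feats, w_down=[[1.0, 1.0]], w_up=[[0.0], [0.0]])
```

With zero expansion weights the sigmoid returns 0.5 for every channel, so each channel is simply halved; trained weights would instead emphasize some channels over others.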

Difference Curvature-Based Branch (DCB)
In the field of computer vision, there is a long history of using the gradient or curvature to extract texture features. For example, Chang et al. [37] simply concatenated the first-order and second-order gradients for feature representation based on the luminance values of the pixels in the patch. Zhu et al. [38] proposed a gradient-based super-resolution method to exploit more expressive information from external gradient patterns. In addition, Ma et al. [39] applied a first-order gradient to a generative adversarial network (GAN)-based method as structure guidance for super-resolution. Although these methods can extract high-frequency components, simple concatenation of gradients also brings undesired noise, which hinders feature learning. Compared with the gradient, curvature is better at representing high-frequency features. There exist three main kinds of curvature: Gaussian curvature, mean curvature, and difference curvature. Chen et al. [40] proposed and applied difference curvature as an edge indicator for image denoising; it is able to distinguish isolated noise from flat and ramp edge regions and outperforms Gaussian curvature and mean curvature. Later, Huang et al. [41] applied difference curvature for selective patch processing and learned the mixture prior models in each group. As for the hyperspectral image, due to its high dimensionality and relatively low spatial resolution, it is necessary to extract fine texture information efficiently to increase the spatial resolution.
To efficiently exploit the texture information of HSI, we designed an additional DCB to help the network focus on high-frequency components. Compared with traditional gradient-based guidance, which cannot effectively distinguish between edges and ramps, difference curvature combines the first- and second-order gradients and is therefore more informative. Consequently, it can effectively distinguish edges from ramps while removing unwanted noise. The difference curvature can be defined as follows:

$$D = \big|\, |f_{\varepsilon}| - |f_{\mu}| \,\big|,$$

where $f_{\varepsilon}$ and $f_{\mu}$ are defined as:

$$f_{\varepsilon} = \frac{f_x^2 f_{xx} + 2 f_x f_y f_{xy} + f_y^2 f_{yy}}{f_x^2 + f_y^2}, \qquad f_{\mu} = \frac{f_y^2 f_{xx} - 2 f_x f_y f_{xy} + f_x^2 f_{yy}}{f_x^2 + f_y^2},$$

with $f_x$, $f_y$ the first-order and $f_{xx}$, $f_{yy}$, $f_{xy}$ the second-order partial derivatives of the image. As demonstrated in Figure 3, the curvature calculation is easy to implement by using five convolutional layers with different filters. Based on these, the calculated difference curvature has the following properties in different image regions: (1) for edges, $|f_{\varepsilon}|$ is large but $|f_{\mu}|$ is small, so $D$ is large; (2) for smooth regions, $|f_{\varepsilon}|$ and $|f_{\mu}|$ are both small, so $D$ is small; and (3) for noise, $|f_{\varepsilon}|$ is large but $|f_{\mu}|$ is also large, so $D$ is small. Therefore, most parts of the curvature map have small values, and only high-frequency information is preserved. After the extraction module, we feed the curvature map into multiple MEBs to obtain higher-level information. Then, as shown in Figure 1, the output of the branch is fused with the features from the main branch. In this way, DCB guides the network to focus on high-frequency components and improves the SR performance on fine texture details.
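The edge-indicator behaviour of difference curvature can be reproduced on toy images. The sketch below is an assumption-laden illustration rather than the DCB itself: it uses central finite differences with replicated borders instead of the convolutional filters of the paper, follows the standard definition of difference curvature as the absolute difference between the magnitudes of the second derivatives along and across the gradient direction, and adds a small `eps` to avoid division by zero in flat regions.

```python
def difference_curvature(img, eps=1e-8):
    """Difference curvature of a 2D image given as a list of rows."""
    h, w = len(img), len(img[0])

    def px(y, x):
        # Replicate-border access (assumption made for this sketch).
        return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            fx = (px(y, x + 1) - px(y, x - 1)) / 2.0
            fy = (px(y + 1, x) - px(y - 1, x)) / 2.0
            fxx = px(y, x + 1) - 2.0 * px(y, x) + px(y, x - 1)
            fyy = px(y + 1, x) - 2.0 * px(y, x) + px(y - 1, x)
            fxy = (px(y + 1, x + 1) - px(y + 1, x - 1)
                   - px(y - 1, x + 1) + px(y - 1, x - 1)) / 4.0
            g2 = fx * fx + fy * fy + eps
            # Second derivatives along and across the gradient direction.
            along = (fx * fx * fxx + 2 * fx * fy * fxy + fy * fy * fyy) / g2
            across = (fy * fy * fxx - 2 * fx * fy * fxy + fx * fx * fyy) / g2
            out[y][x] = abs(abs(along) - abs(across))
    return out


step = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]   # sharp edge
ramp = [[float(x) for x in range(6)] for _ in range(5)]     # linear ramp
flat = [[3.0] * 6 for _ in range(5)]                        # smooth region
d_step = difference_curvature(step)[2][2]
d_ramp = difference_curvature(ramp)[2][2]
d_flat = difference_curvature(flat)[2][2]
```

On these inputs the response is close to 1 next to the step edge and vanishes on the ramp and in the flat region, which is the edge-versus-ramp separation that plain gradients cannot provide.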

Loss Function
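In recent image restoration works, the L1 loss and the MSE loss have been two widely used losses for network optimization. In the field of HSI super-resolution, previous works have also explored other losses, such as the SAM loss [24] and the SSTV loss [30], considering the special characteristics of HSI. These losses encourage the network to preserve the spectral information. Following this practice, we add the SSTV loss to the L1 loss [30] as the final training objective of our DCM-Net, i.e.,

$$L_{total} = L_{1} + \alpha L_{SSTV},$$

$$L_{1} = \frac{1}{N} \sum_{n=1}^{N} \left\| I^{n}_{HR} - H_{DCM\text{-}Net}(I^{n}_{LR}) \right\|_{1},$$

$$L_{SSTV} = \frac{1}{N} \sum_{n=1}^{N} \left( \left\| \nabla_{h} H_{DCM\text{-}Net}(I^{n}_{LR}) \right\|_{1} + \left\| \nabla_{w} H_{DCM\text{-}Net}(I^{n}_{LR}) \right\|_{1} + \left\| \nabla_{c} H_{DCM\text{-}Net}(I^{n}_{LR}) \right\|_{1} \right),$$

where $I^{n}_{LR}$ and $I^{n}_{HR}$ represent the $n$-th low-resolution image and its corresponding high-resolution one, and $N$ is the number of training images. $H_{DCM\text{-}Net}$ denotes the proposed network. $\nabla_{h}$, $\nabla_{w}$, and $\nabla_{c}$ denote the horizontal, vertical, and spectral gradient calculation operators, respectively. The setting of the hyper-parameter $\alpha$ that balances the two losses follows the previous work [30], i.e., it is set to 0.001 in this paper.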
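The combined objective can be written down for a single image cube. This is a minimal sketch under assumptions, not the training code of the paper: cubes are nested lists indexed as `[band][row][col]`, each term is normalized by a mean rather than a sum, and the gradients are simple forward differences; only the weight of 0.001 for the SSTV term is taken from the text.

```python
def _mean_abs(values):
    values = list(values)
    return sum(abs(v) for v in values) / len(values) if values else 0.0


def l1_loss(sr, hr):
    """Mean absolute error between two cubes indexed [band][row][col]."""
    return _mean_abs(s - h
                     for sb, hb in zip(sr, hr)
                     for srow, hrow in zip(sb, hb)
                     for s, h in zip(srow, hrow))


def sstv_loss(sr):
    """Mean absolute forward differences along width, height, and bands."""
    bands, rows, cols = len(sr), len(sr[0]), len(sr[0][0])
    horizontal = (sr[b][y][x + 1] - sr[b][y][x]
                  for b in range(bands) for y in range(rows) for x in range(cols - 1))
    vertical = (sr[b][y + 1][x] - sr[b][y][x]
                for b in range(bands) for y in range(rows - 1) for x in range(cols))
    spectral = (sr[b + 1][y][x] - sr[b][y][x]
                for b in range(bands - 1) for y in range(rows) for x in range(cols))
    return _mean_abs(horizontal) + _mean_abs(vertical) + _mean_abs(spectral)


def total_loss(sr, hr, alpha=1e-3):
    return l1_loss(sr, hr) + alpha * sstv_loss(sr)


sr = [[[0.0]], [[2.0]]]          # two bands, one pixel each
hr = [[[1.0]], [[2.0]]]
loss_same = total_loss(sr, sr)   # only the SSTV term contributes
loss_diff = total_loss(sr, hr)   # L1 term dominates
```

Because the SSTV term is scaled by 0.001, the L1 term dominates the objective, which is consistent with the later observation that SSTV brings only a limited improvement over plain L1.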

Evaluation Metrics
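We adopted six prevailing metrics to evaluate the performance from both the spatial and spectral aspects. These metrics include the peak signal-to-noise ratio (PSNR), structure similarity (SSIM) [42], spectral angle mapper (SAM) [43], cross correlation (CC) [44], root mean square error (RMSE), and erreur relative globale adimensionnelle de synthese (ERGAS) [45]. PSNR and SSIM are widely used to assess the similarities between images, while the remaining four metrics are often used to evaluate the HSI: CC is a spatial measurement, SAM is a spectral measurement, and RMSE and ERGAS are global measurements. In the following experiments, we regard PSNR, SSIM, and SAM as the main metrics, which are defined as follows:

$$PSNR = \frac{1}{L} \sum_{l=1}^{L} 10 \log_{10} \left( \frac{MAX_{l}^{2}}{MSE_{l}} \right),$$

$$SSIM = \frac{1}{L} \sum_{l=1}^{L} \frac{\left( 2 \mu^{l}_{I_{SR}} \mu^{l}_{I_{HR}} + c_{1} \right) \left( 2 \sigma^{l}_{I_{SR} I_{HR}} + c_{2} \right)}{\left( (\mu^{l}_{I_{SR}})^{2} + (\mu^{l}_{I_{HR}})^{2} + c_{1} \right) \left( (\sigma^{l}_{I_{SR}})^{2} + (\sigma^{l}_{I_{HR}})^{2} + c_{2} \right)},$$

$$SAM(x_{i}, \hat{x}_{i}) = \arccos \left( \frac{x_{i} \cdot \hat{x}_{i}}{\| x_{i} \|_{2} \, \| \hat{x}_{i} \|_{2}} \right),$$

where $L$ is the number of bands, $MAX_{l}$ denotes the maximum pixel value in the $l$-th band, $MSE_{l}$ is the mean square error between $I_{SR}$ and $I_{HR}$ in the $l$-th band, and $\mu^{l}_{I_{SR}}$, $\mu^{l}_{I_{HR}}$ represent the means of $I_{SR}$ and $I_{HR}$ in the $l$-th band, respectively. $(\sigma^{l}_{I_{SR}})^{2}$ and $(\sigma^{l}_{I_{HR}})^{2}$ denote the variances of $I_{SR}$ and $I_{HR}$ in the $l$-th band, while $\sigma^{l}_{I_{SR} I_{HR}}$ is the covariance of $I_{SR}$ and $I_{HR}$ in the $l$-th band; $c_{1}$ and $c_{2}$ are small constants that stabilize the division. $x_{i}$ and $\hat{x}_{i}$ are the spectral vectors of the $i$-th pixel in the reference and reconstructed images, and $\cdot$ denotes the dot product operation.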
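Two of the main metrics are easy to verify by hand. The following sketch assumes flat Python lists: `psnr` handles the pixels of a single band (the per-band values would then be averaged over all bands), and `sam` handles one pair of spectral vectors (the per-pixel angles would then be averaged over the image). The function names and the clamping of the cosine are choices made for this illustration.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """PSNR in dB for one band given as flat lists of pixel values."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_value ** 2 / mse)


def sam(x, x_hat):
    """Spectral angle in radians between two spectral vectors."""
    dot = sum(a * b for a, b in zip(x, x_hat))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in x_hat))
    # Clamp to the valid arccos domain to absorb rounding errors.
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.acos(cosine)


band_psnr = psnr([0.0, 1.0], [0.1, 0.9])            # MSE of 0.01 at peak 1.0
angle_scaled = sam([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
angle_orthogonal = sam([1.0, 0.0], [0.0, 1.0])
```

The scaled spectrum yields an angle of essentially zero, which shows why SAM measures spectral shape rather than brightness, whereas PSNR penalizes any intensity difference.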

1. Chikusei dataset [46]: the Chikusei dataset (https://www.sal.t.u-tokyo.ac.jp/hyperdata/, accessed on 29 July 2014) was taken by the Headwall Hyperspec-VNIR-C imaging sensor over agricultural and urban areas in Chikusei, Ibaraki, Japan. The central point of the scene is located at coordinates 36.294946N, 140.008380E. The hyperspectral dataset has 128 bands in the spectral range from 363 nm to 1018 nm. The scene consists of 2517 × 2335 pixels and the ground sampling distance was 2.5 m. A ground truth of 19 classes was collected via a field survey and visual inspection using high-resolution color images obtained by a Canon EOS 5D Mark II together with the hyperspectral data.

2. Cave dataset [47]: the Cave dataset (https://www.cs.columbia.edu/CAVE/databases/multispectral/, accessed on 29 April 2020) was obtained with a cooled CCD camera and contains full spectral resolution reflectance data from 400 nm to 700 nm at a resolution of 10 nm (31 bands in total), covering 32 scenes of everyday objects. The image size is 512 × 512 pixels and each image is stored as a 16-bit grayscale PNG image per band.

3. Harvard dataset [48]: the Harvard dataset (http://vision.seas.harvard.edu/hyperspec/index.html, accessed on 29 April 2020) contains fifty images captured under daylight illumination with a commercial hyperspectral camera (Nuance FX, CRI Inc., USA), which is capable of acquiring images from 420 nm to 720 nm at a step of 10 nm (31 bands in total).

Implementation Details
Because the numbers of spectral bands in the three datasets are different, the experimental setting varies. For the Chikusei dataset, we divided the 128 bands into 16 groups, i.e., 8 bands per group. For the Cave and Harvard datasets, which both have 31 bands, we put 4 bands in one group with an overlap of one band between adjacent groups (10 groups). The number of MEBs was set to 3 for Chikusei and 6 for Cave and Harvard. As for the MEC module, we applied two 1 × 1 convolutions to reduce the dimension by half. For the 3 × 3 convolution, the padding size was set to 1 to keep the spatial size of the feature maps. We implemented the network with PyTorch and optimized it using the ADAM optimizer with an initial learning rate of $1 \times 10^{-4}$, which was halved every 15 epochs. The batch size was 16.
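The step-wise learning rate schedule can be expressed as a one-line function. This is a sketch of the stated rule only, assuming zero-based epoch indices; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-4, step=15, factor=0.5):
    """Learning rate at a zero-based epoch when it is halved every `step` epochs."""
    return base_lr * factor ** (epoch // step)


schedule = [learning_rate(e) for e in (0, 14, 15, 30, 45)]
```

In PyTorch, the same decay is usually obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.5)`, stepping the scheduler once per epoch.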

Results
In this section, we describe the experiments conducted to evaluate the effectiveness of the proposed DCM-Net and compare it with existing single HSI super-resolution methods on three public datasets, which will be discussed in detail in the following sections.

Results for the Chikusei Dataset
Taken by the Headwall Hyperspec-VNIR-C imaging sensor over agricultural and urban areas in Chikusei, Ibaraki, Japan, the hyperspectral dataset has 128 bands in the spectral range from 363 nm to 1018 nm [46]. To be consistent with previous works [30], we followed the same setup: we cropped four images of 512 × 512 × 128 pixels for testing and used the rest for training (3226 training samples for a scale factor of 2, 3119 for ×4, and 757 for ×8). The results for the different methods are summarized in Table 1. As can be seen, we have the greatest advantage at a scale factor of 2, with a PSNR 0.58 dB higher and a SAM 0.09 lower than the second-best result; at a scale factor of 4, PSNR/SAM/SSIM are 0.18 dB/0.05/0.005 better than those of SSPSR; at a scale factor of 8, we also achieved the best performance.
To further illustrate the superiority of DCM-Net, we present the visual results in Figure 4 as well as their spectral curves in Figure 5. As can be seen, our method outperforms the others. In Figure 4a, there is a thin light-colored line along the thick dark line in the ground truth image, which is not captured and restored by 3DFCNN. Although it can be observed in the results reconstructed by GDRRN and SSPSR, the line is too inconspicuous to be easily detected. By contrast, our method preserves the details, making the line much clearer. In Figure 4b, compared with the blurry results of other methods, our DCM-Net yields a better result with clear details, e.g., the edge is sharper and the structure is more consistent. In Figure 4c, there are two very close lines in the ground truth image, which can hardly be distinguished in the results of other methods. By contrast, these two lines can still be observed in our result, demonstrating that our DCM-Net can exploit the spectral interactions and benefit from the difference-curvature guidance to reconstruct fine edges. We also present the spectral curves of the three test images and their super-resolution results in Figure 5, which were produced with ENVI (remote-sensing software that provides hyperspectral image analysis, image enhancement, and feature extraction). We can see that the curves of 3DFCNN, GDRRN, and SSPSR are very close to that of bicubic interpolation, implying a limited ability to restore spectral information. By contrast, the curve of our DCM-Net is close to the ground truth, demonstrating that our network can better preserve spectral information and avoid distortions. In addition, we show the absolute error maps of these three images in Figure 6. The bluer the error map is, the closer the reconstructed image is to the original image; here again, it can be seen that our method is better able to preserve edge features.

Results for the Cave Dataset
Different from the Chikusei dataset, which was obtained with a remote sensing camera, the Cave dataset was obtained with a cooled CCD camera and contains full spectral resolution reflectance data from 400 to 700 nm at a resolution of 10 nm (31 bands in total), covering 32 scenes of everyday objects [47]. The images are of size 512 × 512 pixels and are stored as 16-bit grayscale PNG images per band. We randomly chose 8 scenes for testing and used the remaining images for training. They were randomly cropped into patches of 32 × 32 pixels, 64 × 64 pixels, and 128 × 128 pixels when the scale factors were 2, 4, and 8, respectively (1555 training samples for each of the scale factors 2, 4, and 8).
As for the Chikusei dataset, we tested our method on the Cave dataset at three scale factors and compared it with three recent approaches, i.e., 3DFCNN, GDRRN, and SSPSR. The results are reported in Table 2. As can be seen, our DCM-Net outperforms the second-best method to varying degrees. To better illustrate this, we also show the visual results of different methods for two test images of the Cave dataset in Figure 7, the absolute error maps in Figure 8, and the spectral curves in Figure 9. Comparing the absolute error maps of the different methods shows that the map generated by our method is bluer, especially around the edges. This indicates that our method recovers texture information better and that the recovered image is closest to the original image. As can be seen in all three figures, our method recovers lines better, especially those in close proximity to each other. In general, the above results show that our network not only gains better performance on HSIs with hundreds of bands, but also outperforms other methods on the multispectral dataset.

Results for the Harvard Dataset
The Harvard dataset contains fifty images captured under daylight illumination with a commercial hyperspectral camera (Nuance FX, CRI Inc., Woburn, MA, USA), which is capable of acquiring images from 420 to 720 nm at a step of 10 nm (31 bands in total) [48]. For training, we randomly selected 90% of the images (45 images) and cropped them into 32 × 32 patches, 64 × 64 patches, and 128 × 128 patches when the scale factors were 2, 4, and 8, respectively (3888 training samples for each of the scale factors 2, 4, and 8). We used the other 5 images for testing. Table 3 summarizes the results of different approaches on the Harvard dataset for scale factors 2, 4, and 8. As can be seen, the results here differ from our performance on Chikusei and Cave: our method is slightly behind SSPSR when the scale factor is 2, while we have a clear advantage when the scale factor is 8. To better illustrate the superiority of our DCM-Net, we present the super-resolution results of different methods for test images from the Harvard dataset in Figure 10. Here we chose the scale factor 8 to illustrate the robustness of our method. From Figure 10b, we can see that the super-resolution images produced by 3DFCNN and GDRRN are very blurry. Besides, white grid artifacts can be found in their zoomed-in results. As for SSPSR, it recovers sharper images at first glance; however, many structures of the original images have been lost. For such a large scale factor, although our DCM-Net does not recover the fine structures of the words, it captures the outline of the words without causing geometric inconsistency, and its result is closest to the ground truth. In Figure 10a, it is also clear that, among all the methods, DCM-Net yields the super-resolution image with the fewest structural distortions.

Analysis on Loss Function
The choice of the loss function is crucial for reconstructing high-quality images; here we mainly experiment with and discuss the L1 loss, the MSE loss, and the SSTV loss that we use. In previous work, the MSE loss was preferred for training networks because it is believed to converge faster and yield better metrics [24,25,49]. However, the experiments reported in Table 4 show that the MSE loss is not a good choice for HSI SR. First, as can be seen from the loss graph in Figure 11, when the loss is close to the optimum, its derivatives are too small and the learning slows down, which makes the network convergence time much longer than expected. In addition, studies have shown that the MSE loss yields images of relatively poor perceptual quality because it penalizes large errors strongly and small errors weakly; if a texture or mesh appears, optimizing the MSE may smooth out this area [50]. Spatial-spectral total variation (SSTV) was proposed by Aggarwal et al. [51] and was applied as a loss function by Jiang et al. [30]; it is presumed to encourage the network to preserve spatial-spectral information and avoid distortion. Our experiment confirms that the SSTV loss brings a certain improvement over the L1 loss, but the improvement is very limited because the main body of this loss is still the L1 loss.
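The slow-down of the MSE loss near the optimum follows directly from its derivative. The sketch below compares the per-element gradients of the squared error and the absolute error for a scalar residual; it is a worked illustration with hypothetical function names, not an analysis of the trained networks.

```python
def grad_mse(err):
    """Derivative of err**2 with respect to err."""
    return 2.0 * err


def grad_l1(err):
    """Derivative of |err| with respect to err (0 at err == 0)."""
    return float((err > 0) - (err < 0))


residuals = [1.0, 0.1, 0.001]
mse_grads = [grad_mse(e) for e in residuals]   # shrink with the residual
l1_grads = [grad_l1(e) for e in residuals]     # stay at magnitude 1
```

As the residual shrinks from 1.0 to 0.001, the squared-error gradient shrinks by the same factor while the absolute-error gradient keeps its magnitude, which matches the observation that learning with MSE slows down close to the optimum.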

Figure 11. The curves yielded by different loss functions (panels: training loss per epoch and validation loss per epoch). It is worth stating that, because the training loss curves of L1 and SSTV are too close, only one line can be visually distinguished on the TensorBoard plot; the curve of SSTV is covered by the blue line.

Analysis of Multidimensional Enhanced Convolution (MEC)
Before discussing the impact that MEC has on our network, we report experiments with two popular structures, Res2Net [52] and SCConv [53], which inspired our design and modification of MEC in the initial phase of the experiments (presented in Figure 12). In 2019, Gao et al. [52] constructed a new CNN structure, Res2Net, which represents multi-scale features at a granular level and enlarges the receptive field of each layer by constructing hierarchical residual connections within a single residual block, and claimed that it can be used in state-of-the-art backbone networks. Then, in 2020, Liu et al. [53] proposed a novel self-calibrated convolution, SCConv, which models long-range dependencies and enlarges the field of view through average pooling. It can also be plugged into any network to augment standard convolution. However, according to Table 5, neither structure performs well in this experiment, which is mainly due to the small amount of data provided by the hyperspectral image datasets. In the Res2Net experiment, the training loss keeps decreasing, while the validation loss fails to converge. SCConv reduces overfitting and accelerates convergence to some extent by adding pooling, but it does not achieve a good result; this may again be attributed to pooling, which deprives the network of some information that is essential for image reconstruction [54].
Next, we performed an ablation study of MEC. In Table 6, "Our" and "Our-w/o MEC" denote the model equipped with both modules and the model without MEC, which uses only standard convolutions instead. As can be seen, after removing MEC, PSNR drops by 0.16 dB and SAM is 0.06 higher. The results clearly demonstrate that MEC outperforms standard convolutions and can better learn the spatial and spectral correlation. It not only improves the spatial resolution but also avoids spectral distortion. More importantly, using MEC reduces the number of parameters by a factor of two and saves nearly 10 G MACs of computation compared to using normal 3 × 3 convolution.

Difference Curvature-Based Branch (DCB)
The additional difference curvature branch is designed to extract the curvature and provide guidance information that helps the network preserve texture and fine details. As can be seen from Table 6, without DCB, the PSNR of "Our-w/o DCB" is 0.10 dB lower than that of "Our", demonstrating the effectiveness of the DCB. The SSIM and SAM scores are also inferior to those of "Our". In addition, Figure 13 shows the visual results after curvature extraction: the edges are well preserved, and we use this curvature map to guide the network to focus on texture and edge areas so that fine details are preserved in the super-resolution results. The right side of Figure 13 presents the visual differences between our method with and without DCB; with the help of DCB, the lines of the image are sharper and more detailed features are preserved. Importantly, the module adds little computational burden.
It is worth noting that, although DCB tends to recover sharper images, this does not usually translate into a significant increase in the quantitative metrics, even though it clearly improves the visual quality.
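To make the edge indicator concrete, the sketch below computes a difference curvature map for a single band. It assumes the standard definition D = ||u_nn| − |u_tt||, where u_nn and u_tt are the second derivatives along and perpendicular to the gradient direction, and it uses central finite differences with replicated borders. The function name, the discretization, and the small constant `eps` are illustrative assumptions; the curvature extraction inside the DCB may be implemented differently.

```python
def difference_curvature(img, eps=1e-8):
    """Return D = | |u_nn| - |u_tt| | per pixel for a 2-D list of floats."""
    h, w = len(img), len(img[0])

    def px(y, x):
        # Replicate-pad by clamping indices to the image border.
        return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # First and second derivatives by central differences.
            ux = (px(y, x + 1) - px(y, x - 1)) / 2.0
            uy = (px(y + 1, x) - px(y - 1, x)) / 2.0
            uxx = px(y, x + 1) - 2.0 * px(y, x) + px(y, x - 1)
            uyy = px(y + 1, x) - 2.0 * px(y, x) + px(y - 1, x)
            uxy = (px(y + 1, x + 1) - px(y + 1, x - 1)
                   - px(y - 1, x + 1) + px(y - 1, x - 1)) / 4.0
            g2 = ux * ux + uy * uy + eps  # squared gradient magnitude
            # Second derivative along the gradient direction.
            u_nn = (ux * ux * uxx + 2.0 * ux * uy * uxy + uy * uy * uyy) / g2
            # Second derivative perpendicular to the gradient direction.
            u_tt = (uy * uy * uxx - 2.0 * ux * uy * uxy + ux * ux * uyy) / g2
            out[y][x] = abs(abs(u_nn) - abs(u_tt))
    return out
```

On a vertical step edge the response is close to 1 on the two columns adjacent to the step and 0 in the flat regions, which is the behaviour expected of an edge indicator: the map is large where the intensity changes strongly in one direction only and small where the image is flat.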

Analysis of Channel Group Numbers
Grouping strategies have been applied to hyperspectral images in many different forms [25,30,55]. They all aim at reducing the computational overhead and making the subsequent upscaling operation feasible. This matters especially for hyperspectral images, which have a much smaller data volume than RGB images but up to several hundred channels: grouping helps maintain network performance while preventing the network from becoming too wide and too difficult to train. To better understand the impact of grouping on both computational overhead and network performance, we conducted experiments on the number of groups. We tested group numbers of 1, 16, 20, and 25, and selected PSNR, SAM, and SSIM as indicators of image reconstruction quality. The number of multiply-accumulate operations (MACs, in G) and the number of parameters (Params, in M) indicate the computational overhead and the model size. As can be seen from Table 7, without the grouping strategy (a single group), the computational overhead of the network remains comparable to that of a grouped network, but the number of required parameters increases greatly. This makes it much more difficult to train the network on hyperspectral data, for which training sets are generally small. Therefore, a dimensionality reduction strategy such as grouping is effective. After experimenting with multiple group numbers, we set the number of groups to 20, taking into account both the computational overhead and the performance of the network.
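One way to see why grouping can leave the computation roughly unchanged while shrinking the parameter count is to count the weights and MACs of a single convolution layer. The sketch below is illustrative only: the 128-band input, the 64 output features, the 32 × 32 patch, the contiguous non-overlapping split, and the helper names `split_bands` and `conv_cost` are all assumptions, and the layer is taken to be shared across groups. The actual grouping scheme of the network may use overlapping groups or a different layer layout.

```python
def split_bands(num_bands, num_groups):
    """Split band indices 0..num_bands-1 into contiguous, near-equal groups."""
    if not 1 <= num_groups <= num_bands:
        raise ValueError("num_groups must be between 1 and num_bands")
    base, extra = divmod(num_bands, num_groups)
    groups, start = [], 0
    for g in range(num_groups):
        size = base + (1 if g < extra else 0)  # first `extra` groups get one more band
        groups.append(range(start, start + size))
        start += size
    return groups

def conv_cost(in_ch, out_ch, height, width, kernel=3):
    """Weights and MACs of one stride-1, same-padded, bias-free convolution."""
    weights = in_ch * out_ch * kernel * kernel
    macs = weights * height * width
    return weights, macs

# Hypothetical setting: 128 bands, 16 equal groups, 64 features, 32 x 32 patch.
groups = split_bands(128, 16)
full_weights, full_macs = conv_cost(128, 64, 32, 32)
shared_weights, macs_per_group = conv_cost(len(groups[0]), 64, 32, 32)
grouped_macs = macs_per_group * len(groups)  # the shared layer runs once per group
```

In this toy setting the shared layer is applied once per group, so the total MACs match those of the ungrouped layer, while the number of weights drops by a factor equal to the number of groups. When the band count is not divisible by the number of groups, the group sizes differ by one band, and a shared layer would additionally require overlapping or padded groups, which this sketch does not cover.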

Analysis on Attention-Driven Guidance
Channel attention (CA) has been verified as a very effective tool for learning channel correlation and has been adopted by various methods in different fields [30,36,56]. We tested this mechanism on the Chikusei dataset with a scaling factor of 4. As can be seen from Table 6, without channel attention our network performs the worst on SAM, which indicates the weakest capability of learning spectral correlation. In addition, both PSNR and SSIM decline to varying degrees. Comparing the computation with and without CA shows that adding CA brings little computational overhead, so the improvement does not come from blindly increasing the computation. For hyperspectral images with hundreds of spectral bands, CA plays an important role.
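As a rough illustration of why channel attention is inexpensive, the sketch below implements a squeeze-and-excitation style gate in plain Python. This section does not specify the exact form of the CA module, so the structure (global average pooling, a two-layer bottleneck, sigmoid gating), the function name, and the weight matrices `w_down` and `w_up` are assumptions used only to show the mechanism.

```python
import math

def channel_attention(features, w_down, w_up):
    """Squeeze-and-excitation style gating (illustrative, not the paper's code).

    features: list of C channels, each a list of rows of floats.
    w_down:   R x C weight matrix of the bottleneck layer (hypothetical).
    w_up:     C x R weight matrix of the expansion layer (hypothetical).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in features]
    # Excitation: bottleneck layer with ReLU, then sigmoid gates in (0, 1).
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w_down]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_up]
    # Scale: reweight every channel by its gate.
    scaled = [[[g * v for v in row] for row in ch] for ch, g in zip(features, gates)]
    return scaled, gates
```

The gate computation operates on one scalar per channel rather than on full feature maps, so its cost is small compared with that of the convolutions, which is consistent with the observation above that CA adds little computational overhead.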

Complexity Analysis
As can be seen in Table 8, although the two-branch network looks complex, we reduced the computational overhead of the model in several ways. First, applying parameter sharing reduces the number of parameters by at least 70 percent. Second, the grouping strategy reduces the required parameters and computational overhead while maintaining the performance of the network; without grouping, the network would need to be wider and deeper to maintain the same performance. In addition, by using MEC instead of standard 3 × 3 convolution, we lowered Params(M) from 20.3 to 10.96 and MACs(G) from 58.98 to 41.89.
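The relative savings implied by the figures quoted above can be checked with a few lines of arithmetic. The helper name `relative_reduction` is introduced here for illustration; the input values are the ones stated in the text.

```python
def relative_reduction(before, after):
    """Fraction of the original cost that is saved."""
    return (before - after) / before

# Values quoted in the text for standard 3 x 3 convolution versus MEC.
params_saved = relative_reduction(20.3, 10.96)   # Params(M)
macs_saved = relative_reduction(58.98, 41.89)    # MACs(G)

print(f"Params reduced by {params_saved:.1%}")   # prints "Params reduced by 46.0%"
print(f"MACs reduced by {macs_saved:.1%}")       # prints "MACs reduced by 29.0%"
```

Under these figures, replacing standard convolution with MEC removes a little under half of the parameters and a little under a third of the MACs.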

Conclusions
In this paper, we proposed a deep difference curvature-based network with multidimensional enhanced convolutions for HSI super-resolution. Specifically, to reduce the redundancy and better exploit the spectral information, we introduced a multidimensional enhanced convolution unit into the network, which learns the useful spectral correlation through a self-attention mechanism and a bottleneck projection. In addition, we designed a difference curvature branch to extract the delicate texture features of a hyperspectral image. This branch works as an edge indicator to preserve the texture information and eliminate unwanted noise. Experiments on three public datasets demonstrated that our method recovers finer details and yields sharper images with minimal spectral distortion compared to state-of-the-art methods. Despite the good results obtained by the network, it remains difficult to use in real-world applications because of its heavy computational overhead. We recognize the difficulty and significance of hardware-based implementation of high-quality super-resolution, and our future work will focus on making the network lightweight and deployable on hardware.