Mixed 2D/3D Convolutional Network for Hyperspectral Image Super-Resolution

Deep learning-based hyperspectral image super-resolution (SR) methods have achieved great success recently. However, previous works suffer from two main problems. The first is that relying solely on typical three-dimensional (3D) convolution increases the number of network parameters. The second is that they do not pay enough attention to mining the spatial information of the hyperspectral image while the spectral information is being extracted. To address these issues, in this paper, we propose a mixed convolutional network (MCNet) for hyperspectral image super-resolution. We design a novel mixed convolutional module (MCM) that extracts the potential features with both 2D and 3D convolution instead of a single type of convolution, which enables the network to mine more spatial features of the hyperspectral image. To exploit the effective features from the 2D units, we design a local feature fusion that adaptively fuses all the hierarchical features in the 2D units. In the 3D units, we employ spatially and spectrally separable 3D convolution to extract spatial and spectral information, which reduces unaffordable memory usage and training time. Extensive evaluations and comparisons on three benchmark datasets demonstrate that the proposed approach achieves superior performance in comparison to existing state-of-the-art methods.


Introduction
Hyperspectral imaging systems collect surface information in tens to hundreds of continuous spectral bands to acquire hyperspectral images. Compared with multispectral or natural images, a hyperspectral image carries more abundant spectral information about ground objects, which can reflect the subtle spectral properties of the measured objects in detail [1]. As a result, it is widely used in various fields, such as mineral exploration [2], medical diagnosis [3], plant detection [4], etc. However, the obtained hyperspectral image is often low-resolution because of environmental interference and other factors. This limits the performance of high-level tasks, including change detection [5], image classification [6], etc.
To better and accurately describe the ground objects, the hyperspectral image super-resolution (SR) is proposed [7][8][9]. It aims to restore high-resolution hyperspectral image from degraded low-resolution hyperspectral image. In practical applications, the objects in the image are often detected or recognized according to the spectral reflectance of the object. Therefore, the change of spectral curve should be taken into account in reconstruction, which is different from natural image SR in computer vision [10].
Since the spatial resolution of hyperspectral images is lower than that of RGB images [11], existing methods mainly fuse a high-resolution RGB image with the low-resolution hyperspectral image [12,13]. For instance, Kwon et al. [12] use the high-resolution RGB image corresponding to the scene. The main contributions of this paper are summarized as follows:

• The novel mixed convolutional module (MCM) is proposed to mine the potential features. Using the correlation between 3D and 2D feature maps, the 3D and 2D convolutions share spatial information by reshaping. Compared with using only 3D convolution, this not only reduces the parameters of the network, but also makes the network relatively easy to learn.

• Spatial and spectral separable 3D convolution is employed to extract spatial and spectral features in each 3D unit. It can effectively reduce unaffordable memory usage and training time.

• The local feature fusion is designed to adaptively preserve the accumulated features from each 2D unit. It makes full use of all the hierarchical features in each 2D unit after changing the size of the feature maps.

• Extensive experiments on three benchmark datasets demonstrate that the proposed approach achieves superior performance in comparison to existing state-of-the-art methods.

Related Work
There exists an extensive body of literature on hyperspectral image SR. Here we first outline several deep learning-based hyperspectral image SR methods. To better understand the proposed method, we then give a brief introduction to 3D convolution.

Deep Learning-Based Methods
Recently, deep learning-based methods [31] have achieved remarkable advantages in the field of hyperspectral image SR. Here, we briefly introduce several methods based on CNNs. Li et al. [25] propose a deep spectral difference convolutional neural network (SDCNN) with five convolutional layers to improve spatial resolution. Under a spatial constraint strategy, the reconstructed hyperspectral image preserves spectral information through post-processing. Jia et al. [24] present a spectral-spatial network (SSN) comprising spatial and spectral sections. It learns the mapping function between low-resolution and high-resolution images and then fine-tunes the spectrum. Yuan et al. [23] transfer knowledge from natural images to restore high-resolution hyperspectral images via transfer learning, and collaborative non-negative matrix factorization is proposed to enforce collaboration between the low-resolution and high-resolution hyperspectral images. All of these methods need two steps to achieve image reconstruction: the algorithm first improves the spatial resolution, and then, to avoid spectral distortion, some constraint criteria are employed to retain the spectral information. Clearly, the spatial resolution may be changed while the spectral information is being maintained. Inspired by the deep recursive residual network [32], Li et al. [22] propose the grouped deep recursive residual network (GDRRN) for the hyperspectral image SR task. Because 2D convolution is employed, the above networks can only extract the spatial information of hyperspectral images (see Figure 1a). They do not use the information of the spectral dimension, and thus achieve poor performance. Since 3D convolution can extract spectral and spatial information at the same time (see Figure 1b), Mei et al. [26] present a five-layer 3D full convolutional neural network (3D-FCNN). It explores the relationship between the spatial information and adjacent pixels across spectra.
However, the method changes the size of the estimated hyperspectral image, which is not suitable for the purpose of image reconstruction. Yang et al. [27] design a multi-scale wavelet 3D convolutional neural network (MW-3D-CNN). The network includes pre-processing and post-processing stages. Inspired by the generative adversarial network (GAN), many GAN-based hyperspectral image SR algorithms have been proposed. Li et al. [28] design a 3D-GAN-based hyperspectral image SR method. Jiang et al. [34] propose a GAN that contains spectral and spatial feature extraction sections. However, GAN-based SR is usually not easy to train. Furthermore, these networks either have many parameters or do not extract spatial and spectral features at the same time. Later, Li et al. [30] propose a dual 1D-2D spatial-spectral convolutional neural network. It uses 1D and 2D convolution to extract spectral and spatial features, respectively, and fuses them by a reshape operation. Although this method effectively solves the above issues, it lacks further exploration of the spatial information of the image.

3D Convolution
For natural image SR, scholars usually employ 2D convolution to extract features and obtain good performance [35,36]. As introduced earlier, a hyperspectral image contains many continuous bands, so a significant characteristic is the strong correlation between adjacent bands [37]. If we directly used 2D convolution for the hyperspectral image SR task, it would be impossible to effectively exploit the potential features between bands. Therefore, in order to make full use of this characteristic, we design our network using 3D convolution to analyze the spatial and spectral features of the hyperspectral image.
Since 3D convolution takes into account the inter-frame motion information in the time dimension, it is widely used in video classification [38], action recognition [39], and other fields. Unlike 2D convolution, the 3D convolution operation is implemented by convolving a 3D kernel with the feature maps. Intuitively, the number of parameters of a network using 3D convolution is an order of magnitude larger than that of its 2D counterpart. To address this problem, Xie et al. [40] develop the separable 3D CNN (S3D) model to accelerate video classification. In this model, the standard 3D convolution is replaced by spatially and temporally separable 3D convolution (see Figure 2), which demonstrates that this approach can effectively reduce the number of parameters while still maintaining good performance.
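As a concrete illustration, the factorization above can be sketched in PyTorch (the class name, channel count, and kernel size below are illustrative, not taken from [40]):

```python
import torch
import torch.nn as nn

class SeparableConv3d(nn.Module):
    """Sketch of a spatially/temporally separable 3D convolution in the spirit
    of S3D: a k x k x k kernel is factored into a 1 x k x k (spatial) and a
    k x 1 x 1 (temporal/spectral) convolution applied in sequence."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2))
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(k, 1, 1),
                                  padding=(k // 2, 0, 0))

    def forward(self, x):  # x: (N, C, D, H, W)
        return self.temporal(self.spatial(x))

# A full 3x3x3 convolution needs C*C*27 weights; the factored pair needs
# only C*C*9 + C*C*3, i.e. less than half as many.
```

The padding keeps the feature-map size unchanged, so the block is a drop-in replacement for a same-padded standard 3D convolution.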

Figure 2. Illustration that a standard 3D convolution can be separated into two parts: a spatial convolution and a temporal convolution.

Network Structure
In this section, we detail the overall architecture of our MCNet, whose flowchart is shown in Figure 3. As can be seen from this figure, our method mainly consists of three parts: the initial feature extraction (IFE) sub-network, the deep feature extraction (DFE) sub-network, and the image reconstruction (IR) sub-network. Let I_LR ∈ R^{L×W×H} and I_SR represent the input low-resolution hyperspectral image and the output reconstructed hyperspectral image, where W and H are the width and height of each band, and L is the total number of bands in the hyperspectral image.

As noted earlier, 3D convolution can analyze information beyond the spatial dimensions. Therefore, in this paper, we use 3D convolution to extract spatial and spectral information from the hyperspectral image. Since the size of the input low-resolution image is L × W × H, in order to employ 3D convolution, we need to reshape I_LR into four dimensions (1 × L × W × H) at the beginning of the network. Then, a standard 3D convolution is applied to extract shallow features from I_LR, i.e.,

F_0 = f_c(Reshape(I_LR)),

where Reshape(·) is the function that changes the size of the feature maps, and f_c(·) denotes the 3D convolution operation. The initial features F_0 are fed into the mixed convolutional modules, which are described in detail in Section 3.2. After D such modules and a global skip connection, the deep feature maps F_D are given by

F_D = M_D(M_{D−1}(· · · M_1(F_0) · · ·)) + F_0,

where M_d(·) denotes the operation of the d-th module. The impact of the number of modules D on our network is analyzed in Section 4.4.1. In the IR sub-network, we use a transposed convolution layer to upsample these feature maps to the desired scale with scale factor r, followed by a convolution layer. After reshaping, the output size becomes L × rW × rH. Finally, the output of MCNet is obtained by

I_SR = Reshape(f_c(f_up(F_D))),

where f_up(·) is the upsampling function.
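The three sub-networks described above can be sketched as follows. This is a simplified stand-in, not the authors' implementation: the residual block is a placeholder for the mixed convolutional module of Section 3.2, and the transposed-convolution kernel sizing assumes an even scale factor.

```python
import torch
import torch.nn as nn

class ResBlock3d(nn.Module):
    """Placeholder for the mixed convolutional module (MCM)."""
    def __init__(self, c):
        super().__init__()
        self.conv = nn.Conv3d(c, c, 3, padding=1)
    def forward(self, x):
        return x + torch.relu(self.conv(x))

class MCNetSketch(nn.Module):
    def __init__(self, D=4, c=64, scale=2):
        super().__init__()
        self.head = nn.Conv3d(1, c, 3, padding=1)                      # IFE
        self.body = nn.Sequential(*[ResBlock3d(c) for _ in range(D)])  # DFE
        # Transposed conv upsamples only the spatial dims (even scale assumed).
        self.up = nn.ConvTranspose3d(c, c, kernel_size=(1, 2 * scale, 2 * scale),
                                     stride=(1, scale, scale),
                                     padding=(0, scale // 2, scale // 2))  # IR
        self.tail = nn.Conv3d(c, 1, 3, padding=1)

    def forward(self, lr):                   # lr: (N, L, W, H)
        x = lr.unsqueeze(1)                  # reshape to (N, 1, L, W, H)
        f0 = self.head(x)
        fd = self.body(f0) + f0              # global skip connection
        sr = self.tail(self.up(fd))
        return sr.squeeze(1)                 # back to (N, L, rW, rH)
```

With scale factor r = 2, an L × W × H input comes out as L × 2W × 2H, matching the description above.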

Mixed Convolutional Module
The architecture of the mixed convolutional module (MCM) is illustrated in Figure 4. As shown in this figure, the module mainly contains four 3D units, three 2D units, and local feature fusion. In the d-th MCM, let F_{d−1} and F_d be the input and output feature maps, respectively. Under the local residual connection, the output F_d of the d-th MCM can be defined as

F_d = f_{3D}(F_{d,f}),

where f_{3D}(·) is the function of the final 3D unit and F_{d,f} is the output of the local feature fusion. Next, we present the details of the proposed two units and the local feature fusion.
Figure 4. Architecture of the d-th mixed convolutional module (MCM). The module mainly contains four 3D units, three 2D units, and local feature fusion. The feature maps from F_{d−1} are first fed into the first 3D unit. The output feature maps of each of the first three 3D units are reshaped and fed into a 2D unit, respectively. Then, the feature maps output from the 2D units at different depths are concatenated together. The more effective features are passed to the final 3D unit after local residual learning. Finally, the output of the module F_d is obtained.

3D Unit
As discussed in Section 2, previous works use spatial and temporal separable 3D convolution to replace the standard 3D convolution for video classification, i.e., a filter of size k × k × k is replaced by filters of size k × 1 × 1 and 1 × k × k, which has been proven to perform better [40]. Therefore, to reduce unaffordable memory usage and training time, in our paper, we use this approach to replace the standard 3D convolution in the 3D unit. Please note that for hyperspectral images, the temporal information corresponds to spectral information.
With respect to the 3D unit (see Figure 5a), a 1 × k × k filter is first adopted to extract the spatial features of each band, and a k × 1 × 1 filter is then used to extract the features between spectra. After each convolution operation, we add a rectified linear unit (ReLU). With the local skip connection, the output of the n-th 3D unit can be formulated as

H_n = σ( f_spe( σ( f_spa( H_{n−1} ) ) ) ) + H_{n−1},

where H_{n−1} and H_n are the input and output of the unit, f_spa(·) and f_spe(·) denote the spatial (1 × k × k) and spectral (k × 1 × 1) convolutions, and σ denotes the ReLU activation function. In this way, the unit not only effectively mines the potential information between spectra, but also speeds up the algorithm.
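A minimal PyTorch sketch of such a 3D unit; the symbol names and the placement of the skip connection are our reading of the description above, not the authors' exact code:

```python
import torch
import torch.nn as nn

class Unit3D(nn.Module):
    """3D unit sketch: spatial (1 x k x k) conv + ReLU, then spectral
    (k x 1 x 1) conv + ReLU, with a local skip connection."""
    def __init__(self, c=64, k=3):
        super().__init__()
        self.spatial = nn.Conv3d(c, c, (1, k, k), padding=(0, k // 2, k // 2))
        self.spectral = nn.Conv3d(c, c, (k, 1, 1), padding=(k // 2, 0, 0))

    def forward(self, x):                  # x: (N, C, L, W, H)
        out = torch.relu(self.spatial(x))  # spatial features of each band
        out = torch.relu(self.spectral(out))  # features between spectra
        return x + out                     # local skip connection
```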

2D Unit
If the network focused on analyzing spectral and spatial information using only 3D convolution, the number of parameters would increase significantly, making it impossible to design a deeper network. The main purpose of using spectral information is to improve the spatial information. From this point of view, it is not necessary for every layer of the designed network to analyze the information between spectra. We should focus more on the spatial features of the image and reduce the feature analysis between spectra. Therefore, we want the designed network not only to explore more spatial information but also to reduce the parameters. Based on this motivation, we add a 2D unit after each 3D unit to mine spatial features.

We now present the proposed 2D unit, which is shown in Figure 5b. Specifically, in order to use 2D convolution, the output feature maps of the 3D unit, of size N × C × L × W × H, are first reshaped into 2D feature maps, where C is the number of channels and N denotes the batch size. Then, two 2D convolutions, each followed by a ReLU activation function, are applied in this unit. Finally, the resulting feature maps are reshaped back to their original size, yielding the output of the n-th 2D unit.

By making use of the correlation between 3D and 2D feature maps, the 3D and 2D convolutions can effectively share spatial information. In addition, 2D spatial features are relatively easy to learn. There are two main benefits of doing so. On the one hand, it promotes the learning of 3D features. On the other hand, compared with using only 3D convolution, the 2D unit greatly reduces the parameters of the network. Furthermore, it also enables the network to mine more spatial features of the hyperspectral image while the spectral information is still being extracted.
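One plausible implementation of the reshaping is to fold the band dimension into the batch axis so that ordinary 2D convolutions act on each band's spatial map. The sketch below makes that assumption and is not necessarily the authors' exact layout:

```python
import torch
import torch.nn as nn

class Unit2D(nn.Module):
    """2D unit sketch: fold bands into the batch axis, apply two 2D
    convolutions with ReLU, then restore the original 5D shape."""
    def __init__(self, c=64, k=3):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, k, padding=k // 2)
        self.conv2 = nn.Conv2d(c, c, k, padding=k // 2)

    def forward(self, x):                                   # x: (N, C, L, W, H)
        n, c, l, w, h = x.shape
        y = x.permute(0, 2, 1, 3, 4).reshape(n * l, c, w, h)  # (N*L, C, W, H)
        y = torch.relu(self.conv2(torch.relu(self.conv1(y))))
        return y.reshape(n, l, c, w, h).permute(0, 2, 1, 3, 4)  # back to 5D
```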

Local Feature Fusion
To make the network learn more useful information, we design a local feature fusion strategy (see Figure 4) to adaptively retain the cumulative features from the 2D units. It enables the network to fully extract hyperspectral image features. Specifically, the features from the different 2D units are first concatenated to learn fused information. To perform local residual learning between the fused result and the input F_{d−1}, it is necessary to reduce the number of feature maps. Thus, we add a convolution layer with a 1 × 1 × 1 filter to adaptively retain the valid information, followed by a ReLU activation function. As a result, the output of the local feature fusion F_{d,f} is formulated as

F_{d,f} = σ( f_c( Concat(O_1, O_2, O_3) ) ) + F_{d−1},

where Concat(·) denotes the concatenation operation, O_n is the output of the n-th 2D unit, f_c(·) is the 1 × 1 × 1 convolution, and σ is the ReLU activation function.
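A sketch of this fusion step, assuming three 2D-unit outputs with C channels each (class and variable names are ours):

```python
import torch
import torch.nn as nn

class LocalFeatureFusion(nn.Module):
    """Concatenate the 2D-unit outputs along the channel axis, squeeze them
    back to C channels with a 1x1x1 convolution + ReLU, then add the module
    input as a local residual."""
    def __init__(self, c=64, n_units=3):
        super().__init__()
        self.fuse = nn.Conv3d(c * n_units, c, kernel_size=1)

    def forward(self, unit_outputs, f_prev):
        # unit_outputs: list of (N, C, L, W, H) tensors; f_prev: (N, C, L, W, H)
        fused = torch.relu(self.fuse(torch.cat(unit_outputs, dim=1)))
        return fused + f_prev              # local residual learning
```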

Skip Connections
As the depth of the network increases, the weakening of information flow and the vanishing of gradients hinder the training of the network. Recently, many ways have been proposed to solve these problems. For instance, He et al. [41] first use skip connections between layers to improve the information flow and make the network easier to train. To fully explore the advantages of skip connections, Huang et al. [42] propose DenseNet. The network has the advantages of strengthening feature propagation, supporting feature reuse, and reducing the number of parameters.
For the SR task, the input low-resolution image is highly similar to the output high-resolution image, i.e., the low-frequency information carried by the low-resolution image is similar to that of the high-resolution image [43]. Based on this characteristic, researchers use dense connections to enhance the information flow of the whole network and alleviate the vanishing of gradients for natural image SR, thus effectively improving the performance of the algorithm. Therefore, we add several global residual connections to our network. Since the shallow network can retain more edge and texture information of the hyperspectral image, the feature maps from IFE are fed into the back of each module, which enhances the performance of the entire network.

Network Learning
For network training, MCNet is optimized by minimizing the difference between the reconstructed hyperspectral image I_SR and the corresponding ground-truth hyperspectral image I_HR. Mean squared error (MSE) is often used as the loss function to learn the parameters of deep learning-based hyperspectral image SR algorithms [25]. Additionally, some methods design two terms in the loss function to minimize the difference, combining MSE and spectral angle mapping (SAM) [22,44]. In fact, these loss functions do not make the network converge better and yield poor results, which is demonstrated in the experiment section. For natural image SR, as far as we know, many networks in recent years use L1 as the loss function, and experiments demonstrate that L1 yields more powerful performance and convergence [17]. Therefore, in this paper, we follow the natural image SR methods and adopt L1 as the loss function of our designed network. The loss function of MCNet is

L(θ) = (1/M) Σ_{i=1}^{M} ‖ I_HR^i − I_SR^i ‖_1,

where M is the number of training patches and θ denotes the parameter set of the MCNet network.
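For reference, the L1 criterion is available directly in PyTorch; a toy example on made-up patches:

```python
import torch

# L1 loss over a mini-batch of reconstructed/ground-truth patch pairs,
# matching the mean-absolute-error form of the MCNet loss function.
criterion = torch.nn.L1Loss()

sr = torch.tensor([[0.2, 0.4], [0.6, 0.8]])   # toy "reconstructed" patch
hr = torch.tensor([[0.0, 0.4], [0.6, 1.0]])   # toy "ground-truth" patch
loss = criterion(sr, hr)
print(loss.item())   # mean of |0.2|, 0, 0, |-0.2|, i.e. about 0.1
```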

Results
To verify the effectiveness of the proposed MCNet, in this section, we first introduce three public datasets. Then, the implementation details and evaluation indexes are described. Finally, we assess the performance of our MCNet by comparisons to the state-of-the-art methods.

Datasets
(a) CAVE dataset: The CAVE dataset (http://www1.cs.columbia.edu/CAVE/databases/ multispectral/ Access date: 29 April 2020) is gathered by a cooled CCD camera at a 10 nm step from 400 nm to 700 nm (31 bands) [45]. The dataset contains 31 scenes, divided into 5 sections: real and fake, skin and hair, paints, food and drinks, and stuff. The size of all hyperspectral images in this dataset is 512 × 512 × 31. Each band is stored as a 16-bit grayscale PNG image.
(b) Harvard dataset: The Harvard dataset (http://vision.seas.harvard.edu/hyperspec/explore. html Access date: 29 April 2020) is obtained by a Nuance FX, CRI Inc. camera in the wavelength range of 400 nm to 700 nm [46]. The dataset consists of 77 hyperspectral images of real-world indoor or outdoor scenes under daylight illumination. The size of each hyperspectral image in this dataset is 1040 × 1392 × 31. Unlike the CAVE dataset, this dataset is stored as .mat files.
(c) Foster dataset: The Foster dataset (https://personalpages.manchester.ac.uk/staff/d.h.foster/ Local_Illumination_HSIs/Local_Illumination_HSIs_2015.html Access date: 29 April 2020) is collected using a low-noise Peltier-cooled digital camera (Hamamatsu, model C4742-95-12ER) [47]. The dataset includes 30 images from the Minho region of Portugal during late spring and summer of 2002 and 2003. Each hyperspectral image has 33 bands with the size of 1204 × 1344 pixels. Similarly, the dataset is also stored as .mat file. Some RGB images corresponding to hyperspectral images are shown in Figure 6.

Implementation Details
As mentioned earlier, different datasets are gathered by different hyperspectral cameras, so we need to train and test each dataset individually, which is different from the natural image SR. In our work, 80% of the samples are randomly selected as training set, and the rest are used for testing.
For the training phase, since there are too few images in these datasets for a deep learning algorithm, we augment the training data by randomly selecting 24 patches. Each patch is flipped horizontally, rotated (90°, 180°, and 270°), and scaled (1, 0.75, and 0.5). According to the scale factor r, these patches are downsampled into low-resolution hyperspectral images of size 32 × 32 × L by bicubic interpolation [48]. Before feeding a mini-batch into our network, we subtract the average value of the entire training set from the patches. In our work, we set the filter size in the 3D unit to 3 × 1 × 1 and 1 × 3 × 3 in each convolution layer, except those for initial feature extraction and image reconstruction (where the filter size is set to 3 × 3 × 3). The filter size in the 2D unit is set to 3 × 3. The number of filters for all layers in our network is 64. We initialize each convolutional filter using the method of [49]. The ADAM optimizer with β1 = 0.9, β2 = 0.999 is employed to train our network. The learning rate is initialized as 10^-4 for all layers and decreases by half every 35 epochs.
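The optimizer and learning-rate schedule described above map directly onto PyTorch's Adam and StepLR; the parameter tensor below is a stand-in for the network's weights:

```python
import torch

# ADAM with beta1=0.9, beta2=0.999, initial lr 1e-4, halved every 35 epochs.
params = [torch.nn.Parameter(torch.zeros(3, 3))]   # stand-in for MCNet weights
optimizer = torch.optim.Adam(params, lr=1e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=35, gamma=0.5)

for epoch in range(70):
    # ... forward pass, L1 loss, backward pass would go here ...
    optimizer.step()
    scheduler.step()

print(optimizer.param_groups[0]["lr"])   # halved twice: 2.5e-5
```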
For the test phase, in order to improve the efficiency of the test, we only use the top left 512 × 512 region of each test image for evaluation. Our method is conducted using the PyTorch framework with NVIDIA GeForce GTX 1080 GPU.

Evaluation Metrics
To quantitatively measure the proposed MCNet, three evaluation metrics are employed to verify the effectiveness of the algorithm: peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and spectral angle mapping (SAM). They are defined as

PSNR = (1/L) Σ_{l=1}^{L} 10 log10( (MAX_l)^2 / MSE_l ) (10)

SSIM = (1/L) Σ_{l=1}^{L} [ (2 μ_{I_SR}^l μ_{I_HR}^l + c_1)(2 σ_{I_SR I_HR}^l + c_2) ] / [ ((μ_{I_SR}^l)^2 + (μ_{I_HR}^l)^2 + c_1)((σ_{I_SR}^l)^2 + (σ_{I_HR}^l)^2 + c_2) ] (11)

SAM = arccos( ⟨I_SR, I_HR⟩ / (‖I_SR‖_2 ‖I_HR‖_2) ) (12)

where MAX_l is the maximal pixel value for the l-th band, MSE_l is the mean squared error between I_SR and I_HR for the l-th band, μ_{I_SR}^l and μ_{I_HR}^l denote the means of I_SR and I_HR for the l-th band, respectively, σ_{I_SR}^l and σ_{I_HR}^l are the standard deviations of I_SR and I_HR for the l-th band, σ_{I_SR I_HR}^l is the covariance of I_SR and I_HR for the l-th band, c_1 and c_2 are two constants, ⟨·, ·⟩ represents the dot product operation, and ‖·‖_2 is the l2 norm.
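The band-averaged PSNR and the SAM metric can be computed with NumPy as follows; this is a straightforward reading of the definitions above, and the (L, W, H) array layout is our assumption:

```python
import numpy as np

def psnr(sr, hr, max_val=1.0):
    """Band-wise PSNR averaged over the L bands; inputs have shape (L, W, H)."""
    mse = np.mean((sr - hr) ** 2, axis=(1, 2))          # MSE per band
    return float(np.mean(10 * np.log10(max_val ** 2 / mse)))

def sam(sr, hr, eps=1e-12):
    """Mean spectral angle (radians) between per-pixel spectra; shape (L, W, H)."""
    sr2 = sr.reshape(sr.shape[0], -1)                   # (L, n_pixels)
    hr2 = hr.reshape(hr.shape[0], -1)
    num = np.sum(sr2 * hr2, axis=0)                     # per-pixel dot product
    den = np.linalg.norm(sr2, axis=0) * np.linalg.norm(hr2, axis=0) + eps
    return float(np.mean(np.arccos(np.clip(num / den, -1.0, 1.0))))
```

Since SAM only measures the angle between spectra, it is invariant to a uniform scaling of the spectrum, which is why it complements the intensity-based PSNR and SSIM.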
In general, the larger the PSNR and SSIM and the smaller the SAM, the better the quality of the reconstructed hyperspectral image.

Model Analysis
In this section, we conduct sufficient experiments, including a study of the number of modules D and an ablation study. For a simple and fair comparison, we analyze the results for scale factor ×2 on the CAVE dataset.

Study of D Module
The structure of our proposed MCNet is determined by the number of mixed convolutional modules D. Thus, we vary D from 2 to 5 to analyze the effect of this parameter on the performance using the three evaluation metrics. The results are displayed in Table 1. It can be seen that when D is set to different values, all three indicators change to a certain degree. Specifically, the values of SAM and SSIM remain basically the same. Compared with these two results, the values of PSNR increase significantly when D < 5. However, every evaluation index decreases when D is set to 5, especially PSNR. In our view, there are two main reasons for this phenomenon: the increase in network parameters caused by the use of more 3D convolutions, and the greater depth of the network. Both make the network harder to train. In summary, we empirically set the parameter D to 4 in our paper.

Ablation Study

Table 2 shows the ablation study on the impacts of the 2D unit (2U), local feature fusion (LFF), and global residual learning (GRL). We set different combinations of components to analyze the performance of the proposed MCNet. For a simple and fair comparison, our network with 4 modules is adopted for the ablation investigation. First, with no 2U, GRL, or LFF components, only 3D units are included in the deep feature extraction (DFE) sub-network (this network is defined as the baseline). It yields the worst performance, mainly because it lacks adequate learning of effective features; this also shows that spectral and spatial features cannot be extracted well without these components. Thus, these components are required in our network. Then, we add one of these components to the baseline, and the performance of the network improves in PSNR and SAM. Accordingly, when two of these components are added to the baseline, the evaluation indexes attain relatively better results than in the previous evaluations.
In short, the experiments demonstrate that each component clearly enhances the performance of the network, indicating that each component plays a key role in making the network easier to train. Finally, all three components are attached to the baseline. The table shows that the results with three components are significantly better than those with only one or two, which reveals the effectiveness and benefits of the proposed components.

Comparisons with the State-of-the-Art Methods
In this section, we adopt three public hyperspectral image datasets to evaluate the effectiveness of our MCNet against five existing SR methods: Bicubic [48], GDRRN [22], 3D-FCNN [26], EDSR [20], and SSRNet [29]. Table 3 depicts the quantitative evaluation of the state-of-the-art SR algorithms by average PSNR/SSIM/SAM for different scale factors.
As shown in the table, our method achieves better results than the other algorithms on the CAVE dataset. Specifically, Bicubic produces the worst performance among these competitors. For the GDRRN algorithm, all the results are slightly higher than the worst (Bicubic) but lower than the other methods. This is caused by the addition of a SAM term in the loss function; as a result, the network cannot fully optimize the difference between the reconstructed and the high-resolution image. Furthermore, the results of 3D-FCNN in PSNR and SSIM are lower than those of EDSR, but the performance of 3D-FCNN in SAM is obviously higher than that of EDSR, which is due to the fact that 3D-FCNN uses 3D convolution to extract the spectral features of the hyperspectral image. Thus, this algorithm can well avoid spectral distortion in the reconstructed hyperspectral image. However, the image obtained by 3D-FCNN loses part of the bands (the algorithm only obtains 23 bands for a hyperspectral image with 31 bands), which is not suitable for image SR. For the SSRNet algorithm, the results are better than those of the previous four methods. Compared with the existing SR approaches, our method obtains excellent performance. For scale factor ×4, the proposed method is significantly superior to the second-best algorithm (SSRNet) in terms of the three evaluation metrics (+0.082 dB, +0.0007, and −0.005). Similarly, MCNet outperforms the other competitors on the Harvard dataset, except in SAM. Concretely, unlike on the CAVE dataset, GDRRN and 3D-FCNN achieve approximately the same results, because the number of hyperspectral images in the augmented Harvard dataset is larger than that in the CAVE dataset. This is more beneficial for training networks with many parameters, such as EDSR and SSRNet. Moreover, in most cases, it also enables our approach to achieve higher performance on this dataset for scale factor ×4.
Likewise, the proposed approach achieves good performance in comparison to existing state-of-the-art methods on Foster dataset, particularly in SSIM and SAM.
In Figures 7-9, we show visual comparisons with different algorithms for scale factor ×4 on the three datasets. The figures only provide visual results for the 27-th band of three typical scenes. Since the ground-truth images are grayscale, in order to clearly observe the difference between the reconstructed hyperspectral image and the ground-truth, the absolute error map between them is presented. In general, the bluer the absolute error map, the better the reconstructed image. Please note that each hyperspectral image is normalized. From these figures, we can see that the proposed MCNet obtains very low absolute error. In some regions, especially at the edges of the image, our method leaves only faint edge information, or none at all, in the error map. This means our proposed MCNet generates more realistic visual results than the other methods, which is consistent with our analysis of Table 3. We also visualize the spectral distortion of the reconstructed images by drawing spectral curves for three scenes, which are presented in Figures 10-12. Since no convolutions are padded during reconstruction for 3D-FCNN, the actual output of the network is smaller than the input; we only show some of the bands for this algorithm. To alleviate the problem caused by random selection, we selected three pixel positions ((20, 20), (100, 100), and (340, 340)) to analyze the distortion of the spectrum. As shown in Figure 10, the spectral curves of all competitors are basically consistent with that of the ground-truth for one image (fake_and_real_lemons). With respect to the two images (imgd5 and Bom_Jesus_Bush) in Figures 11 and 12, it can be seen that the distortion for 3D-FCNN is the most severe. The distortion of the spectral curve obtained by Bicubic is relatively small compared with 3D-FCNN. Moreover, the curves of the other methods deviate to a certain extent from the corresponding ground-truth.
However, the results of our method are much closer to the ground-truth in most cases, which proves that our algorithm attains higher spectral fidelity. To show the spectral distortion at the three pixel positions more clearly, we also present spectral distortion comparisons for the three scenes by calculating SAM (see Table 4). As displayed in this table, the SAM values of our method are better than those of the other algorithms in most cases. In summary, MCNet does not just outperform state-of-the-art SR algorithms in quantitative evaluation, but also yields more realistic visual results.

Application on Real Hyperspectral Image
In this section, we apply MCNet to a real hyperspectral image dataset to demonstrate its applicability. The real hyperspectral images were collected by a progressive-scanning monochrome digital camera. This dataset (https://personalpages.manchester.ac.uk/staff/d.h.foster/Hyperspectral_ images_of_natural_scenes_02.html Access date: 29 April 2020) has 30 scenes, such as rocks and trees [50]. The size of each scene differs, but there are still 31 bands in each scene. In our work, the images of eight representative scenes identified in [50] are used to demonstrate its applicability. Due to hardware limitations, we only use the top-left 260 × 260 region of each hyperspectral image for evaluation.
Because there is no reference image for evaluation, traditional evaluation metrics (such as PSNR and SSIM) cannot be used here. Thus, a universal no-reference image quality evaluation method (i.e., NIQE [51]) is adopted to evaluate the performance of the reconstruction. Table 5 shows the no-reference image quality assessment of the existing SR methods. It can be seen from the table that our method also achieves good results on the real hyperspectral image dataset. This is consistent with our results in Table 3 and demonstrates that the proposed algorithm has strong applicability. Since there is no reference image, absolute error maps cannot be displayed. Therefore, we only provide visual results for the 27-th band in Figure 13. One can observe that our method generates sharper edges and clearer structures than the other algorithms.

Discussions
In this section, we discuss the impact of two factors on the performance of the algorithm: the loss function and the type of 3D unit. As before, the experiments are conducted on the CAVE dataset for scale factor ×2 to illustrate the influence.

Loss Function Analysis
To demonstrate the effect of different loss functions, the loss functions of [25,44] and the L1 loss used in our work are employed to train MCNet. The evaluation results are shown in Table 6. When SAM is added to the loss function, it is clear that the spatial resolution changes and the spectral distortion becomes more serious. Moreover, the loss function containing both MSE and SAM obtains a lower PSNR value, mainly because this loss function weakens the spatial-resolution performance. As seen from this table, the L1 loss in our paper achieves the best performance among these loss functions on all three indexes. This verifies that our method can effectively optimize the difference between I_SR and I_HR using L1.

Efficiency Study of 3D Unit
In this section, we study the efficiency of the proposed 3D unit using two different types of convolution in the module: standard 3D convolution and separable 3D convolution. In the first case, we use the 3D unit with separable 3D convolution; in the second, we use standard 3D convolution with the intermediate ReLU activation function removed. Please note that the convolution operations in initial feature extraction and image reconstruction are not replaced by separable 3D convolution in our network. The comparison results are shown in Table 7. Obviously, our proposed 3D unit greatly reduces the number of parameters, which can effectively reduce the memory footprint. With respect to PSNR, the result with standard 3D convolution is lower than that with separable 3D convolution. We believe the network then has too many parameters, which makes it more difficult to train and leads to a decline in performance. Moreover, the training time with separable 3D convolution is lower than with standard 3D convolution, which mainly benefits from the reduction in the number of parameters. Generally speaking, when the two methods are adopted to perform the SR task, the results are approximately the same, except for training time. This also verifies the effectiveness of the proposed algorithm when a small number of parameters is used.

Conclusions
Most existing models do not pay much attention to mining the spatial information of hyperspectral images while the spectral information is being extracted. To deal with this issue, in our paper, we develop a mixed 2D/3D convolutional network (MCNet) to reconstruct hyperspectral images, claiming the following contributions: (1) we propose a novel mixed convolutional module (MCM) to mine the potential features with 2D/3D convolution instead of a single type of convolution; (2) to reduce the parameters of the designed network, we employ separable 3D convolution to extract spatial and spectral features respectively, thus reducing unaffordable memory usage; and (3) we design a local feature fusion strategy to make full use of all the hierarchical features in each 2D unit after changing the size of the feature maps. Extensive benchmark evaluations demonstrate that our MCNet does not just outperform state-of-the-art SR algorithms, but also yields more realistic visual results.
In the future, we plan improvements in two aspects. First, in the mixed convolutional module (MCM), the network does not use the results of the 2D units as effectively as it could; it only concatenates this information for analysis. Making fuller use of each 2D unit could therefore optimize the structure of the network. Second, compared with 2D convolution, the use of 3D convolution still results in a significant increase in the number of parameters. From this point of view, the network could use more 2D units and fewer 3D units, thus effectively reducing the number of parameters.