Residual Augmented Attentional U-Shaped Network for Spectral Reconstruction from RGB Images

Abstract: Deep convolutional neural networks (CNNs) have been successfully applied to spectral reconstruction (SR) and have achieved superior performance. Nevertheless, existing CNN-based SR approaches integrate hierarchical features from different layers indiscriminately, without investigating the relationships among intermediate feature maps, which limits the learning power of CNNs. To tackle this problem, we propose a deep residual augmented attentional u-shaped network (RA²UN) built from several double improved residual blocks (DIRB) instead of paired plain convolutional units. Specifically, a trainable spatial augmented attention (SAA) module is developed to bridge the encoder and decoder and emphasize the features in informative regions. Furthermore, we present a novel channel augmented attention (CAA) module, embedded in the DIRB, to adaptively rescale features and enhance residual learning by using first-order and second-order statistics for stronger feature representations. Finally, a boundary-aware constraint is employed to focus on salient edge information and recover more accurate high-frequency details. Experimental results on four benchmark datasets demonstrate that the proposed RA²UN network outperforms state-of-the-art SR methods under quantitative measurements and perceptual comparison.


Introduction
Hyperspectral imaging systems can record the actual scene spectra over a large set of narrow spectral bands [1]. In contrast to ordinary cameras, which record only the reflectance or transmittance of three spectral bands (i.e., Red, Green, and Blue), hyperspectral spectrometers can encode hyperspectral images (HSIs) by obtaining a continuous spectrum at each pixel of the object. The abundant spectral signatures are beneficial to many computer vision tasks, such as face recognition [2], image classification [3,4] and object tracking [5].
Traditional scanning HSI acquisition systems rely on either 1D line or 2D plane scanning (e.g., whiskbroom [6], pushbroom [7] or variable-filter technology [8]) to encode the spectral information of the scene. Whiskbroom imaging devices apply mirrors and fiber optics to collect reflected hyperspectral signals point by point. The subsequent pushbroom HSI acquisition systems capture HSIs with dispersive optical elements and light-sensitive sensors in a line-by-line scanning manner. As for variable-filter imaging equipment, it senses each scene point multiple times, each time in a different spectral band. In fact, the scanning operation of these devices is extremely time-consuming, which severely limits the application of HSIs under dynamic conditions.
To make HSI acquisition of dynamic scenes available, scan-free or snapshot hyperspectral technologies have been explored, e.g., coded aperture snapshot spectral imagers [9], mosaic [10], and light-field [11] systems. A computed-tomography imaging spectrometer converts a three-dimensional object cube into multiplexed two-dimensional projections, and these data can later be used to reconstruct the hyperspectral cube computationally [12,13]. The coded aperture snapshot spectral imager uses compressed sensing advances to achieve snapshot spectral imaging, and an iterative algorithm is used to reconstruct the data cube [9,14]. A novel hyperspectral imaging system combines a stereo camera to perform accurate HSI measurements through geometrical alignment, radiometric calibration and normalization [10]. However, these systems depend on post-processing with huge computational complexity and record HSIs with decreased spatial and spectral resolution. Meanwhile, the deployment of these facilities remains prohibitively expensive and complicated.
Due to the limitations of scanning and snapshot hyperspectral systems, spectral reconstruction from ubiquitous RGB images has attracted extensive attention as an alternative solution, i.e., given an RGB image, the corresponding HSI with higher spectral resolution is recovered by fulfilling a three-to-many mapping directly. Obviously, SR is an ill-posed inverse problem. Early work on SR leveraged sparse coding or shallow learning models to rebuild HSI data [15][16][17][18][19]. Nguyen et al. [15] trained a shallow radial basis function network that leveraged RGB white-balancing to normalize the scene illuminations and further recover the scene reflectance spectra. Later, Robles-Kelly [16] extracted a set of reflectance properties from the training set and obtained convolutional features using sparse coding to perform spectral reconstruction. Typically, Arad [17] and Aeschbacher et al. [19] exploited potential HSI priors to create an over-complete sparse dictionary of hyperspectral signatures and corresponding RGB projections, which facilitated the subsequent reconstruction of the HSIs. More recently, with the aid of low-rank constraints, Zhang et al. [20] proposed to make full use of the high-dimensionality structure of the desired HSI to boost the reconstruction quality. Unfortunately, these methods only model low-level, simple correlations between RGB images and hyperspectral signals, which limits their expressive ability and leads to poor performance in challenging situations. Accordingly, it is indispensable to further improve the results of the reconstructed HSIs for SR.
Recently, witnessing the great success of CNNs in the field of hyperspectral spatial super-resolution [21,22], numerous CNN-based algorithms have been explored for the SR task [23][24][25][26][27][28]. For example, Galliani et al. [23] modified a high-performance network originally designed for semantic segmentation to learn the statistics of natural image spectra and generated finely resolved HSIs from the RGB inputs. This was a milestone work, since it introduced deep learning into the SR task for the first time. To promote research on SR, the NTIRE 2018 challenge on spectral reconstruction from RGB images was organized as the first SR challenge [29]. Meanwhile, a great number of excellent approaches were proposed in this competition [30][31][32][33][34]. Impressively, Shi et al. [34] designed a deep HSCNN-R network consisting of multiple residual blocks and achieved promising performance, developed from their previous HSCNN model [25]. Stiebel et al. [30] investigated a lightweight U-net and added a simple pre-processing layer to enhance the quality of recovery in a real-world scenario. Not long ago, the second SR challenge, NTIRE 2020 on spectral reconstruction from RGB images [35], was successfully held and a new dataset was released, which further promoted the development of CNN-based SR methods [36][37][38][39][40][41] as well as more recent works [42][43][44][45]. To explore the interdependencies among intermediate features and the camera spectral sensitivity prior, Li et al. [36] proposed an adaptive weighted attention network and incorporated the discrepancies between the RGB images and HSIs into the loss function. As the winning method on the "Real World" track of the second SR competition, Zhao et al. [37] organized a 4-level hierarchical regression network with a PixelShuffle layer for inter-level interaction. Hang et al. 
[44] attempted to design a decomposition model to reconstruct HSIs and a self-supervised network to fine-tune the reconstruction results. Li et al. [45] presented a hybrid 2D-3D deep residual attentional network to take full advantage of the spatial-spectral context information. These two SR challenges are divided into the "Clean" and "Real World" tracks. The "Clean" track aims to recover HSIs from noise-free RGB images created by a known camera response function, while the "Real World" one requires participants to rebuild the HSIs from JPEG-compressed RGB images obtained by an unknown camera response function. It is worth noting that the camera response functions for the same tracks of the two challenges are different. Also, to provide a more accurate simulation of physical camera systems, the NTIRE2020 "Real World" track is updated with additional simulated camera noise and a demosaicing operation.
Attention mechanisms have been a useful tool in a variety of tasks, for instance, image captioning [46], classification [47,48], single image super-resolution [49][50][51], and person re-identification [52]. Chen et al. [46] proposed SCA-CNN, which incorporated spatial and channel-wise attention for image captioning. Dai et al. [50] presented a deep second-order attention network by exploring second-order statistics of features rather than first-order ones (e.g., global average pooling) [47]. Zhang et al. [53] proposed an effective relation-aware global attention module which captured global structural information for better attention learning. Only a few very recent SR methods [36,37,45] have considered a channel-wise attention mechanism using first-order statistics.
Compared with the previous sparse recovery and shallow mapping methods, the end-to-end training paradigm and discriminative representation learning of CNNs bring considerable improvements to SR. However, existing CNN-based SR approaches are devoted to realizing the RGB-to-HSI mapping solely by designing deeper and wider network frameworks, which integrate hierarchical features from different layers without distinction and fail to explore the feature correlations of intermediate layers, thus hindering the expression capacity of CNNs. In fact, the importance of the information in different spatial regions of a feature map differs in the SR task. The feature responses among channels also play different roles in SR performance. Additionally, most CNN-based SR models do not consider the problem of spectral aliasing at edge positions, resulting in relatively low performance.
To address these issues, a deep residual augmented attentional u-shaped network (RA²UN) is proposed for SR. Concretely, the backbone of the proposed network is stacked with several double improved residual blocks (DIRB) rather than paired plain convolutional units to extract increasingly abstract feature representations through powerful residual learning. Moreover, we develop a novel spatial augmented attention (SAA) module to bridge the encoder and decoder, which is employed to selectively highlight the features in informative regions and boost the spatial feature representations. To model interdependencies among channels of intermediate feature maps, a trainable channel augmented attention (CAA) module embedded in the DIRB is presented to adaptively recalibrate channel-wise feature responses by exploiting first-order and second-order statistics. Such CAA modules make the network dynamically focus on useful features and further strengthen the intrinsic residual learning of DIRBs. Finally, we establish a boundary-aware constraint to guide the network to pay close attention to salient information at boundary locations, which can alleviate spectral aliasing at edge positions and recover more accurate edge details.
In summary, the main contributions of this paper are as follows:
• We propose a novel RA²UN network constituted of several DIRB blocks instead of paired plain convolutional units for SR, which can extract increasingly abstract feature representations through powerful residual learning. Experimental results on four established benchmarks demonstrate that the proposed RA²UN network outperforms state-of-the-art SR methods under quantitative measurements and perceptual comparison.
• A trainable SAA module is developed to bridge the encoder and decoder and selectively emphasize the features in informative regions, which can effectively strengthen the interaction and fusion between low-level and high-level features and further boost the spatial feature representations.
• To model interdependencies among channels of intermediate feature maps, we present a novel CAA module embedded in the DIRB to adaptively recalibrate channel-wise feature responses and enhance residual learning by using first-order and second-order statistics for stronger feature expression.
• A boundary-aware constraint is established to guide the network to focus on salient edge information, which helps alleviate spectral aliasing at edge positions and preserve more accurate high-frequency details.

Materials and Methods
2.1. The Proposed RA²UN Network

Figure 1 gives an illustration of our proposed RA²UN network. The backbone architecture mainly consists of several DIRB blocks. The SAA module bridges the corresponding DIRB counterparts between the encoder and decoder, and a CAA module is embedded in each DIRB. Within each DIRB, batch normalization layers are not used, since the normalization operation can impair the network's ability to learn spatial dependencies and the spectral distribution. Meanwhile, we adopt the Parametric Rectified Linear Unit (PReLU) instead of ReLU as the activation function to introduce more nonlinear representation and obtain stronger robustness. The entire DIRB is formulated as

y = ρ(x + R(x, W_{l,1})), z = ρ(y + R(y, W_{l,2})),

where x and z denote the input and output of the DIRB block, y is the output of the first residual unit of the DIRB block, and W_{l,1} and W_{l,2} represent the weight matrices of the first and second residual units of the l-th DIRB block. R(·) denotes the residual mapping to be learned, which comprises two convolutional layers and one PReLU function, and ρ is the PReLU function. Our proposed RA²UN keeps the spatial resolution of feature maps unchanged throughout the model, which maintains plentiful spatial detail for recovering the accurate spectrum from the RGB inputs. The specific parameter settings of the backbone framework are given in Table 1. It can be seen that the output size of each DIRB of our RA²UN is not decreased in the encoding and decoding parts, i.e., we remove the down-sampling operation, which would otherwise lose partial spatial details and fail to retain the original pixel information as the network goes deeper, inevitably reducing the accuracy of SR. In the encoder section, a simple convolutional layer is first employed to extract shallow features from the input images. Then several DIRBs are stacked for deep feature extraction. Finally, the reconstruction is performed via one convolutional layer.
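To make the DIRB formulation concrete, the following minimal sketch traces the two stacked residual units. The sizes are hypothetical and the residual mapping R(·) is stood in by random linear maps rather than actual convolutional layers:

```python
import numpy as np

rng = np.random.default_rng(0)

def prelu(x, a=0.25):
    """PReLU activation rho(.), with slope a for negative inputs."""
    return np.where(x > 0, x, a * x)

def residual_mapping(x, w1, w2):
    """Toy stand-in for R(.): two layers (plain linear maps here,
    convolutions in the actual network) separated by one PReLU."""
    return prelu(x @ w1) @ w2

# One DIRB on a flattened feature vector of size C (hypothetical size).
C = 8
x = rng.standard_normal(C)
w11, w12 = rng.standard_normal((C, C)), rng.standard_normal((C, C))
w21, w22 = rng.standard_normal((C, C)), rng.standard_normal((C, C))

y = prelu(x + residual_mapping(x, w11, w12))  # first residual unit
z = prelu(y + residual_mapping(y, w21, w22))  # second residual unit
print(z.shape)  # (8,)
```

The identity shortcut (the `x +` and `y +` terms) is what carries the residual learning; the CAA module described below is inserted inside this block in the full network.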

2.2. Spatial Augmented Attention Module
In general, the importance of the information in different spatial regions of the feature map differs in the SR task. To focus more attention on the features in informative regions, an SAA module is designed between the encoder and the decoder, which can effectively boost the interaction and fusion between low-level and high-level features. The specific diagram of the SAA module is displayed in Figure 2. Our proposed SAA module consists of paired symmetric and asymmetric convolutional groups. The asymmetric convolutions use 1D horizontal and vertical kernels (i.e., of sizes 1 × 3 and 3 × 1), which not only strengthen the square convolution kernels but also capture multi-direction contextual information to obtain discriminative spatial dependencies. Given an intermediate feature map denoted as F = [f_1, f_2, ..., f_c, ..., f_C], containing C feature maps with spatial size H × W, we first feed F to the parallel paired symmetric and asymmetric convolutional groups

C_1 = Conv^{3×1}_{1,2}(ρ(Conv^{1×3}_{1,1}(F))), C_2 = Conv^{1×3}_{2,2}(ρ(Conv^{3×1}_{2,1}(F))), C_3 = Conv^{3×3}_{3,2}(ρ(Conv^{3×3}_{3,1}(F))),

where ρ denotes the PReLU activation function. Conv^{1×3}_{1,1}(·), Conv^{3×1}_{2,1}(·) and Conv^{3×3}_{3,1}(·) project the feature F ∈ R^{C×H×W} to a lower size R^{C/t×H×W} along the channel dimension. Then the next convolution layers Conv^{3×1}_{1,2}(·), Conv^{1×3}_{2,2}(·) and Conv^{3×3}_{3,2}(·) map the low-dimensional features to the multi-direction spatial feature descriptors C_1, C_2, C_3 ∈ R^{1×H×W}, which contain rich contextual information. Besides, this design adds only a small number of parameters and little computational burden. To compute the spatial attention, the feature descriptors are summed and normalized to [0, 1] through a sigmoid activation σ

A_s(F) = σ(C_1 + C_2 + C_3),

where A_s(F) ∈ R^{1×H×W} represents the spatial attention, which encodes the degree of importance of the spatial positions of the original feature F and determines which spatial locations should be emphasized. Finally, we perform the element-wise multiplication ⊗ between A_s(F) and F:

F_s = A_s(F) ⊗ F,

where F_s is the refined feature.
During this processing, the spatial attention values are broadcast along the channel dimension. The SAA module thus bridges the encoder and decoder to selectively highlight the features in important regions and boost the spatial feature representations.
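The attention computation itself reduces to a sum, a sigmoid, and a broadcast multiplication. A minimal numerical sketch follows, where the three branch descriptors C1, C2, C3 are random stand-ins for the outputs of the convolutional groups:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

C, H, W = 4, 6, 6
F = rng.standard_normal((C, H, W))

# Stand-ins for the branch outputs C1, C2, C3 (each 1 x H x W);
# in the real module these come from the 1x3/3x1 and 3x3 conv groups.
C1, C2, C3 = (rng.standard_normal((1, H, W)) for _ in range(3))

A_s = sigmoid(C1 + C2 + C3)   # spatial attention map, values in (0, 1)
F_s = A_s * F                 # broadcast along the channel dimension

assert F_s.shape == (C, H, W)
```

The multiplication relies on NumPy broadcasting to replicate the single-channel attention map across all C channels, mirroring the broadcast described in the text.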

2.3. Channel Augmented Attention Module
In contrast to the preceding SAA module, which extracts inter-spatial relationships of features, our presented CAA module explores inter-channel dependencies of features for SR. To obtain a more powerful learning capability of the network, we present a novel CAA module to model interdependencies between channels by using first-order and second-order statistics jointly for stronger feature representations (see Figure 3). We first aggregate the spatial information of the feature map F ∈ R^{C×H×W} (F = [f_1, f_2, ..., f_c, ..., f_C], f_c ∈ R^{H×W}) by global average pooling

s^{1st}_c = (1/(H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} f_c(i, j),

where s^{1st}_c denotes the c-th element of the first-order channel descriptor S^{1st} ∈ R^C and f_c(i, j) is the response at location (i, j) of the c-th feature map f_c. For the second-order channel descriptor, we reshape the feature map F ∈ R^{C×H×W} into a feature matrix D ∈ R^{C×n}, n = H × W, and compute the sample covariance matrix

X = D Ī D^T, with Ī = (1/n)(I − (1/n)1),

where X ∈ R^{C×C}, X = [x_1, x_2, ..., x_c, ..., x_C], x_c ∈ R^{1×C}, and I and 1 represent the n × n identity matrix and the matrix of all ones, respectively. Then the c-th element of the second-order statistics S^{2nd} ∈ R^C is formulated as

s^{2nd}_c = (1/C) Σ_{i=1}^{C} x_c(i),

where s^{2nd}_c denotes the c-th element of the second-order channel descriptor S^{2nd} and x_c(i) is the i-th value of the c-th row x_c. To make use of the aggregated information S^{1st} and S^{2nd}, both descriptors are fed into a shared multi-layer perceptron (MLP) followed by a sigmoid function to generate the channel attention. The MLP consists of two fully connected (FC) layers and a non-linear PReLU function, where the output dimension of the first FC layer is R^{C/r} and that of the second one is R^C; r is the reduction ratio. In summary, the channel attention map is given by

A_c(F) = σ(FC_2(ρ(FC_1(S^{1st}))) + FC_2(ρ(FC_1(S^{2nd})))),

where FC_1(·) and FC_2(·) are the weight sets of the two FC layers.
A_c(F) ∈ R^C denotes the channel attention, recording the importance of and interdependencies among channels, which is used to rescale the original input feature F:

F_c = A_c(F) ⊗ F,

where ⊗ is element-wise multiplication and the channel attention values are copied along the spatial dimensions according to the broadcast mechanism. Inserted into the DIRB block, the CAA module can recalibrate channel-wise feature responses adaptively and enhance residual learning.
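The first- and second-order descriptors and the rescaling step can be sketched numerically as follows. The MLP weights are random stand-ins, and summing the two MLP outputs before the sigmoid is an assumed way of combining the descriptors:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def prelu(x, a=0.25):
    return np.where(x > 0, x, a * x)

C, H, W, r = 8, 5, 5, 4
F = rng.standard_normal((C, H, W))
n = H * W

# First-order descriptor S^1st: global average pooling per channel.
S1 = F.reshape(C, n).mean(axis=1)                 # shape (C,)

# Second-order descriptor: sample covariance X = D I_bar D^T.
D = F.reshape(C, n)
I_bar = (np.eye(n) - np.ones((n, n)) / n) / n
X = D @ I_bar @ D.T                               # shape (C, C)
S2 = X.mean(axis=1)                               # row means, shape (C,)

# Shared two-FC-layer MLP with reduction ratio r (random stand-in weights).
W1 = rng.standard_normal((C, C // r))
W2 = rng.standard_normal((C // r, C))
mlp = lambda s: prelu(s @ W1) @ W2

A_c = sigmoid(mlp(S1) + mlp(S2))                  # channel attention, (C,)
F_c = A_c[:, None, None] * F                      # rescale along channels

assert F_c.shape == (C, H, W)
```

Note that X is symmetric by construction, so averaging over rows or columns of the covariance matrix gives the same second-order descriptor.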

2.4. Boundary-Aware Constraint
In the process of hyperspectral imaging, spectral aliasing at edge positions occurs easily, so the reconstruction accuracy of the boundary spectrum is low. To alleviate the spectral aliasing and recover more accurate high-frequency details of HSIs, we establish a boundary-aware constraint to guide the training process of the proposed RA²UN:

l = l_m + τ l_b, with l_m = (1/N) Σ_{p=1}^{N} |I_gt(p) − I_SR(p)| / I_gt(p) and l_b = (1/N) Σ_{p=1}^{N} (|∇_x I_gt(p) − ∇_x I_SR(p)| + |∇_y I_gt(p) − ∇_y I_SR(p)|),

where l_m represents the mean relative absolute error (MRAE) loss term, which minimizes the numerical error between the ground truths I_gt and the reconstructed results I_SR, and l_b denotes the boundary-aware constraint component, which leads the network to focus on the salient edge information simultaneously. τ is a weighting parameter, N is the total number of pixels, and ∇_x and ∇_y denote the gradients of an image I in the x and y directions, respectively. To better observe the effect of edge extraction, we visualize several example images in Figure 4. The first row shows several original images from the NTIRE2020 dataset; the second row displays the effect of edge extraction. From a mathematical perspective, compared with the single MRAE loss term l_m, the compound loss function l shrinks the space of possible three-to-many mapping functions for the ill-posed SR problem and avoids falling into a local minimum, yielding more accurate spectral recovery, which will be demonstrated in Section 4.1. Finally, τ is empirically set to 1.0 in the proposed network.
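A minimal sketch of the compound loss is given below. The exact form of l_b is an assumption here (mean absolute difference of horizontal and vertical image gradients); it is one plausible reading of the boundary-aware term rather than the paper's verbatim formula:

```python
import numpy as np

def mrae(gt, pred, eps=1e-8):
    """MRAE loss term l_m: mean relative absolute error."""
    return np.mean(np.abs(gt - pred) / (gt + eps))

def boundary_loss(gt, pred):
    """Assumed l_b: absolute difference of x/y image gradients,
    penalizing mismatched edges between ground truth and prediction."""
    gx_gt, gy_gt = np.gradient(gt, axis=(-2, -1))
    gx_p, gy_p = np.gradient(pred, axis=(-2, -1))
    return np.mean(np.abs(gx_gt - gx_p) + np.abs(gy_gt - gy_p))

def compound_loss(gt, pred, tau=1.0):
    """l = l_m + tau * l_b, with tau = 1.0 as in the paper."""
    return mrae(gt, pred) + tau * boundary_loss(gt, pred)

rng = np.random.default_rng(0)
gt = rng.uniform(0.1, 1.0, size=(31, 16, 16))    # toy 31-band HSI patch
pred = gt + 0.01 * rng.standard_normal(gt.shape)
print(compound_loss(gt, pred))
```

The gradient term vanishes only when the predicted edges align with the ground-truth edges, which is what steers the network toward accurate boundary spectra.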

Datasets and Implementations
In this paper, we evaluate the proposed RA²UN on four benchmark datasets, i.e., the NTIRE2018 "Clean" and "Real World" tracks and the NTIRE2020 "Clean" and "Real World" tracks. Following the competition instructions, the NTIRE2018 dataset contains 256 natural HSIs for the official training set and 5 + 10 additional images for the official validation and testing sets, each of size 1392 × 1300. All images have 31 spectral bands (400-700 nm at roughly 10 nm increments). The NTIRE2020 dataset consists of 450 images for the official training set, 10 images for the official validation set and 20 images for the official testing set, with 31 bands from 400 nm to 700 nm at 10 nm steps. Each band has a size of 512 × 482. The NTIRE2020 datasets were collected with a Specim IQ mobile hyperspectral camera. The Specim IQ is a stand-alone, battery-powered, pushbroom spectral imaging system the size of a conventional SLR camera (207 × 91 × 74 mm), which can operate independently without an external power source or computer controller. The NTIRE2018 datasets were acquired using a Specim PS Kappa DX4 hyperspectral camera and a rotary stage for spatial scanning.
For the dataset settings, due to the confidentiality of ground truth HSIs for the official testing set of both SR contests, we choose the official validation as the final testing set and randomly select several images from the official training set as the final validation set in this paper. The rest of the official training set is adopted as the final training set. Specifically, the NTIRE2020 final validation set contains 10 HSIs including "ARAD_HS_0079", "ARAD_HS_0089", "ARAD_HS_0255", "ARAD_HS_0304", "ARAD_HS_0363", "ARAD_HS_0372", "ARAD_HS_0387", "ARAD_HS_0422", "ARAD_ HS_0434" and "ARAD_HS_0446". The NTIRE2018 final validation set chooses 5 HSIs including "BGU_HS_00001", "BGU_HS_00036", "BGU_HS_00204", "BGU_HS_00209" and "BGU_HS_00225".
During the training process, we crop 64 × 64 RGB and HSI sample pairs from the original NTIRE2020 and NTIRE2018 datasets. The batch size of our model is 16 and the parameter optimization algorithm is Adam [55] with β_1 = 0.9, β_2 = 0.99 and ε = 10⁻⁸. The parameter t of the SAA module is 4 and the reduction ratio r of the CAA module is 16. The learning rate is initialized to 1.2 × 10⁻⁴ and a polynomial function with power = 1.5 is set as the decay policy. We stop network training at 100 epochs. The proposed RA²UN network is implemented in the PyTorch framework on an NVIDIA 2080Ti GPU.
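The polynomial decay policy can be sketched as follows. The exact formula lr = base_lr · (1 − epoch/max_epochs)^power is an assumption, since the paper only states a polynomial decay with power = 1.5:

```python
def poly_lr(epoch, base_lr=1.2e-4, max_epochs=100, power=1.5):
    """Assumed polynomial learning-rate schedule:
    lr = base_lr * (1 - epoch / max_epochs) ** power."""
    return base_lr * (1 - epoch / max_epochs) ** power

# The rate starts at 1.2e-4 and decays to 0 at epoch 100.
for epoch in (0, 50, 100):
    print(epoch, poly_lr(epoch))
```

Under this form the learning rate decreases monotonically over the 100 training epochs and reaches zero at the final epoch.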

Evaluation Metrics
To objectively evaluate the results of our proposed method on the NTIRE2020 and NTIRE2018 datasets, the mean relative absolute error (MRAE), root mean square error (RMSE), and spectral angle mapper (SAM) are adopted as metrics. The MRAE and RMSE are provided by the challenge, where MRAE is chosen as the ranking criterion rather than RMSE to avoid overweighting errors in the higher-brightness regions of the test image. The SAM is employed to measure the spectral quality. The MRAE, RMSE and SAM are defined as follows:

MRAE = (1/N) Σ_{p=1}^{N} |I^{(p)}_HSI − I^{(p)}_SR| / I^{(p)}_HSI,
RMSE = sqrt((1/N) Σ_{p=1}^{N} (I^{(p)}_HSI − I^{(p)}_SR)²),
SAM = (1/M) Σ_{m=1}^{M} arccos((I^{(m)}_HSI)^T I^{(m)}_SR / (||I^{(m)}_HSI|| · ||I^{(m)}_SR||)),

where I^{(p)}_HSI and I^{(p)}_SR denote the p-th pixel of the ground truth and of the spectrally reconstructed HSI, respectively, and I^{(m)}_HSI and I^{(m)}_SR denote the m-th spectral vectors. || · || is the l2 norm. N is the total number of pixels and M is the total number of spectral vectors. A smaller MRAE, RMSE or SAM indicates better performance.
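These three metrics can be computed directly; the sketch below follows the standard definitions (per-pixel relative error, per-pixel squared error, and per-pixel spectral angle):

```python
import numpy as np

def mrae(gt, sr):
    """Mean relative absolute error over all pixels."""
    return np.mean(np.abs(gt - sr) / gt)

def rmse(gt, sr):
    """Root mean square error over all pixels."""
    return np.sqrt(np.mean((gt - sr) ** 2))

def sam(gt, sr, eps=1e-12):
    """Mean spectral angle (radians) between per-pixel spectral vectors.
    gt, sr: arrays of shape (bands, H, W)."""
    v1 = gt.reshape(gt.shape[0], -1)
    v2 = sr.reshape(sr.shape[0], -1)
    cos = np.sum(v1 * v2, axis=0) / (
        np.linalg.norm(v1, axis=0) * np.linalg.norm(v2, axis=0) + eps)
    return np.mean(np.arccos(np.clip(cos, -1.0, 1.0)))

rng = np.random.default_rng(0)
gt = rng.uniform(0.1, 1.0, size=(31, 8, 8))   # toy 31-band ground truth
sr = gt + 0.02 * rng.standard_normal(gt.shape)
print(mrae(gt, sr), rmse(gt, sr), sam(gt, sr))
```

All three metrics are zero (up to floating-point effects) for a perfect reconstruction, so smaller values indicate better performance, as stated above.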

Discussion on the Proposed RA²UN: Ablation Study
In order to demonstrate the effectiveness of the SAA module, the CAA module and the boundary-aware constraint, we conduct an ablation study on the NTIRE2020 "Clean" track dataset. The results are summarized in Table 2. R_a refers to the baseline network without any attention module, trained with the individual MRAE loss term l_m. In Table 2, the baseline reaches MRAE = 0.03668. Spatial Augmented Attention Module. First, we add only the SAA module to the basic model in R_b and obtain a decline in MRAE. This implies that the SAA module is helpful to emphasize the features in important regions and boost the spatial feature representations. The results of R_e and R_f further prove the effectiveness of the SAA module when the CAA module is additionally employed or the boundary-aware constraint is established.
Channel Augmented Attention Module. As elaborated in Section 2.3, the CAA module is developed to explore feature interdependencies among channels. Compared with the baseline, R_c achieves a 7.42% decrease in the MRAE value. The reason may be that the CAA module recalibrates channel-wise feature responses adaptively and realizes a more powerful learning capability of the network. Compared with the results of R_b and R_d, the results of R_e and R_g further demonstrate the superiority of the CAA module, respectively.
Boundary-aware Constraint. In contrast to the baseline experiment R_a, R_d is optimized with both the MRAE loss term l_m and the boundary-aware constraint l_b. The result of R_d indicates that the boundary-aware constraint helps recover more accurate HSIs. Furthermore, the results of R_f, R_g and R_h all verify the effectiveness of the boundary-aware constraint. In particular, we obtain the best MRAE value with the two modules and the boundary-aware constraint together in R_h.

Results of SR and Analysis
In this study, we compare the proposed RA²UN against six existing methods: Arad [17], Galliani [23], Yan [26], Stiebel [30], HSCNN-R [34] and HRNet [37]. Among them, Arad is an early SR approach based on sparse recovery, while the others are based on CNNs. For a fair comparison, all models are retrained on the final training set, validated on the final validation set, and evaluated on the final testing set for the two tracks of the NTIRE2020 and NTIRE2018 datasets. The quantitative results on the final testing sets of the NTIRE2020 and NTIRE2018 "Clean" and "Real World" tracks are listed in Tables 3 and 4. Since the camera response function is unknown, Arad can only be measured on the "Clean" tracks. It can be seen that our RA²UN achieves the best results under the MRAE, RMSE and SAM metrics on all tracks. In terms of the ranking metric MRAE, the proposed method achieves relative reductions of 14.02%, 6.89%, 14.21% and 1.27% over the second-best results on the corresponding benchmarks. In addition, we obtain the smallest SAM values, which indicates that our reconstructed HSIs have better spectral quality.
Also, we show the visual comparison of five selected bands on different example images of the final testing set in Figures 5-8. The ground truth, our results and the error images are displayed from top to bottom. The error images are heat maps of the MRAE between the ground truth and the recovered HSI; the bluer the displayed color, the better the reconstructed spectrum. As can be seen, our approach yields better recovery results and has less reconstruction error than the other competitors. Besides, the spectral response curves of four selected spatial points are plotted in Figure 9. The red line is our result and the black one denotes the ground-truth spectrum; the rest are the results of the comparison methods. Obviously, the reconstructed results of RA²UN are much closer to the ground-truth spectrum than the others. Figure 6. Visual comparison of the five selected bands on the "ARAD_HS_0451" image from the final testing set of the NTIRE2020 "Real World" track. Best viewed on screen. Figure 7. Visual comparison of the five selected bands on the "BGU_HS_00265" image from the final testing set of the NTIRE2018 "Clean" track. Best viewed on screen. Figure 8. Visual comparison of the five selected bands on the "BGU_HS_00259" image from the final testing set of the NTIRE2018 "Real World" track. Best viewed on screen.
Figure 9. Spectral response curves of several selected spatial points from the reconstructed HSIs. (a,b) are for the NTIRE2020 "Clean" and "Real World" tracks, respectively; (c,d) are for the NTIRE2018 "Clean" and "Real World" tracks, respectively.

Conclusions
In this paper, we propose a novel RA²UN network for SR. Concretely, the backbone of the RA²UN network consists of several DIRB blocks instead of paired plain convolutional units. To boost the spatial feature representations, a trainable SAA module is developed to selectively highlight the features in important regions. Furthermore, we present a novel CAA module to adaptively recalibrate channel-wise feature responses by exploiting first-order and second-order statistics, enhancing the learning capacity of the network. To find a better solution, an additional boundary-aware constraint is built to guide the network to learn salient information at edge locations and recover more accurate details. Extensive experiments on challenging benchmarks demonstrate the superiority of our RA²UN network in terms of numerical and visual measurements.
Author Contributions: J.L. and C.W. conceived and designed the study; W.X. performed the experiments; R.S. shared part of the experiment data; J.L. and Y.L. analyzed the data; C.W. and J.L. wrote the paper. R.S. and W.X. reviewed and edited the manuscript. All authors have read and agreed to the published version of the manuscript.