SRT: A Spectral Reconstruction Network for GF-1 PMS Data Based on Transformer and ResNet

Abstract: The time of acquiring remote sensing data was halved after the joint operation of the Gao Fen-6 (GF-6) and Gao Fen-1 (GF-1) satellites. Meanwhile, GF-6 added four bands, including the "red-edge" band that can effectively reflect the unique spectral characteristics of crops. However, GF-1 data do not contain these bands, which greatly limits their application to crop-related joint monitoring. In this paper, we propose a spectral reconstruction network (SRT) based on Transformer and ResNet to reconstruct the missing bands of GF-1. SRT is composed of three modules: (1) The transformer feature extraction module (TFEM) fully extracts the correlation features between spectra. (2) The residual dense module (RDM) reconstructs local features and avoids the vanishing gradient problem. (3) The residual global construction module (RGM) reconstructs global features and preserves texture details. Compared with competing methods, such as AWAN, HRNet, HSCNN-D, and M2HNet, the proposed method achieves higher accuracy, improving the mean relative absolute error (MRAE) and root mean squared error (RMSE) by margins of 0.022 and 0.009, respectively. It also achieves the best accuracy in supervised classification based on support vector machine (SVM) and spectral angle mapper (SAM).


Introduction
The GF-6 was successfully launched in 2018 as China's first medium-high-resolution agricultural observation satellite, which cooperated with GF-1, China's first high-resolution earth observation satellite that was launched in 2013. It can not only reduce the time of remote sensing data acquisition from 4 days to 2, but also significantly improve the ability to monitor agriculture, forestry, grassland, and other resources, providing remote sensing data support for agricultural and rural development, ecological civilization construction [1], and other significant needs. GF-6 also realized the localization of the 8-band CMOS detector and added the red-edge band that can effectively reflect the unique spectral characteristics of crops [2,3].
However, GF-1 was launched earlier with a different mission orientation, so it contains only four multispectral bands. As the comparison with GF-6 in Table 1 shows, GF-1 lacks four bands (purple, yellow, red-edge I, and red-edge II), which greatly constrains its use in crop-related joint monitoring. We therefore seek a spectral reconstruction method to recover these four missing bands.
In addition, most studies on spectral reconstruction focus on images with the three visible bands (red, green, and blue), whereas remote sensing images usually contain at least four bands (red, green, blue, and near-infrared (nir)). Such methods leave out the essential nir band as input and therefore do not make full use of the original information. Some studies of remote sensing spectral reconstruction already consider this problem [15,16]. However, few studies have addressed large-scale and highly complex scenarios such as satellite remote sensing; most have only been carried out over relatively small areas [15]. Most deep learning methods adopt many up-sampling, down-sampling, and non-local attention structures designed for ground-level images. Because remote sensing images are large in scale and contain numerous, complex ground objects, these structures struggle to perform well in the spectral reconstruction of remote sensing images [16].
To better adapt to the spectral reconstruction of remote sensing images, we propose a more suitable spectral reconstruction network (SRT) for GF-1 panchromatic and multispectral sensor (PMS) data based on Transformer and ResNet. This network includes the TFEM, the RDM, and the RGM. The first module extracts the correlation characteristics between spectra. The second module reconstructs these features nonlinearly at the local level while avoiding the vanishing gradient problem. The third module, mainly used for the global reconstruction of these features, prevents the loss of texture details. The main contributions of this article are summarized as follows:
• We propose a spectral reconstruction network. The network trains on GF-6 wide field view (WFV) images to reconstruct the four missing bands of GF-1 PMS images, which significantly increases the classification capability of GF-1.
• We produce a large-scale dataset that covers a wide area and is rich in land types. It largely provides the ground object information required for spectral reconstruction.
• To evaluate the generalization ability of our model, we compare it with other models in image similarity and classification accuracy, and conclude that our model gives the best results.
The remainder of this article is organized as follows: Section 2 describes related work on spectral reconstruction methods. Section 3 presents the SRT network. Section 4 presents our results, including the dataset description, the experiments, and their analysis. Section 5 concludes the article.

Related Works
Due to the limitations of hardware resources (bandwidth and sensors), researchers have had to make trade-offs among the temporal, spatial, and spectral dimensions of remote sensing images. To address the problem of low spectral dimension, researchers mainly used principal component analysis (PCA) [17,18], Wiener estimation (WEN) [19], and pseudoinverse (PI) [20,21] to construct a spectral mapping matrix. In recent years, spectral reconstruction methods have been divided into two branches: prior-driven and data-driven methods.
The first type is mainly based on sparse dictionary learning, which aims to extract the most important spectral mapping features. It can represent as much knowledge as possible with as few resources as possible, and this representation has the added benefit of being computationally fast. For example, Arad and Ben-Shahar [4] were the first to apply an overcomplete dictionary to recover hyperspectral images from RGB. Jonas et al. [5] used the A+ algorithm to improve Arad's approach to the sparse dictionary. The A+ algorithm directly constructs the mapping from RGB to hyperspectral at the local anchor point, and the running speed of the algorithm is significantly improved. The sparse dictionary method only considers the sparsity of spectral information and does not use local linearity. The disadvantage is that the reconstruction is inaccurate, and the reconstructed image has metamerism [22]. Li et al. [7] proposed a locally linear embedding sparse dictionary method to improve the representation ability of sparse coding. In order to improve the representation ability of the sparse dictionary, this method only selects the local best samples and introduces texture information in the reconstruction, reducing the metamerism. Geng et al. [8] proposed a spectral reconstruction method that preserves contextual information. Gao et al. [9] performed spectral enhancement of multispectral images by jointly learning low-rank dictionary pairs from overlapping regions.
The second type is mainly based on deep learning. With the development of deep learning, a large number of excellent models have gradually replaced the first type of method owing to their powerful generalization ability. Compared to the first type, deep learning usually requires enormous amounts of data, and the training process takes a lot of computational time. Nevertheless, with the increase in computing power, deep learning has become much more effective, and the related methods are used by more and more researchers. Xiong et al. [10] proposed a deep learning framework for recovering spectral information from spectrally undersampled images. Koundinya et al. [12] compared 2D and 3D kernel-based CNNs for spectral reconstruction. Alvarez-Gila et al. [11] posed spectral reconstruction as an image-to-image mapping problem and proposed a generative adversarial network for spatial context-aware spectral image reconstruction. In NTIRE 2018 [23], the first spectral reconstruction challenge, the entries of Shi et al. [13] ranked first (HSCNN-D) and second (HSCNN-R) on both the "Clean" and "Real World" tracks. The main difference between the two networks is that the former fuses features by concatenation (in series), while the latter fuses them by addition. The concatenation approach learns the mapping relationship between spectra very well. Considering shallow and deep feature extraction separately, Li et al. [24] proposed an adaptive weighted attention network (AWAN), which ranked first on the "Clean" track. Zhao et al. [14] proposed a hierarchical regression network (HRNet) that obtained first place on the "Real World" track; it is a 4-level multi-scale structure that uses down-sampling and up-sampling to extract spectral features. For remote sensing images, Deng et al. [15] proposed a network (M2H-Net) better suited to the multiple bands and complex scenes of remote sensing.
Li and Gu [16] proposed a progressive spatial-spectral joint network for hyperspectral image reconstruction.

SRT Architecture
Figure 1 shows the architecture of SRT. In training, the model takes the red, blue, green, and nir bands of GF-6 WFV as input, and the remaining purple, yellow, red-edge I, and red-edge II bands are used as labels. The overall structure includes the TFEM, the RDM, the RGM, convolution operations, and other related operations. The whole SRT is an end-to-end structure, which can be divided into three parts:
1. The TFEM extracts the correlation between spectra through the self-attention mechanism.
2. The RDM fully learns and reconstructs the local features while preventing gradient vanishing during training.
3. The RGM reconstructs the global features. The model is ultimately applied to GF-1 PMS (8 m) images, whose spatial resolution is twice that of the GF-6 WFV (16 m) training images; this module prevents the loss of texture details during training and inference.

TFEM
Google first proposed the Transformer architecture in June 2017 [25]. Its impact on the whole natural language processing (NLP) field has been tremendous: in just four years since it was proposed, Transformer has become the dominant model in NLP [26]. Since 2020, it has also performed strongly in computer vision (CV): image classification (ViT [27], DeiT [28]), object detection (DETR [29], Deformable DETR [30]), semantic segmentation (SETR [31], MedT [32]), image generation (GANsformer [33]), and so on. He et al. [34] further showed that masked autoencoders (MAE) are scalable self-supervised learners for CV. Inspired by these developments, we use Transformer as the feature extraction backbone of SRT so that its attention mechanism can fully extract the relevant features between spectra.
The architecture of the TFEM is shown in Figure 2. Following ViT [27], we divide the remote sensing images into multiple small patches and serialize each patch through a linear projection of the flattened patch, so that a vision problem turns into an NLP problem. The module adds learnable position embedding parameters to maintain the spatial location information of the input patches. The Transformer encoder extracts spectral features from the input sequences with the help of its multi-head attention mechanism. Because Transformer is used only for feature extraction in our experiment, we remove the learnable classification embedding of ViT and replace the MLP head with ConvTranspose to ensure that the model maps to the same dimension.
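The patch serialization and attention steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the patch size (16), the embedding width (64), the random weights, and the helper names `patchify` and `self_attention` are all assumptions made for illustration, and a single attention head stands in for the multi-head encoder.

```python
import numpy as np

def patchify(image, patch):
    """Split a (C, H, W) image into flattened, non-overlapping patches of shape (N, C*patch*patch)."""
    c, h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be a multiple of the patch size"
    gh, gw = h // patch, w // patch
    x = image.reshape(c, gh, patch, gw, patch)
    x = x.transpose(1, 3, 0, 2, 4)  # (gh, gw, C, patch, patch)
    return x.reshape(gh * gw, c * patch * patch)

def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a (N, D) token sequence."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
image = rng.random((4, 128, 128))      # four input bands, one 128 x 128 training patch
patch, dim = 16, 64                    # hypothetical patch size and embedding width

patches = patchify(image, patch)                              # (64, 1024)
w_embed = rng.normal(0.0, 0.02, (patches.shape[1], dim))      # linear projection (random here)
pos_embed = rng.normal(0.0, 0.02, (patches.shape[0], dim))    # learnable in a real model
tokens = patches @ w_embed + pos_embed                        # (64, 64)

wq, wk, wv = (rng.normal(0.0, 0.02, (dim, dim)) for _ in range(3))
features, attention = self_attention(tokens, wq, wk, wv)
```

Each row of `attention` sums to one and weights how strongly one patch attends to every other patch; a real encoder would stack several such layers with multiple heads, layer normalization, and feed-forward blocks.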

RDM
He et al. [35] proposed a residual learning framework (ResNet) to ease the training of networks that are substantially deeper than those used previously. Building on ResNet, DenseNet [36] connects each layer to all previous layers; it is a network framework that enriches the family of CNN architectures that began with LeNet [37]. Connecting all layers ensures maximum flow of spectral information through the network. DenseNet also requires fewer parameters for the same performance or the same number of layers, because each layer is directly connected to all previous layers and does not have to relearn features that have already been learned.
The RDM contains four residual dense blocks, as shown in Figure 3, and a long skip connection is added in front of the module to prevent the vanishing gradient problem in the network. Combining residual and dense connections in the spectral reconstruction model alleviates the vanishing gradient problem during training and yields more accurate results.
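The connectivity pattern of residual dense blocks can be sketched as follows. This is a simplified illustration under stated assumptions, not the network in Figure 3: 1 × 1 (pointwise) convolutions stand in for the 3 × 3 convolutions, and the channel count, growth rate, number of layers per block, and all function names are hypothetical.

```python
import numpy as np

def pointwise_conv(x, weight):
    """1x1 convolution: mix the channels of x (C_in, H, W) with weight (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def residual_dense_block(x, layer_weights, fuse_weight):
    """Dense connections inside the block, followed by fusion and a local residual."""
    features = [x]
    for w in layer_weights:
        dense_input = np.concatenate(features, axis=0)   # every layer sees all earlier outputs
        features.append(np.maximum(pointwise_conv(dense_input, w), 0.0))  # ReLU
    fused = pointwise_conv(np.concatenate(features, axis=0), fuse_weight)
    return x + fused                                     # local residual connection

def make_block_params(rng, channels, growth, n_layers):
    """Random weights whose input width grows by `growth` channels per dense layer."""
    layer_weights = [rng.normal(0.0, 0.05, (growth, channels + i * growth))
                     for i in range(n_layers)]
    fuse_weight = rng.normal(0.0, 0.05, (channels, channels + n_layers * growth))
    return layer_weights, fuse_weight

def residual_dense_module(x, blocks):
    """Chain the blocks and add a long skip connection around the whole module."""
    out = x
    for layer_weights, fuse_weight in blocks:
        out = residual_dense_block(out, layer_weights, fuse_weight)
    return x + out

rng = np.random.default_rng(0)
channels, growth, n_layers = 16, 8, 3                    # hypothetical sizes
blocks = [make_block_params(rng, channels, growth, n_layers) for _ in range(4)]
x = rng.random((channels, 32, 32))
y = residual_dense_module(x, blocks)
```

Because every block and the module as a whole add their input back to their output, an identity path exists from the output to the input regardless of the learned weights, which is what keeps gradients from vanishing in deep stacks.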

RGM
The RGM, shown in Figure 4, draws on SE-ResNet [38] and HRNet [14]. Average pooling biases the image features toward overall characteristics and prevents the loss of too much high-dimensional information. The final convolution layer maps the number of channels, and the global residual preserves spatial details in images of different spatial resolutions.
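As a rough illustration of the ideas named above (global average pooling, a global residual, and a final channel-mapping convolution), the sketch below applies squeeze-and-excitation style channel recalibration. The exact layout of the RGM is given only in Figure 4, so the ordering of operations, the reduction ratio, the 1 × 1 output convolution, and the names `squeeze_excite` and `global_module` are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squeeze_excite(x, w_reduce, w_expand):
    """Channel recalibration in the style of SE blocks for a (C, H, W) feature map."""
    squeezed = x.mean(axis=(1, 2))                   # global average pooling -> (C,)
    hidden = np.maximum(w_reduce @ squeezed, 0.0)    # bottleneck with ReLU
    gate = sigmoid(w_expand @ hidden)                # per-channel weights in (0, 1)
    return x * gate[:, None, None], gate

def global_module(x, w_reduce, w_expand, w_out):
    """Recalibrate channels, add a global residual, then map to the output bands."""
    recalibrated, _ = squeeze_excite(x, w_reduce, w_expand)
    merged = x + recalibrated                        # global residual keeps spatial detail
    return np.einsum("oc,chw->ohw", w_out, merged)   # 1x1 conv used here for channel mapping

rng = np.random.default_rng(0)
channels, reduction, out_bands = 16, 4, 4            # hypothetical sizes; 4 reconstructed bands
x = rng.random((channels, 32, 32))
w_reduce = rng.normal(0.0, 0.1, (channels // reduction, channels))
w_expand = rng.normal(0.0, 0.1, (channels, channels // reduction))
w_out = rng.normal(0.0, 0.1, (out_bands, channels))

_, gate = squeeze_excite(x, w_reduce, w_expand)
y = global_module(x, w_reduce, w_expand, w_out)
```

The pooled statistics depend only on whole-image averages, so the gate carries global context, while the residual path passes the full-resolution features through unchanged.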

Loss Function
We use the mean relative absolute error (MRAE, Equation (1)) as the loss function because the reflectance of the same ground object varies greatly across bands.

$$\mathrm{MRAE} = \frac{1}{n}\sum_{i=1}^{n}\frac{\left|P_{gt}^{i} - P_{rec}^{i}\right|}{P_{gt}^{i}} \tag{1}$$

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(P_{gt}^{i} - P_{rec}^{i}\right)^{2} \tag{2}$$

where $P_{gt}^{i}$ is the gray-scale value of the $i$th pixel in the reference image, $P_{rec}^{i}$ is the reconstructed gray-scale value of the $i$th pixel, and $n$ is the number of pixels in the image.
Compared with the mean square error (MSE, Equation (2)), MRAE replaces the absolute difference with a relative one, so the error is adjusted adaptively for each band. This effectively reduces the high errors caused by differences in reflectance and reflects the accuracy of the reconstruction network more directly. On the validation set, we measure the models by the peak signal-to-noise ratio (PSNR [39], Equation (3)) and save the best model.

$$\mathrm{PSNR} = 10\log_{10}\left(\frac{MAX_{I}^{2}}{\mathrm{MSE}}\right) \tag{3}$$

where $MAX_{I}$ is the maximum gray-scale value. All data in this experiment are normalized, so $MAX_{I}$ is 1.
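A small NumPy sketch of the loss and the validation metric makes the motivation for MRAE concrete. The function names, the `eps` guard against division by zero, and the toy reflectance values are assumptions added for illustration.

```python
import numpy as np

def mrae(gt, rec, eps=1e-8):
    """Mean relative absolute error; eps (an added safeguard) avoids division by zero."""
    return float(np.mean(np.abs(gt - rec) / (gt + eps)))

def mse(gt, rec):
    """Mean square error."""
    return float(np.mean((gt - rec) ** 2))

def psnr(gt, rec, max_i=1.0):
    """Peak signal-to-noise ratio in dB; max_i is 1 for normalized data."""
    err = mse(gt, rec)
    return float("inf") if err == 0.0 else float(10.0 * np.log10(max_i ** 2 / err))

# Toy example: the same absolute error of 0.01 in a dark band and in a bright band.
dark_gt, bright_gt = np.full(100, 0.05), np.full(100, 0.50)
dark_rec, bright_rec = dark_gt + 0.01, bright_gt + 0.01

same_mse = (mse(dark_gt, dark_rec), mse(bright_gt, bright_rec))     # identical for both bands
rel_errors = (mrae(dark_gt, dark_rec), mrae(bright_gt, bright_rec)) # about 0.2 versus 0.02
```

MSE treats the two bands identically, whereas MRAE penalizes the low-reflectance band ten times more, which is the band-adaptive behaviour the text describes.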

Network Training and Parameter Settings
The parameters of the Transformer encoder are left at their defaults, and the network hyperparameters are set according to Table 2. The size of each convolution kernel in the network is 3 × 3. For the optimizer, we choose Adam. The computer configuration in this study is as follows: the CPU is an Intel(R) Xeon(R) Gold 6148, the GPU is a Tesla V100 (16 GB), and the RAM is 16 GB. Paddle 2.2 was chosen as the development environment.

Experiments
The experiments evaluate the quality of the spectral reconstruction in terms of similarity and classification accuracy. Four outstanding methods, AWAN, HRNet, HSCNN-D, and M2H-Net (a remote sensing image reconstruction method), are selected for comparison with our models SRT and SRT*; the first three are champion methods of the NTIRE spectral reconstruction challenges. SRT* is SRT with the RGM removed and is used to test the effect of that module.

Dataset Description
We use image scenes from GF-1 PMS and GF-6 WFV. The data acquisition for the study areas is shown in Figure 5. We select nine GF-6 WFV images to form the dataset, six for training and three for testing. The dataset covers a wide range of land types and provides sufficient feature information for the spectral reconstruction of GF-1 PMS. We randomly divide the training images into 13,500 overlapping patches of 128 × 128 pixels, 90% of them for training and the rest for validation. The testing ones are divided into 2000 overlapping patches of 128 × 128 pixels.
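The patch preparation described above can be sketched as below. The random-crop strategy, the helper names, and the band order (four input bands first, four label bands last) are assumptions; the demo uses a small synthetic scene and 32 × 32 patches to stay lightweight, whereas the paper uses 128 × 128 patches and 13,500 training patches.

```python
import numpy as np

def sample_patches(image, n_patches, size, rng):
    """Crop randomly placed (and therefore possibly overlapping) size x size windows from (C, H, W)."""
    _, height, width = image.shape
    tops = rng.integers(0, height - size + 1, size=n_patches)
    lefts = rng.integers(0, width - size + 1, size=n_patches)
    return np.stack([image[:, t:t + size, l:l + size] for t, l in zip(tops, lefts)])

def train_val_split(patches, train_fraction, rng):
    """Shuffle the patches and split them into training and validation subsets."""
    order = rng.permutation(len(patches))
    n_train = int(round(train_fraction * len(patches)))
    return patches[order[:n_train]], patches[order[n_train:]]

rng = np.random.default_rng(0)
scene = rng.random((8, 256, 256), dtype=np.float32)   # synthetic stand-in for an 8-band WFV scene

patches = sample_patches(scene, n_patches=50, size=32, rng=rng)
train, val = train_val_split(patches, 0.9, rng)

# Assumed channel layout: first four bands are network inputs, last four are labels.
inputs, labels = train[:, :4], train[:, 4:]
```

Splitting at the patch level, as here, means overlapping training and validation patches can share pixels; splitting by scene or region would give a stricter validation set.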
The image shown in Area1 is the Songhua River, located in Yilan, Heilongjiang. It is a cropped GF-6 WFV test image that contains abundant information on water, vegetation, tree, and so on. The size of it is 2275 × 2174.
Area2, imaged by GF-1 on 11 April 2016, is located in Tengzhou, Shandong, and contains ample information on building, vegetation, and road. The size of its image is 2500 × 2322.
Area3, imaged by GF-1 on 21 June 2018, is located in Nenjiang, Heilongjiang, and contains rich vegetation, bare land, and tree. The size of its image is 3254 × 3145.
The preprocessing of GF-1 PMS and GF-6 WFV images includes radiometric correction and atmospheric correction in ENVI 5.3. The parameters required for these corrections are obtained from the China Resource Satellite Application Center [40]. Table 3 lists the number of pixels of the training and testing samples used for classification in the three areas. Each area is manually annotated into six classes in ENVI 5.3 software (Exelis Inc., Boulder, CO, USA) to test the classification ability of the reconstructed images, as shown in Figure 6.
Table 3. Details of the ground truth in Area1-3.

Evaluation Metrics
We use five indicators to evaluate the different methods: RMSE, MRAE (Equation (1)), PSNR (Equation (3)), spectral angle mapper (SAM [41]), and structural similarity (SSIM [42]). The formulas of RMSE, SAM, and SSIM are given as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(P_{gt}^{i} - P_{rec}^{i}\right)^{2}} \tag{4}$$

$$\mathrm{SAM} = \arccos\left(\frac{\sum_{i=1}^{n} P_{gt}^{i} P_{rec}^{i}}{\sqrt{\sum_{i=1}^{n}\left(P_{gt}^{i}\right)^{2}}\,\sqrt{\sum_{i=1}^{n}\left(P_{rec}^{i}\right)^{2}}}\right) \tag{5}$$

where $P_{gt}^{i}$ is the gray-scale value of the $i$th pixel in the reference image, $P_{rec}^{i}$ is the reconstructed gray-scale value of the $i$th pixel, and $n$ is the number of pixels in the image.

$$\mathrm{SSIM}(gt, rec) = \frac{\left(2\mu_{gt}\mu_{rec} + C_{1}\right)\left(2\sigma_{gtrec} + C_{2}\right)}{\left(\mu_{gt}^{2} + \mu_{rec}^{2} + C_{1}\right)\left(\sigma_{gt}^{2} + \sigma_{rec}^{2} + C_{2}\right)} \tag{6}$$

where $\mu_{gt}$ is the average value of the reference image, $\mu_{rec}$ is the average value of the reconstructed image, $\sigma_{gtrec}$ is the covariance of the reference image and the reconstructed image, $\sigma_{gt}$ is the standard deviation of the reference image, $\sigma_{rec}$ is the standard deviation of the reconstructed image, and $C_{1} = (k_{1}L)^{2}$ and $C_{2} = (k_{2}L)^{2}$ are constants used to maintain stability. $L$ is the dynamic range of the pixel values, $k_{1}$ is set to 0.01, and $k_{2}$ is set to 0.03.
Classification is an essential application of remote sensing images, and we use SVM and SAM classification to test the classification performance of the images. SVM handles both linear and non-linear classification problems well, needs only a few support vectors to determine the classification surface, and is not sensitive to the number of samples or the spectral dimensionality. SAM measures the similarity between two spectra by treating them as vectors and calculating the spectral angle between them. Therefore, it is sensitive to samples and spectral dimensionality.
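The three similarity metrics defined above can be sketched in NumPy as follows. The function names and the `eps` guard are assumptions, and `ssim_global` evaluates the formula once from whole-image statistics; common SSIM implementations instead average the same expression over local windows.

```python
import numpy as np

def rmse(gt, rec):
    """Root mean squared error over all pixels."""
    return float(np.sqrt(np.mean((gt - rec) ** 2)))

def spectral_angle(gt, rec, eps=1e-12):
    """Angle in radians between gt and rec treated as vectors; eps guards against zero norms."""
    numerator = np.sum(gt * rec)
    denominator = np.sqrt(np.sum(gt ** 2)) * np.sqrt(np.sum(rec ** 2)) + eps
    return float(np.arccos(np.clip(numerator / denominator, -1.0, 1.0)))

def ssim_global(gt, rec, dynamic_range=1.0, k1=0.01, k2=0.03):
    """SSIM computed from global means, variances, and covariance."""
    c1, c2 = (k1 * dynamic_range) ** 2, (k2 * dynamic_range) ** 2
    mu_gt, mu_rec = gt.mean(), rec.mean()
    var_gt, var_rec = gt.var(), rec.var()
    cov = np.mean((gt - mu_gt) * (rec - mu_rec))
    return float(((2 * mu_gt * mu_rec + c1) * (2 * cov + c2))
                 / ((mu_gt ** 2 + mu_rec ** 2 + c1) * (var_gt + var_rec + c2)))

rng = np.random.default_rng(0)
reference = rng.random((64, 64))
scaled = 0.5 * reference          # same spectral shape, half the brightness

angle_scaled = spectral_angle(reference, scaled)   # close to zero: SAM ignores uniform scaling
error_scaled = rmse(reference, scaled)             # clearly non-zero: RMSE does not
```

The contrast between `angle_scaled` and `error_scaled` shows why the metrics are reported together: the spectral angle captures shape differences, while RMSE captures magnitude differences.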
For the testing of GF-1 PMS images, no reference exists for the four generated bands, so the above indicators cannot be used and only the classification accuracy is evaluated. The assessment steps are as follows: first, input the original image to the model after radiometric calibration and atmospheric correction; then, classify the outputs with the SVM and SAM methods; finally, compare the overall accuracy (OA), kappa coefficient (Kappa), and per-class accuracy of all the methods.

Similarity-Based Evaluation
Table 4 shows the accuracy assessment of the reconstructed GF-6 WFV images on the dataset. Overall, the PSNR and SSIM of the four bands are all high, not less than 38.92 and 0.970, respectively. Similarly, MRAE, SAM, and RMSE are all relatively low, indicating that the overall accuracy of the reconstruction is high. Among the six methods, the results of AWAN, HSCNN-D, and M2HNet are similar. HRNet, SRT, and SRT* are much better than the other three methods in PSNR, MRAE, and SAM. SRT outperforms HRNet on the dataset, demonstrating that our TFEM outperforms the multi-scale feature extraction of HRNet. In addition, SRT*, which lacks the RGM, is slightly worse than SRT in some indicators but still has some advantages over the other methods.
The scatter plots in Figure 7 show that the inference results of bands 5 and 6 are scattered over larger areas than those of bands 7 and 8, which indicates that their reconstructed values are less correlated with the original ones. This is also reflected by the PSNR metric in Table 4: the larger the PSNR, the smaller the scattering region and the stronger the correlation between the predicted band and the original one. The PSNR of band 7 in Table 4 is the highest, and the scatter region of band 7 in Figure 7 is the smallest. Therefore, we can conclude that the reconstruction accuracy of band 7 is the best. The scatter plots also show that the reconstruction accuracy differs from band to band. Using MRAE rather than RMSE as the loss function helps prevent training from being dominated by the bands with large errors.

Classification-Based Evaluation
For GF-6 WFV images, we compute the confusion matrix from the classification results of the original image and the predicted one. Table 5 shows the evaluation results of the SVM classification. Both the OA and Kappa coefficients of SRT are the highest, 3.3% and 4.2% higher than those of AWAN, respectively. For the vegetation class, the SRT result is 6.3% higher than that of the second-best method, M2HNet. In Figure 8, we can see that the water classification result of M2HNet differs significantly from the reference image. Table 6 shows the evaluation results of the SAM classification, where the SRT results are still the best. Its differences in OA and Kappa from the classification of the original image are only 0.5% and 0.24%. This indicates that the spectral reconstruction capability of SRT is the best among the compared methods.
For GF-1 PMS images, the classification results of the reconstructed images should be higher than those of the original GF-1 images (8 m spatial resolution, four bands). Table 7 shows the accuracy metrics for SVM classification in Area2. Most methods improve the classification evaluation metrics, with SRT improving OA and Kappa by 2.1% and 4.3%, respectively. Except for the tree and road classes, the classification accuracy of SRT is higher than that of the original GF-1 PMS image for all classes. Table 8 shows the evaluation results for the SAM classification, where all the methods are still higher than the original results, and the SRT method is the best. Additionally, the results in Figure 9 show that the accuracy of SVM is higher than that of SAM, especially for urban scenes.
Table 9 shows the classification accuracy of Area3. Compared to the GF-1 image classification results, SRT improves the OA and Kappa by 2.41% and 2.0%, respectively, and the accuracies of most classes are better than before. Except for water and bare land, the classification accuracy of SRT is higher than that of the other methods for all classes. As shown in Table 10, SRT remains the highest.
However, the SAM classification accuracy of all methods in Area3 is much lower than that of SVM. For the original image, the OA and Kappa coefficients of the SAM classification are lower than those of SVM by as much as 8.8% and 16.7%, respectively. The difference between the SVM and SAM classification results can also be seen in Figure 10. SAM does not classify the built-up area well; it assigns a small part of the bare land to water and two plots of bare land to tree. This large difference may result from the low spectral dimension: because SAM is more sensitive to the spectrum, its classification accuracy is lower.
Tables 4-10 show that both SRT and SRT* outperform the other methods in terms of overall accuracy, which indicates that the TFEM has a significant advantage in spectral feature extraction. The SRT results remain the best under both SVM and SAM classification. By comparing the results of SRT and SRT*, we find that SRT needs the RGM to prevent the model from losing some details during GF-1 PMS image inference. In addition, with the same samples, the classification result of SAM is lower than that of SVM. We think the main reason is that the number of image bands used for classification is too small compared to hyperspectral images, so SAM cannot reach its full performance.
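The OA and Kappa values compared throughout this section are both derived from a confusion matrix. The sketch below shows the standard computation on a toy set of labels; the function names and the example labels are illustrative only and are not data from the paper.

```python
import numpy as np

def confusion_matrix(reference, predicted, n_classes):
    """Rows index the reference class, columns index the predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (reference, predicted), 1)
    return cm

def overall_accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    return float(np.trace(cm) / cm.sum())

def kappa(cm):
    """Cohen's kappa: agreement corrected for the agreement expected by chance."""
    total = cm.sum()
    observed = np.trace(cm) / total
    expected = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / total ** 2
    return float((observed - expected) / (1.0 - expected))

def per_class_accuracy(cm):
    """Correctly classified fraction of each reference class."""
    return np.diag(cm) / cm.sum(axis=1)

# Toy labels for three classes (illustrative only).
reference = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
predicted = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])

cm = confusion_matrix(reference, predicted, n_classes=3)
oa = overall_accuracy(cm)
k = kappa(cm)
class_acc = per_class_accuracy(cm)
```

Kappa is always lower than OA when chance agreement is non-zero, which is why the two metrics in the tables can move by different amounts for the same change in the classification map.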
Our method has a robust spectral reconstruction capability, and the reconstructed bands can improve the classification capability of GF-1 PMS images.
Table 11 shows the parameters, GFLOPs (giga floating-point operations), and running time of all tested methods on an input image of 4 × 128 × 128 pixels. Comparing SRT and SRT*, the RGM accounts for only 0.08 M parameters, and it increases the GFLOPs and running time by 1.21 and 0.02 s, respectively. In terms of the number of parameters, SRT is higher only than HSCNN-D and lower than the other three methods. Although HSCNN-D has few parameters, its running time is very long, much longer than the 0.27 s of SRT. This is mainly because the series structure of HSCNN-D deepens the network, so the network operation takes a lot of time.

Conclusions
This article proposes a Transformer- and ResNet-based network (SRT) to reconstruct GF-1 PMS images from GF-6 WFV. SRT consists of three parts: the TFEM, the RDM, and the RGM. The TFEM learns the correlation between spectra through the attention mechanism. We use the RDM to reconstruct these relevant features locally and apply the RGM to reconstruct them globally.
To ensure the model's generalization, we produce a wide-range, land-type-rich band mapping dataset and test the accuracy in terms of similarity and classification. Meanwhile, to verify whether the knowledge learned from the GF-6 WFV images can be applied to GF-1 PMS images with a different spatial resolution, we follow the approach of Deng et al. [15] and Li and Gu [16]. We expect the reconstructed bands to improve the classification ability of the original image and test this on the GF-1 PMS images of Area2 (mainly urban scenes) and Area3 (mainly farmland scenes). The results show that SRT performs well on both the testing set and the classification accuracy of Area1, Area2, and Area3 compared to other spectral reconstruction methods. The classification accuracy of the reconstructed 8-band images is significantly higher than that of the original 4-band GF-1 PMS images.
In future work, our method can be expanded and improved in the following aspects: (1) The structure of the model needs to be improved. Although SRT has fewer parameters, its running time increases slightly. (2) Whether the method can be extended to other satellites, such as GaoFen-2 and GaoFen-4, remains to be explored.