Two-Path Network with Feedback Connections for Pan-Sharpening in Remote Sensing

Abstract: High-resolution multi-spectral images are desired for many applications in remote sensing. However, optical remote sensing satellites can only provide multi-spectral images at low resolutions. The technique of pan-sharpening aims to generate a high-resolution multi-spectral (MS) image from a panchromatic (PAN) image and the low-resolution MS counterpart. Conventional deep-learning-based pan-sharpening methods process the panchromatic and the low-resolution images in a feedforward manner, where shallow layers fail to access useful information from deep layers. To make full use of the powerful deep features that have strong representation ability, we propose a two-path network with feedback connections, through which the deep features can be rerouted to refine the shallow features in a feedback manner. Specifically, we leverage the structure of a recurrent neural network to pass the feedback information. Besides, a powerful feature extraction block with multiple projection pairs is designed to handle the feedback information and to produce powerful deep features. Extensive experimental results show the effectiveness of our proposed method.


Introduction
Precise monitoring based on multi-spectral (MS) images is an essential application in remote sensing. To meet the need for high spatial and high spectral resolutions, a sensor needs to receive enough radiation energy and to collect enough data. The size of an MS detector is usually larger than that of a PAN detector so as to receive the same amount of radiation energy; consequently, the resolution of the MS sensor is lower than that of the PAN sensor [1]. Besides, a high-resolution MS image requires significantly more storage than a high-resolution PAN image bundled with a low-resolution MS image, and is also less convenient to transmit. To attain satisfactory high-resolution multi-spectral images for accurate monitoring, pan-sharpening is an encouraging alternative to expensively upgrading optical satellites. The technique of pan-sharpening is to output a high-resolution multi-spectral (HRMS) image based on a high-spatial-resolution panchromatic (PAN) image and a low-spatial-resolution multi-spectral (LRMS) image [2]. After pan-sharpening, the pan-sharpened image has the same spatial size as the single-band PAN image.
Conventional pan-sharpening algorithms can be broadly divided into three major categories: (1) component substitution, (2) multi-resolution analysis and (3) regularization methods. The first category, component substitution methods [8–11], assumes that the information about geometric detail lies in the structural part, which can be obtained by transforming the LRMS image into a proper domain [12]. Then, the structural part is totally or partially substituted by the corresponding part of the PAN image. Finally, the pan-sharpened image is obtained by the corresponding inverse transformation. The multi-resolution analysis approaches [13–15] add detail information from the PAN image to produce a high-resolution MS image. The regularization methods [16,17] focus on building an energy function with strict constraints based on reasonable prior assumptions, such as sparse coding and variational models. Conventional pan-sharpening algorithms can easily cause different kinds of distortions, leading to severe quality degradation [18]. In component substitution methods, the spectral characteristics of the MS image differ from those of the PAN image; therefore, both spatial and spectral distortions are introduced into the pan-sharpened image. In the multi-resolution analysis approaches, spatial distortions may be introduced by texture substitution or aliasing effects [19]. In the regularization methods, the performance of pan-sharpening depends largely on the energy function; however, it is challenging to build an appropriate one. Fortunately, as deep learning develops rapidly [20–23], convolutional neural networks (CNNs) have been introduced to pan-sharpening, offering new solutions to the aforementioned problems.
Inspired by the Super-Resolution Convolutional Neural Network (SRCNN) [22], Masi et al. [7] introduced a very shallow convolutional neural network, with only three layers, into pan-sharpening. A three-layer network was also proposed by Zhong et al. [24] to upsample the MS image; the upsampled MS image was then fused with the PAN image by the Gram-Schmidt transformation. Wei et al. [6] proposed a deep residual pan-sharpening neural network to boost the accuracy of pan-sharpening. Instead of simply borrowing ideas from single-image super-resolution, Liu et al. [5] proposed a two-stream fusion network to process the MS image and the PAN image independently and then reconstructed the high-resolution image from the fused features. In [25], the MS image and the PAN image are also processed separately, by a bidirectional pyramid network. However, the aforementioned deep-learning-based pan-sharpening methods transmit features in a feedforward manner: the shallow layers fail to gain powerful features from the deep layers, which are helpful for reconstruction.
The shallow layers can only extract low-level features, lacking sufficient contextual information and receptive fields. However, these less powerful features must be reused in the subsequent layers, which limits the reconstruction performance of the network. Inspired by [26,27], both of which transmit deep features back to the shallow layers to refine the low-level representations, we propose a two-path pan-sharpening network with feedback connections (TPNwFB), which enables deep features to flow back to shallow layers in a top-to-bottom manner. Specifically, TPNwFB is essentially a recurrent neural network with a special feature extraction block (FEB) that can extract powerful deep representations and process the feedback information from the previous time step. As suggested by [27,28], the FEB is composed of multiple pairs of up-sampling and down-sampling layers, with dense connections among layers to achieve feature reuse. The details of the FEB can be found in Section 3.4. The iterative up- and down-sampling realizes a back-projection mechanism [29], which enables the network to generate powerful features by learning various up- and down-sampling operators. The dense skip connections allow the reuse of features from preceding layers, avoiding the repetitive learning of redundant features. We simply use the hidden states of an unfolded RNN, i.e., the output of the FEB at each time step, to realize the feedback mechanism: the powerful output of the FEB at a certain time step flows into the next time step to improve the less powerful low-level features. Besides, to ensure that the hidden state at each time step carries useful information for reconstructing a better HRMS image, we attach the loss to every time step. Both objective and subjective results demonstrate the superiority of the proposed TPNwFB over other state-of-the-art pan-sharpening methods. In summary, our main contributions are listed as follows.
1. We propose a two-path feedback network that extracts features from the MS image and the PAN image separately and achieves a feedback mechanism, which carries powerful deep features back to improve the poor shallow features for better reconstruction performance.
2. We attach a loss to each time step to supervise the output of the network. In this way, the feedback deep features contain useful information, which comes from the coarsely reconstructed HRMS image at early time steps, to reconstruct a better HRMS image at late time steps.

Datasets
To verify the effectiveness of our proposed method, we compare it with other pan-sharpening methods on five widely used datasets: Spot-6, Pléiades, IKONOS, QuickBird and WorldView-2.
Their characteristics are described in Table 1. Note that, for the WorldView-2 dataset, the MS images have 8 bands, named Red, Green, Blue, Red Edge, Coastal, Yellow, Near-Infrared-1 and Near-Infrared-2. To maintain consistency with the other datasets, the Red, Green, Blue and Near-Infrared-1 bands are used for evaluation in the experiments. For the reference images used in evaluation, we follow Wald's protocol [30]: we use bicubic interpolation to downsample the original MS and PAN images by a scale factor of 4 and feed the downsampled images into the network, and we consider the original MS image as the reference.
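The reduced-resolution protocol above can be sketched as follows, assuming PyTorch tensors in (B, C, H, W) layout; the function name and layout are illustrative, not from the paper:

```python
import torch
import torch.nn.functional as F

def make_reduced_resolution_pair(ms, pan, scale=4):
    """Wald's protocol: bicubically downsample the original MS and PAN
    images by `scale`; the untouched MS image serves as the reference."""
    lrms = F.interpolate(ms, scale_factor=1.0 / scale,
                         mode="bicubic", align_corners=False)
    lrpan = F.interpolate(pan, scale_factor=1.0 / scale,
                          mode="bicubic", align_corners=False)
    return lrms, lrpan, ms  # network inputs and reference
```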

Feedback Mechanism
The feedback mechanism commonly exists in the human visual system, which is able to carry information from high-level parts to low-level parts [31]. Lately, many works have made efforts to introduce the feedback mechanism [26,27,29,32–34]. For single-image super-resolution, Haris et al. [29] achieved iterative error feedback based on back-projection theory with up-projection and down-projection units. Han et al. [34] achieved a delayed feedback mechanism with a dual-state RNN that transmits information between the two recurrent states.
The most relevant work to ours is [27], which elaborately designed a feedback block to extract powerful high-level representations for low-level computer vision tasks and transmitted those representations back to refine the low-level features. To introduce the feedback mechanism to pan-sharpening, we design a two-path pan-sharpening network with feedback connections (TPNwFB), which processes the PAN image and the MS image in two separate paths and is thus better suited to pan-sharpening.

Methods
In this section, the implementation details, the evaluation metrics, the network structure of the proposed TPNwFB and the loss function are described in detail.

Implementation Details
As suggested by Liu et al. [5], we test our proposed network on the five datasets mentioned in Section 2.1 separately. We adopt the same data augmentation as [35]. Following Wald's protocol [30], there are 30,000 training samples for each dataset. We adopt PReLU [36] as the activation function attached after every convolutional and deconvolutional layer except the last layer of the network at each time step. We take the pan-sharpened image I_out^T from the last time step as our pan-sharpened result. The proposed network is implemented in PyTorch [37] and trained on one NVIDIA RTX 2080Ti GPU. The Adam optimizer [38] is employed to optimize the network with an initial learning rate of 0.0001 and a momentum of 0.9. The mini-batch size is set to 4 and the size of the image patches is set to 64 × 64.
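A minimal sketch of this training configuration (Adam with learning rate 1e-4 and β1 = 0.9, mini-batch of 4, L1 objective); the single convolutional layer stands in for the full TPNwFB and is purely illustrative:

```python
import torch

model = torch.nn.Conv2d(4, 4, 3, padding=1)  # illustrative stand-in for TPNwFB

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
criterion = torch.nn.L1Loss()

lrms = torch.rand(4, 4, 64, 64)    # mini-batch of four 64x64 patches
target = torch.rand(4, 4, 64, 64)  # matching reference patches

# one optimization step
optimizer.zero_grad()
loss = criterion(model(lrms), target)
loss.backward()
optimizer.step()
```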

• SAM. The spectral angle mapper (SAM) [39] evaluates the spectral distortion of the pan-sharpened image. It is defined as SAM(x_1, x_2) = arccos( ⟨x_1, x_2⟩ / (‖x_1‖ · ‖x_2‖) ), where x_1 and x_2 are two spectral vectors.
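The SAM computation can be sketched in NumPy as below, averaging the per-pixel angle in degrees (the averaging convention is our assumption):

```python
import numpy as np

def sam(img1, img2, eps=1e-8):
    """Mean spectral angle (degrees) between two (H, W, C) images."""
    dot = np.sum(img1 * img2, axis=-1)
    norm = np.linalg.norm(img1, axis=-1) * np.linalg.norm(img2, axis=-1)
    angle = np.arccos(np.clip(dot / (norm + eps), -1.0, 1.0))
    return float(np.degrees(angle).mean())
```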

• CC. The correlation coefficient (CC) [5] is used to evaluate the spectral quality of the pan-sharpened images. The CC is calculated between a pan-sharpened image I and the corresponding reference image Y: CC = Cov(I, Y) / sqrt( Var(I) · Var(Y) ), where Cov(I, Y) is the covariance between I and Y, and Var(n) denotes the variance of n.
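A band-averaged NumPy sketch of the CC (averaging over spectral bands is our assumption):

```python
import numpy as np

def cc(img, ref):
    """Correlation coefficient between a pan-sharpened image and its
    reference, averaged over spectral bands; both arrays are (H, W, C)."""
    vals = []
    for b in range(img.shape[-1]):
        x = img[..., b].ravel()
        y = ref[..., b].ravel()
        cov = np.mean((x - x.mean()) * (y - y.mean()))
        vals.append(cov / np.sqrt(x.var() * y.var()))
    return float(np.mean(vals))
```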
• Q4. The quality index Q4 [40] is the extension of the Q index [30]. Q4 is defined as Q4 = 4 |Cov(z_1, z_2)| · |Mean(z_1)| · |Mean(z_2)| / [ (Var(z_1) + Var(z_2)) · (|Mean(z_1)|^2 + |Mean(z_2)|^2) ], where z_1 and z_2 are two quaternions formed by the spectral vectors of the MS image, Cov(z_1, z_2) is the covariance between z_1 and z_2, Var(n) denotes the variance of n, and Mean(m) denotes the mean of m.
• RMSE. The root mean square error (RMSE) [25] is a frequently used measure of the difference between the pan-sharpened image I and the reference image Y: RMSE = sqrt( (1 / (w · h)) · Σ (I − Y)^2 ), where w and h are the width and height of the pan-sharpened image.
• RASE. The relative average spectral error (RASE) [41] estimates the global spectral quality of the pan-sharpened image. It is defined as RASE = (100 / M) · sqrt( (1 / N) · Σ_{i=1}^{N} RMSE(B_i)^2 ), where RMSE(B_i) is the root mean square error between the i-th band of the pan-sharpened image and the i-th band of the reference image, and M is the mean value of the N spectral bands.
• ERGAS. The relative global dimensional synthesis error (ERGAS) [42] is a commonly used index of global quality. It is computed as ERGAS = 100 · (h / l) · sqrt( (1 / N) · Σ_{i=1}^{N} ( RMSE(B_i) / Mean(B_i) )^2 ), where h is the resolution of the pan-sharpened image, l is the resolution of the low-spatial-resolution image, RMSE(B_i) is as above, Mean(B_i) is the mean value of the i-th band of the low-resolution MS image, and N is the number of spectral bands.
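Sketches of RMSE, RASE and ERGAS in NumPy, following the definitions above; taking the ratio h/l as 1/4 for 4x pan-sharpening is our assumption about the convention:

```python
import numpy as np

def rmse(img, ref):
    """Root mean square error over all pixels."""
    return float(np.sqrt(np.mean((img - ref) ** 2)))

def rase(img, ref):
    """Relative average spectral error (percent); arrays are (H, W, N)."""
    band_rmse = [rmse(img[..., i], ref[..., i]) for i in range(img.shape[-1])]
    m = ref.mean()  # mean value over the N spectral bands
    return 100.0 / m * float(np.sqrt(np.mean(np.square(band_rmse))))

def ergas(img, ref, hl_ratio=1.0 / 4):
    """Relative global dimensional synthesis error for a 4x scale factor."""
    terms = [(rmse(img[..., i], ref[..., i]) / ref[..., i].mean()) ** 2
             for i in range(img.shape[-1])]
    return 100.0 * hl_ratio * float(np.sqrt(np.mean(terms)))
```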

Network Structure
To achieve the feedback mechanism, we need to carry useful deep features back to refine the less powerful shallow features. Therefore, three parts are necessary: (1) leveraging the recurrent structure to achieve an iterative process, which allows powerful deep features to flow back and modify the poor low-level features; (2) providing the low-resolution MS image at each time step, which supplies the low-level features that need improving; and (3) attaching the loss to force the network to reconstruct the HRMS image at each time step, which ensures that the feedback features contain useful information from the coarsely reconstructed HRMS image for reconstructing a better HRMS image. As illustrated in Figure 1, the proposed TPNwFB can be unfolded into T time steps, placed in chronological order for a clear illustration. To enforce that the feedback information in TPNwFB carries useful information for improving the low-level features, we attach the loss function to the output of the network at every time step. The loss function for TPNwFB is discussed in Section 3.5. The network at each time step t can be roughly divided into three parts: (1) a two-path block, to extract features from the MS image and the PAN image separately; (2) a feature extraction block, to generate powerful deep features through multiple upsampling-downsampling pairs and dense skip connections; and (3) a reconstruction block, to reconstruct the HRMS image. Note that the parameters are shared across all time steps to keep the network consistent. With the global skip connection at each time step, the network at time step t recovers a residual image I_res^t, which is the difference between the HRMS image and the upsampled LRMS image. We denote a convolutional layer with kernel size s × s as Conv_{s,n}(·), where n is the number of kernels. The output of a convolutional layer has the same spatial size as the input unless stated otherwise. Similarly, Deconv_{s,n}(·) denotes a deconvolutional layer with n filters and kernel size s × s.
The two-path block consists of two sub-networks to extract features from the MS image and the PAN image, respectively. The path that takes the four-band MS image as input is denoted as "the MS path"; the other path, which takes the single-band PAN image as input, is denoted as "the PAN path". The MS path consists of one Conv_{3,256}(·) layer and one Conv_{1,64}(·) layer to extract features F_MS from the MS image: F_MS = Conv_{1,64}( Conv_{3,256}( I_MS ) ), where I_MS is the MS image. The PAN path contains two successive Conv_{3,64}(·) layers with different parameters. The two convolutional layers in the PAN path have a stride of 2 to downsample the PAN image and extract features F_PAN from it: F_PAN = Conv_{3,64}( Conv_{3,64}( I_PAN ) ), where I_PAN is the PAN image. Finally, we concatenate F_MS and F_PAN to form the low-level features F_i^t at time step t: F_i^t = [F_MS, F_PAN], where [·, ·] refers to the concatenation operation. F_i^t is then used as the input of the following feature extraction block. Note that F_i^1 is considered as the initial feedback information F_fb^0. The feature extraction block at the t-th time step receives the feedback information F_fb^{t−1} from the (t−1)-th time step and the low-level features F_i^t from the t-th time step. F_fb^t represents the feedback information, which is also the output of the feature extraction block at the t-th time step. The output of the feature extraction block at the t-th time step can be formulated as follows.
F_fb^t = f_FEB( F_fb^{t−1}, F_i^t ),

where f_FEB(·, ·) denotes the nested functions of the feature extraction block. The details of the feature extraction block are stated in Section 3.4.
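The two-path block described above can be sketched in PyTorch as follows; the channel numbers and strides follow the text, and the PReLU placement follows the implementation details, but the module is a sketch rather than the authors' code:

```python
import torch
import torch.nn as nn

class TwoPathBlock(nn.Module):
    """Two-path block: the MS path keeps the LRMS spatial size, while the
    PAN path downsamples the PAN image by 4 (two stride-2 convolutions),
    so the two feature maps can be concatenated."""
    def __init__(self):
        super().__init__()
        self.ms_path = nn.Sequential(
            nn.Conv2d(4, 256, 3, padding=1), nn.PReLU(),
            nn.Conv2d(256, 64, 1), nn.PReLU())
        self.pan_path = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.PReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.PReLU())

    def forward(self, i_ms, i_pan):
        # concatenate F_MS and F_PAN along the channel dimension
        return torch.cat([self.ms_path(i_ms), self.pan_path(i_pan)], dim=1)
```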
The reconstruction part contains one Deconv_{8,64}(·) layer with a stride of 4 and a padding of 2, followed by a Conv_{3,4}(·) layer. The Deconv_{8,64}(·) layer upsamples the low-resolution features by a scale factor of 4, and the Conv_{3,4}(·) layer outputs I_res^t. The reconstruction part can be formulated as I_res^t = f_re( F_fb^t ), where f_re(·) denotes the nested functions of the reconstruction part. Thus, the pan-sharpened image I_out^t at time step t can be obtained by I_out^t = f_up( I_MS ) + I_res^t, where f_up(·) is the upsampling operation that upsamples the LRMS image by a scale factor of 4.
The choice of the upsampling kernel can be arbitrary; in this paper, we simply choose the bilinear upsampling kernel. After T time steps, we obtain T pan-sharpened images in total (I_out^1, I_out^2, …, I_out^t, …, I_out^T).
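The reconstruction step and the global skip connection can be sketched as below; the layer shapes follow the text, while the module name is ours:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReconstructionBlock(nn.Module):
    """Reconstruction part: a stride-4 deconvolution upsamples the
    64-channel features by 4, a 3x3 conv maps them to the 4-band residual
    image, and the bilinearly upsampled LRMS image is added back."""
    def __init__(self):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(64, 64, 8, stride=4, padding=2)
        self.conv = nn.Conv2d(64, 4, 3, padding=1)

    def forward(self, feats, i_ms):
        res = self.conv(self.deconv(feats))  # I_res^t
        up = F.interpolate(i_ms, scale_factor=4,
                           mode="bilinear", align_corners=False)
        return up + res                      # I_out^t
```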

Feature Extraction Block
Figure 2 shows the feature extraction block (FEB). The FEB at time step t receives the powerful deep features F_fb^{t−1} from the (t−1)-th time step and the low-level features F_i^t from the t-th time step; F_fb^{t−1} is used to refine F_i^t. The FEB then generates more powerful deep features F_fb^t, which are passed to the reconstruction block and to the next time step. The FEB consists of G projection pairs, with dense skip connections linking the pairs. Each projection pair mainly comprises a deconvolutional layer to upsample the features and a convolutional layer to downsample them. With multiple projection pairs, we iteratively up- and downsample the input features to achieve the back-projection mechanism, which enables the FEB to generate more powerful features.
The feature extraction block at the t-th time step receives the low-level features F_i^t and the feedback information F_fb^{t−1}. To refine F_i^t with F_fb^{t−1}, we concatenate them and use a Conv_{1,64}(·) layer to compress the concatenated features, generating the refined low-level features F_L0^t: F_L0^t = Conv_{1,64}( [F_i^t, F_fb^{t−1}] ), where [·, ·] denotes the concatenation operation.
The upsampled and downsampled features produced by the g-th projection pair at the t-th time step are denoted as F_Hg^t and F_Lg^t, respectively. F_Hg^t can be obtained by F_Hg^t = Deconv_{8,64}( [F_L0^t, …, F_L(g−1)^t] ), where Deconv_{8,64}(·) is the deconvolutional layer of the g-th projection pair with a kernel size of 8, a stride of 4 and a padding of 2. Correspondingly, F_Lg^t can be obtained by F_Lg^t = Conv_{8,64}( [F_H1^t, …, F_Hg^t] ), where Conv_{8,64}(·) is the convolutional layer of the g-th projection pair with a kernel size of 8, a stride of 4 and a padding of 2. Note that, except for the first projection pair, we add a Conv_{1,64}(·) layer before Deconv_{8,64}(·) and Conv_{8,64}(·) for feature fusion and computational efficiency. To fully exploit the features produced by each projection pair and to match the size of the input low-level features F_i^{t+1} at the next time step, we use a Conv_{1,64}(·) layer to fuse all the downsampled features produced by the projection pairs and generate the output F_fb^t of the feature extraction block at the t-th time step: F_fb^t = Conv_{1,64}( [F_L1^t, …, F_LG^t] ).
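A sketch of the FEB with G projection pairs follows. The exact wiring of the dense skip connections and 1×1 fusion convolutions is our reading of the text (following the design of [27]) and may differ in detail from the authors' implementation; activations are omitted for brevity:

```python
import torch
import torch.nn as nn

class FEB(nn.Module):
    """Feature extraction block: feedback features are fused with the
    low-level features by a 1x1 conv, then G up/down projection pairs
    (stride-4 deconv / stride-4 conv, kernel 8) with dense skip
    connections refine them; a final 1x1 conv fuses the downsampled
    features into F_fb^t."""
    def __init__(self, G=6, ch=64):
        super().__init__()
        self.compress_in = nn.Conv2d(2 * ch, ch, 1)
        self.up_fuse, self.down_fuse = nn.ModuleList(), nn.ModuleList()
        self.ups, self.downs = nn.ModuleList(), nn.ModuleList()
        for g in range(G):
            # 1x1 fusion convs before each projection, except the first pair
            self.up_fuse.append(nn.Conv2d((g + 1) * ch, ch, 1) if g > 0 else nn.Identity())
            self.down_fuse.append(nn.Conv2d((g + 1) * ch, ch, 1) if g > 0 else nn.Identity())
            self.ups.append(nn.ConvTranspose2d(ch, ch, 8, stride=4, padding=2))
            self.downs.append(nn.Conv2d(ch, ch, 8, stride=4, padding=2))
        self.compress_out = nn.Conv2d(G * ch, ch, 1)

    def forward(self, f_in, f_fb):
        x = self.compress_in(torch.cat([f_in, f_fb], dim=1))  # F_L0^t
        lows, highs = [x], []
        for g in range(len(self.ups)):
            h = self.ups[g](self.up_fuse[g](torch.cat(lows, dim=1)))    # F_Hg^t
            highs.append(h)
            l = self.downs[g](self.down_fuse[g](torch.cat(highs, dim=1)))  # F_Lg^t
            lows.append(l)
        return self.compress_out(torch.cat(lows[1:], dim=1))  # F_fb^t
```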

Loss Function
The network structure is one important factor affecting the quality of the pan-sharpened image; the loss function is another. Many previous single-image super-resolution and pan-sharpening methods use the L2 loss function to optimize the parameters of the network [7,22,43,44]. However, the L2 loss function may lead to unsatisfactory artifacts in flat areas, because it tends to settle in a poor local minimum, whereas the L1 loss function can reach a better minimum. Besides, the L1 loss function preserves colors and luminance better than the L2 loss function does [45]. Therefore, we choose the L1 loss function to optimize the parameters of the proposed network. Since we have T time steps in one iteration and attach the L1 loss to the output of every time step, we have T pan-sharpened images in total (I_out^1, I_out^2, …, I_out^T). T ground-truth HRMS images (I_HRMS^1, I_HRMS^2, …, I_HRMS^T) are paired with the T outputs; note that (I_HRMS^1, I_HRMS^2, …, I_HRMS^T) are identical to each other. The total loss function can be written as L(Θ) = (1 / (T · M)) · Σ_{t=1}^{T} Σ_{m=1}^{M} ‖ I_HRMS,m − I_out,m^t ‖_1, where Θ denotes the parameters of the network and M denotes the number of samples in each training batch.
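Assuming equal weighting of the time steps, the multi-step L1 objective can be sketched as:

```python
import torch
import torch.nn.functional as F

def total_loss(outputs, hrms):
    """L1 loss attached to every time step; `outputs` is the list
    (I_out^1, ..., I_out^T) and `hrms` is the shared reference image.
    Equal weighting across time steps is an assumption."""
    return sum(F.l1_loss(o, hrms) for o in outputs) / len(outputs)
```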

Impacts of G and T
We study the impacts of the number of time steps (denoted as T) and the number of projection pairs in the feature extraction block (denoted as G).
First, we study the impact of T with G fixed. As shown in Figure 3, the network with feedback connections achieves an improvement in pan-sharpening performance over the one without feedback (T = 1, the yellow line in Figure 3). Besides, the pan-sharpening performance can be further improved as T increases. Therefore, the network benefits from the feedback information across time steps.
Then, we investigate the impact of G with T fixed. Figure 4 shows that better pan-sharpening performance can be achieved with a larger value of G, owing to the stronger feature extraction ability of a deeper network. Therefore, a larger T or G can both lead to more satisfying results. For simplicity, we choose T = 4 and G = 6 for the analysis in the following subsections.

Comparisons with Other Methods
In this subsection, to evaluate the effectiveness of our proposed method, we compare TPNwFB with six other pan-sharpening methods: LS [46], MMP [47], PRACS [19], SRDIP [48], ResTFNet-l1 [5] and BDPN [25]. The objective results on the five datasets are reported in Tables 2-6, respectively. The pan-sharpening results for each dataset are obtained by averaging the results over all the test images. The best performance is shown in bold and the second best is underlined. From those tables, we observe that the proposed TPNwFB outperforms the competing pan-sharpening methods by a large margin on all evaluation indexes. Besides, the proposed TPNwFB consistently gives the best results on all datasets, while the performance of the other pan-sharpening methods varies between datasets. This indicates the superiority of our proposed method. On all datasets, the proposed TPNwFB provides the best CC and RMSE results, which indicates that the pan-sharpened image produced by TPNwFB is the closest to the reference image; that is to say, we have successfully enhanced the spatial resolution. Moreover, with the best results on SAM and RASE, TPNwFB keeps good spectral quality after pan-sharpening. In terms of global quality (ERGAS and Q4), TPNwFB also achieves the best performance.
We also provide some subjective results, as shown in Figures 5-7. From Figure 5, we can see that our proposed method gives clearer details and sharper edges, which is crucial for accurate monitoring. In Figure 6, our proposed method does not introduce color distortion and brings fewer artifacts; the pan-sharpened image produced by TPNwFB presents the clearest outline of the building compared with the other pan-sharpening methods. In Figure 7, our proposed method presents the clearest white line without color distortion. The aforementioned comparisons demonstrate the effectiveness and robustness of our proposed TPNwFB.

Discussions on Loss Functions
In this subsection, we compare the TPNwFB trained with the l1 loss function (denoted as TPNwFB-l1) with the TPNwFB trained with the l2 loss function (denoted as TPNwFB-l2). The subjective and objective results are shown in Figure 8. It can be seen that the pan-sharpened image generated by TPNwFB-l1 gives clearer details and sharper edges around water bodies compared with the ones produced by TPNwFB-l2. The objective results also suggest that the spatial and spectral quality of the pan-sharpened image can be improved by the l1 loss function. We choose the l1 loss function as the default for the analysis in the following experiments.

Discussions on Feedback Mechanism
To investigate the feedback mechanism in the proposed network, we trained a feedforward counterpart of TPNwFB. By disconnecting the losses from the 1st to the (T−1)-th time step (the loss attached to the final T-th step is kept), the network becomes unable to refine the low-level features with useful information that carries a notion of the pan-sharpened image. The feedback network TPNwFB thus degrades to its feedforward counterpart, denoted as TPN-FF. TPN-FF still keeps the recurrent structure and can output four intermediate pan-sharpened images; note that these four images have no loss to supervise their quality. We then compare the SAM, CC and Q4 values of all intermediate pan-sharpened images from TPNwFB and TPN-FF. The results are shown in Table 7, from which we make two observations. The first is that the feedback network TPNwFB outperforms TPN-FF at every time step. Since both networks keep the recurrent structure, this indicates that the proposed TPNwFB actually benefits from the feedback connections rather than from the recurrent structure itself. The second is that the proposed TPNwFB shows high pan-sharpening quality already at early time steps, owing to the feedback mechanism.
To show how the feedback mechanism affects the pan-sharpening performance, we visualize the output of the feature extraction block at each time step in TPNwFB and TPN-FF, as shown in Figure 9. The visualizations are the channel-wise mean of F_fb^t, which represents the output of the feature extraction part at time step t. With the global residual connections, we aim at recovering the residual image, i.e., predicting the high-frequency components. From Figure 9, we can see that the feedback network TPNwFB produces feature maps F_fb^t with more negative values than TPN-FF, showing a powerful ability to suppress the smooth areas, which further leads to more high-frequency components. Besides, the features produced by TPN-FF vary significantly from the first time step to the last: edges are outlined at early time steps and smooth areas are suppressed at late time steps. In contrast, with feedback connections, the proposed TPNwFB can perform a self-correcting process, since it obtains well-developed feature maps at early time steps. This different pattern indicates that TPNwFB can reroute deep features to refine shallow features; consequently, the shallow layers develop better representations at late time steps, improving the pan-sharpening performance.

Conclusions
In this paper, we propose a two-path network with feedback connections for pan-sharpening (TPNwFB). In the proposed TPNwFB, the PAN and the LRMS images are processed separately to make full use of both images. Besides, the powerful deep features, which contain useful information from the coarsely reconstructed HRMS image at early time steps, are carried back through feedback connections to refine the low-level features. The loss function is attached to every time step to ensure the feedback information contains a notion of the pan-sharpened image. Furthermore, a special feature extraction block is used to extract powerful deep features and to effectively handle the feedback information. With feedback connections, the proposed TPNwFB can perform a self-correcting process, since it obtains well-developed feature maps at early time steps. Extensive experiments have demonstrated the effectiveness of our proposed method on pan-sharpening.

Figure 1.
Figure 1. The architecture of our proposed TPNwFB. The red arrows denote the feedback connections. The concatenation symbol denotes concatenating the features from the LRMS image and the features from the PAN image. ⊕ denotes element-wise addition. Conv_s×s denotes a convolutional layer with kernel size s × s.

Figure 3 .
Figure 3. The analysis of T with G fixed to 6. The figure gives the CC values on the Spot-6 dataset.

Figure 4 .
Figure 4. The analysis of G with T fixed to 4. The figure gives the CC values on the Spot-6 dataset.

Figure 8 .
Figure 8. Comparisons of TPNwFB networks trained with different loss functions.

Figure 9 .
Figure 9. The visualizations of F_fb^t at each time step in TPN-FF and TPNwFB.

Table 1 .
The spectral and spatial characteristics of the PAN and MS images for the five datasets.

Table 2 .
The objective results on the Spot-6 dataset.

Table 3 .
The objective results on the Pléiades dataset.

Table 4 .
The objective results on the IKONOS dataset.

Table 5 .
The objective results on the WorldView-2 dataset.

Table 6 .
The objective results on the QuickBird dataset.

Table 7 .
The impact of the feedback mechanism on the Spot-6 dataset.