PCDRN: Progressive Cascade Deep Residual Network for Pansharpening

Pansharpening is the process of fusing a low-resolution multispectral (LRMS) image with a high-resolution panchromatic (PAN) image. In the process of pansharpening, the LRMS image is often directly upsampled by a scale of 4, which may result in the loss of high-frequency details in the fused high-resolution multispectral (HRMS) image. To solve this problem, we put forward a novel progressive cascade deep residual network (PCDRN) with two residual subnetworks for pansharpening. The network adjusts the size of an MS image to the size of a PAN image twice and gradually fuses the LRMS image with the PAN image in a coarse-to-fine manner. To prevent an overly-smooth phenomenon and achieve high-quality fusion results, a multitask loss function is defined to train our network. Furthermore, to eliminate checkerboard artifacts in the fusion results, we employ a resize-convolution approach instead of transposed convolution for upsampling LRMS images. Experimental results on the Pléiades and WorldView-3 datasets prove that PCDRN exhibits superior performance compared to other popular pansharpening methods in terms of quantitative and visual assessments.


Introduction
Remote sensing satellites such as Pléiades, WorldView, and GeoEye provide low spatial resolution multispectral (LRMS) and high spatial resolution panchromatic (PAN) images. To obtain a fused high-resolution multispectral (HRMS) image by fusing an LRMS image with a PAN image of the same scene, pansharpening is considered a powerful image fusion technique. The HRMS image can effectively integrate the spectral characteristics of the LRMS image with the spatial information of the PAN image [1,2].
In recent decades, numerous approaches have been put forward for pansharpening. The conventional methods for pansharpening can be classified into three major categories: component substitution (CS)-based methods, multiresolution analysis (MRA)-based methods, and model-based methods. The CS-based methods primarily include the intensity-hue-saturation (IHS) method [3,4], the principal component analysis method (PCA) [5], and the Gram-Schmidt (GS) transform-based method [6]. Although CS-based methods can usually be quickly and easily implemented [7], obvious spectral distortions may be produced in the spectral domain of the fused image [1]. The MRA-based methods primarily include the Laplacian Pyramid method [8], à trous wavelet transform (ATWT) method [9], discrete wavelet transform (DWT) [10], and non-subsampled Contourlet transform (NSCT) [11]. MRA-based methods generally outperform CS-based methods in spectral preservation. However, they often lead to the problem of spatial distortion [12]. There are several

Residual Network
He et al. [22] proposed a residual learning network architecture, which is substantially deeper than the plain network. Figure 1 illustrates the structure of a residual block. It not only makes the network deeper but also overcomes the vanishing gradient problem of the plain network.
Formally, the residual block is represented as

y = F(x, {W_i}) + x,

where x and y are the input and output vectors of the layers considered, respectively, and the function F(x, {W_i}) denotes the residual mapping to be learned. As shown in Figure 1, F(x) is defined as

F(x) = W_2 ⊗ R(W_1 ⊗ x),

where W_1 denotes the weight of the first layer, W_2 denotes the weight of the second layer, ⊗ denotes convolution, and R denotes the ReLU activation function, R(x) = max(0, x).
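As a concrete illustration, a residual block for a single-channel image with 3 × 3 kernels can be sketched in NumPy as follows. This is a toy version under stated simplifications: real residual blocks operate on multi-channel feature maps with learned weights, whereas here the kernels are supplied directly.

```python
import numpy as np

def relu(x):
    # R(x) = max(0, x)
    return np.maximum(0.0, x)

def conv2d(x, w):
    # 'same' convolution of a single-channel image x with a k x k kernel w
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, pad)
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * w)
    return out

def residual_block(x, w1, w2):
    # y = F(x, {W_i}) + x, with F(x) = W_2 (*) R(W_1 (*) x)
    return conv2d(relu(conv2d(x, w1)), w2) + x
```

With zero kernels the residual mapping F vanishes and the block reduces to the identity, which is exactly the property that makes very deep residual networks trainable.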

Universal Image Quality Index
Wang et al. [23] presented a universal image quality index (UIQI), which is used to measure the structure distortion degree. It is composed of three factors: loss of correlation, luminance distortion, and contrast distortion.
The quality index is defined as

UIQI = (4 σ_FR μ_F μ_R) / ((σ_F^2 + σ_R^2)(μ_F^2 + μ_R^2)),

where σ_F and σ_R denote the standard deviations of the fused image F and the reference image R, respectively, σ_FR denotes the covariance of F and R, and μ_F and μ_R denote the means of F and R, respectively. The optimum value of UIQI is 1.
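The index can be computed directly from these three factors. The sketch below evaluates the global (whole-image) version of UIQI; note that in practice the index is usually computed over sliding windows and averaged.

```python
import numpy as np

def uiqi(f, r):
    """Universal image quality index of fused image f against reference r.

    UIQI = (4 * sigma_fr * mu_f * mu_r) /
           ((sigma_f**2 + sigma_r**2) * (mu_f**2 + mu_r**2)),
    combining correlation loss, luminance distortion, and contrast distortion.
    """
    f = f.astype(np.float64).ravel()
    r = r.astype(np.float64).ravel()
    mu_f, mu_r = f.mean(), r.mean()
    var_f, var_r = f.var(), r.var()
    cov_fr = np.mean((f - mu_f) * (r - mu_r))
    return 4 * cov_fr * mu_f * mu_r / ((var_f + var_r) * (mu_f**2 + mu_r**2))
```

For identical images the index equals 1; any luminance, contrast, or correlation mismatch pulls it below 1.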

Proposed Method
In this section, we present the proposed PCDRN for pansharpening, which exploits two residual subnetworks (ResNet1 and ResNet2) to extract accurate features and progressively inject the details of the PAN image into the MS image in a coarse-to-fine manner. A multitask loss function is proposed to prevent the over-smoothing phenomenon and preserve spatial information. Furthermore, to address the problem of checkerboard artifacts in the fusion results, resize-convolution is adopted rather than transposed convolution in the upsampling of LRMS images.

Flowchart of PCDRN
To inject additional spatial information of the PAN image into the LRMS image, we design the PCDRN for pansharpening. As shown in Figure 2, PCDRN consists of two residual subnetworks ResNet1 and ResNet2, which are progressively cascaded to learn nonlinear feature mapping from LRMS and PAN images to HRMS images.
In our experiments, the PCDRN was implemented through three stages. Stage 1: The LRMS images are upsampled by a scale of 2 with nearest-neighbor interpolation, and the PAN images are downsampled by a scale of 2. The upsampled MS images are then concatenated with the downsampled PAN images to form the 5-band inputs.

Stage 2: The 5-band inputs are fed into ResNet1 to extract coarse features. Each ResNet comprises 2 convolutional layers and 5 residual blocks. An element-wise sum is then performed on the feature maps of ResNet1 and the upsampled LRMS image, channel by channel. Subsequently, 1 × 1 convolutional layers are utilized to reduce the spectral dimensionality from 64 bands to 4 bands. The 4-band results are upsampled by a scale of 2 using nearest-neighbor interpolation and then concatenated with the PAN images to obtain new 5-band inputs.
Stage 3: The new 5-band inputs are entered into ResNet2, which has the same structure as ResNet1, to extract finer features. After the LRMS image has been upsampled twice, an element-wise sum is performed on the results of ResNet2 and the upsampled LRMS image, channel by channel. The fused HRMS image is finally obtained by a 1 × 1 convolutional layer and a tanh activation function. In addition, 1 × 1 convolution is employed several times in the network to reduce the number of network parameters and boost the performance of PCDRN.
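To make the three-stage data flow concrete, the following NumPy sketch walks the tensor shapes through the cascade. It is a shape-level illustration only: `resnet_stub` stands in for the real ResNet1/ResNet2 subnetworks (it merely emits 64-band feature maps), `conv1x1` uses random weights, the PAN downsampling is a crude stride-2 decimation, and the placement of the 1 × 1 reduction relative to the skip sum is one plausible reading of the text.

```python
import numpy as np

def upsample_nn(x, s=2):
    # nearest-neighbor upsampling of an (H, W, C) array by factor s
    return np.repeat(np.repeat(x, s, axis=0), s, axis=1)

def resnet_stub(x, out_ch=64):
    # stand-in for ResNet1/ResNet2 (2 conv layers + 5 residual blocks);
    # here we only model the 64-channel feature maps it produces
    h, w = x.shape[:2]
    return np.zeros((h, w, out_ch))

def conv1x1(x, out_ch):
    # 1x1 convolution modeled as a per-pixel linear map (random weights)
    w = np.random.randn(x.shape[-1], out_ch) * 0.01
    return x @ w

def pcdrn_forward(lrms, pan):
    # Stage 1: x2-upsampled MS + x2-downsampled PAN -> 5-band input
    ms2 = upsample_nn(lrms, 2)                 # (2h, 2w, 4)
    pan2 = pan[::2, ::2][..., None]            # crude x2 downsample, (2h, 2w, 1)
    inp1 = np.concatenate([ms2, pan2], -1)     # (2h, 2w, 5)
    # Stage 2: coarse features, 1x1 reduction to 4 bands, skip-sum with MS
    f1 = conv1x1(resnet_stub(inp1), 4) + ms2
    f1_up = upsample_nn(f1, 2)                 # x2 again, to PAN resolution
    inp2 = np.concatenate([f1_up, pan[..., None]], -1)
    # Stage 3: finer features, skip-sum with x4-upsampled MS, 1x1 conv + tanh
    ms4 = upsample_nn(lrms, 4)
    return np.tanh(conv1x1(resnet_stub(inp2), 4) + ms4)
```

Running a 64 × 64 × 4 LRMS patch with a 256 × 256 PAN patch through this sketch yields a 256 × 256 × 4 output bounded by the final tanh, matching the sizes reported in the experiments.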
To validate the advantages of PCDRN, a performance comparison between the single ResNet and PCDRN is presented in Figure 3. We observe that the two indices obtained by PCDRN are significantly better than those obtained by the single ResNet.
Figure 3. Average peak signal-to-noise ratio (PSNR) [24] and universal image quality index (UIQI) [23] of the single ResNet and PCDRN on 180 groups of the simulated dataset from Pléiades.


Multitask Loss Function
A mean squared error (MSE) loss function is usually applied in deep learning-based pansharpening methods. However, the MSE loss function often loses high-frequency details such as texture during the fusion process, which may result in poor perceptual quality and an over-smoothing phenomenon. To address this problem, we design a novel multitask loss function comprising an MSE loss and a universal image quality index (UIQI) loss. The UIQI is often used to measure the degree of structural distortion in image quality evaluation. Based on this characteristic, the UIQI loss in our network is designed to preserve structural information. Therefore, our multitask loss function L_Fusion can improve the performance of the fusion network in preserving spatial details.
The L_Fusion is represented as

L_Fusion = L_Fusion^NMSE + α · L_Fusion^NUIQI,    (4)

where L_Fusion^NMSE is the normalized MSE loss, L_Fusion^NUIQI is the normalized UIQI loss, and α denotes the weight coefficient.
The L_Fusion^NMSE and L_Fusion^NUIQI are defined as

L_Fusion^NMSE = (β/n) Σ_{i=1}^{n} ||F(x^(i)) − R^(i)||²,    (5)

and

L_Fusion^NUIQI = (γ/n) Σ_{i=1}^{n} L_UIQI^(i),    (6)

where β and γ denote the normalization coefficients, n denotes the number of training image groups, x^(i) denotes a low-resolution MS image, F(x^(i)) denotes a fused MS image, and R^(i) denotes a reference MS image. To roughly balance the contributions of the MSE and UIQI losses, a normalization procedure (Algorithm 1) is introduced: it computes ||F(x^(i)) − R^(i)||² and L_UIQI^(i) for each training pair and derives the coefficients β and γ; finally, L_Fusion^NMSE and L_Fusion^NUIQI are obtained by formulas (5) and (6), respectively. To illustrate the validity of the proposed loss function, we compare the performance of the MSE loss function with that of the MSE + UIQI loss function, as shown in Table 1. The results demonstrate that the PSNR value obtained using the proposed MSE + UIQI loss function is higher than that obtained using only the MSE loss function. Similarly, our network with the MSE + UIQI loss function also obtains a higher UIQI value than that obtained using the MSE loss function.
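A NumPy sketch of the multitask loss follows. It assumes the combination L_NMSE + α · L_NUIQI with the per-pair UIQI loss taken as 1 − UIQI, and replaces the normalization coefficients β and γ of Algorithm 1 with plain constants; these are assumptions consistent with, but not fully specified by, the text.

```python
import numpy as np

def uiqi(f, r):
    # global universal image quality index of fused f against reference r
    mu_f, mu_r = f.mean(), r.mean()
    var_f, var_r = f.var(), r.var()
    cov = np.mean((f - mu_f) * (r - mu_r))
    return 4 * cov * mu_f * mu_r / ((var_f + var_r) * (mu_f**2 + mu_r**2))

def fusion_loss(fused, refs, alpha=0.1, beta=1.0, gamma=1.0):
    """Multitask loss sketch: L = L_NMSE + alpha * L_NUIQI over n pairs.

    beta and gamma stand in for the paper's normalization coefficients
    (Algorithm 1); here they are plain constants.
    """
    n = len(fused)
    l_nmse = beta / n * sum(np.mean((f - r) ** 2) for f, r in zip(fused, refs))
    l_nuiqi = gamma / n * sum(1.0 - uiqi(f, r) for f, r in zip(fused, refs))
    return l_nmse + alpha * l_nuiqi
```

When the fused and reference images coincide, both terms vanish; any structural mismatch increases the UIQI term even when the pixel-wise MSE is small, which is the motivation for the combined loss.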

Resize-Convolution
Unlike traditional convolution, transposed convolution forms its connectivity in the backward direction [25] and is usually employed for upsampling images. In addition, unlike the weights of a predefined traditional filter, the weights in transposed convolution are learnable. However, transposed convolution can easily produce uneven overlap, which may result in checkerboard artifacts in the image. Therefore, the resize-convolution approach, which is known to be robust against checkerboard artifacts [26], is employed to avoid the problem of uneven overlap. The approach involves resizing an image using nearest-neighbor interpolation and then executing a convolution operation. To better illustrate this phenomenon, we carried out a group of experiments to compare the performance of fused images that employ transposed convolution versus resize-convolution, as shown in Figure 4. In Figure 4, a small region is enlarged and shown at the bottom left for better visualization. From the enlarged box in Figure 4a, we can observe obvious checkerboard artifacts. From the enlarged box in Figure 4b, we can observe that the region is smoother than that in Figure 4a.
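A minimal sketch of the resize-convolution upsampling described above: a nearest-neighbor resize by a factor of 2 followed by a 'same' convolution. Unlike a stride-2 transposed convolution, every output pixel here is produced by the same convolution pattern, so no uneven overlap arises. A single-channel input and a supplied kernel are assumed for brevity.

```python
import numpy as np

def resize_convolution(x, w):
    """Upsample x2 by nearest-neighbor resize, then 'same' convolution."""
    up = np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)  # nearest-neighbor x2
    k = w.shape[0]
    pad = k // 2
    upp = np.pad(up, pad)
    out = np.zeros_like(up)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(upp[i:i + k, j:j + k] * w)
    return out
```

A constant image upsampled this way stays constant away from the zero-padded border, whereas an unevenly overlapping transposed convolution would imprint a periodic checkerboard pattern.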

Table 2 shows the quantitative assessment of the results in Figure 4, in which bold represents the best value. As we can observe from the table, the experimental results obtained using resize-convolution are better than those obtained using transposed convolution in most image quality indexes, including PSNR [24], the correlation coefficient (CC) [27], UIQI [23], the spectral angle mapper (SAM) [28], and the erreur relative globale adimensionnelle de synthèse (ERGAS) [29], with the exception of the Q2^n [30] index.
To further demonstrate the superiority of the resize-convolution method, the polynomial interpolation (EXP) [31], transposed convolution, and resize-convolution methods were evaluated on 180 groups of the simulated dataset from Pléiades through a single ResNet. Figure 5 shows the average PSNR and UIQI of these three methods. From Figure 5, we find that the single ResNet with resize-convolution outperforms the other two methods on PSNR. In addition, it not only achieves higher UIQI values than the method with transposed convolution but also achieves UIQI values similar to those of the method with EXP. Thus, on the whole, the single ResNet with resize-convolution is superior to the methods with transposed convolution and EXP.

Datasets
In the experiments, the MS and PAN images were captured by the Pléiades and WorldView-3 satellites. The MS images of the Pléiades satellite contain four bands: red, green, blue, and near-infrared. We selected the red, green, blue, and near-infrared 1 bands from the MS images of the WorldView-3 satellite to form a new 4-band MS image. The spatial resolutions of the Pléiades and WorldView-3 datasets are described in Table 3. The source images were degraded to lower resolution by a factor of 4 following Wald's protocol [32,33]. In this approach, the original MS images were degraded by using a low-pass filter matched with the modulation transfer function (MTF) of the MS sensor, and the original PAN images were degraded by using the bicubic method. The sizes of the original MS and PAN images are 256 × 256 and 1024 × 1024 pixels, respectively. In training, the sizes of the degraded MS and PAN images are 64 × 64 and 256 × 256 pixels, respectively. The original MS images are considered reference images. To enhance the generalization of our network, the data augmentation approaches in [34] were used in the training process. The 180 groups of simulated data from the Pléiades satellite and the 108 groups of simulated data from the WorldView-3 satellite were used as test datasets. Each group of simulated data is composed of a 64 × 64 degraded MS image, a 256 × 256 degraded PAN image, and a 256 × 256 reference MS image. In addition, 165 groups of real data from the Pléiades satellite and 100 groups of real data from the WorldView-3 satellite were used to assess the performance of PCDRN. It should be noted that the test datasets were not used to train PCDRN.
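The degradation step of Wald's protocol can be sketched as a low-pass filter followed by decimation. The Gaussian kernel below merely stands in for the MTF-matched filter of the MS sensor (the paper's actual filter, and the bicubic degradation used for the PAN images, are not reproduced here), and the kernel width `sigma` is a hypothetical value.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    # normalized 2-D Gaussian kernel of size (2*radius+1)^2
    ax = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (ax / sigma) ** 2)
    k = np.outer(k, k)
    return k / k.sum()

def blur_same(x, k):
    # 'same' 2-D convolution with zero padding
    r = k.shape[0] // 2
    xp = np.pad(x, r)
    out = np.zeros_like(x, dtype=np.float64)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

def degrade(band, scale=4, sigma=1.7):
    # low-pass filter, then decimate by `scale` (Wald's protocol sketch)
    return blur_same(band, gaussian_kernel(sigma, 4))[::scale, ::scale]
```

Applied to a 256 × 256 band, this yields the 64 × 64 degraded MS input used for training, with the original band retained as the reference.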

Training Details
The training of PCDRN is performed for 200 epochs with a batch size of 9 using the adaptive moment estimation (Adam) optimizer. The learning rate ε is initialized to 8 × 10^−5 and is divided by a factor of 2 every 50 epochs. The biases in the network are initialized to zero.
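The learning-rate schedule described above (initial rate 8 × 10^−5, divided by 2 every 50 epochs) can be written as a small step-decay function:

```python
def learning_rate(epoch, base=8e-5, drop=2.0, every=50):
    """Step-decay schedule: start at `base`, divide by `drop` every `every` epochs."""
    return base / drop ** (epoch // every)
```

Over the 200 training epochs this yields four plateaus: 8e-5, 4e-5, 2e-5, and 1e-5.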
To further demonstrate the effectiveness of the multitask loss function, numerous experiments were performed using different α values on 180 sets of the simulated dataset from Pléiades. The average PSNR and UIQI of PCDRN for different α values are shown in Figure 6. As shown in Figure 6, the best performance is obtained when α is set to 0.1 in formula (4). PCDRN is implemented under TensorFlow 1.8 and TensorLayer 1.8, and the experiments were performed on an NVIDIA GeForce GTX 1080 Ti GPU.

Compared Methods
In this paper, PCDRN is compared with one interpolation-based method and eleven popular pansharpening methods.


Experiments on Simulated Data
In this subsection, to demonstrate the validity of PCDRN, we compared the visual effects and quantitative assessments of simulated experimental results obtained from different methods. Six metrics were used to evaluate the performance of PCDRN on simulated data from the Pléiades and WorldView-3 datasets, which include PSNR [24], CC [27], UIQI [23], Q2 n [30], SAM [28], and ERGAS [29].
To better observe the texture details in the fused images, there are two rectangle boxes marked in yellow, and the bigger box is the enlarged image of the smaller one. Note that the bold represents the best performance of quantitative assessment in experiments.

Experiments on Pléiades Dataset
A group of fusion results on simulated data from Pléiades are shown in Figure 7. Figure 7a is the reference image that was applied to assess the fused images. Figure 7b displays the degraded PAN image. In Figure 7c, the image upsampled by the EXP method has good spectral quality but exhibits serious spatial distortions. The corresponding fusion images obtained by the twelve pansharpening methods are presented in Figure 7d-o. As shown in Figure 7, the fusion results of the BT and MTF_GLP_HPM methods suffer from serious spectral distortions. The fusion results of the other compared methods also exhibit obvious spectral distortions, such as on the roof of the building in the enlarged box. However, compared to all the comparison pansharpening methods, the fused result of PCDRN is closest to the reference image by observation. Furthermore, from the corresponding quantitative assessment of Figure 7 shown in Table 4, PCDRN outperforms the other eleven pansharpening methods in all six metrics.
We also performed experiments on 180 sets of the simulated dataset from Pléiades. Table 5 tabulates the mean values of the experimental results. As shown in Table 5, PCDRN produces the best quantitative assessment results for most metrics among all pansharpening methods.
Figure 8 shows an example of the simulated experiment performed on the WorldView-3 dataset. The reference and PAN images are shown in Figure 8a,b, respectively.
The LRMS image upsampled using the EXP method is shown in Figure 8c, which is also blurred, as can be seen from the enlarged box. The twelve fused results are given in Figure 8d-o, from which we can observe that the results of the BT, MMMT, and DRPNN methods have serious spectral distortions. The fused results of the ATWT, GSA, MTF_GLP_CBD, GS, MTF_GLP_HPM, and ASIM methods exhibit some spectral distortions on the roofs of the buildings in the enlarged box. The AIHS method yields obvious spectral distortion and some artifacts that do not exist in the original image. Although the results of the MSDCNN method have good spatial quality, they show some spectral distortions in the enlarged box. In Figure 8, a red line between the white and red areas can be clearly observed in the results of PCDRN; however, in the results of the other methods, the red line is very blurred or even nonexistent. We can also observe that PCDRN is superior to the other methods in preserving spectral fidelity in the enlarged box. For the corresponding quantitative evaluation of Figure 8 (see Table 6), PCDRN obtains the best fusion values in most metrics.
Table 7 tabulates the average quantitative results on 108 groups of the simulated dataset from WorldView-3. As can be observed, PCDRN achieves the best fusion results in most metrics. Therefore, according to the above experimental results on the simulated data from the Pléiades and WorldView-3 satellites, we find that PCDRN is better than the other eleven pansharpening methods.

Experiments on Real Data
To demonstrate the effectiveness of our proposed method, we also compare the visual effects and quantitative evaluations of the real-data experimental results obtained from the twelve pansharpening methods. The twelve pansharpening methods are evaluated on the real datasets from Pléiades and WorldView-3 in terms of the quality with no reference (QNR) index [42]. The QNR metric is composed of the spectral distortion index D_λ and the spatial distortion index D_S. The optimum values of QNR, D_λ, and D_S are 1, 0, and 0, respectively. As in the experiments on simulated data, two yellow rectangular boxes are marked in each fused image to better observe the texture details; the bigger box is the enlarged view of the smaller one.
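Given the two distortion indices, QNR combines them multiplicatively. A small sketch, assuming the commonly used form QNR = (1 − D_λ)^α (1 − D_S)^β with both exponents defaulting to 1:

```python
def qnr(d_lambda, d_s, alpha=1.0, beta=1.0):
    """Quality with No Reference: QNR = (1 - D_lambda)^alpha * (1 - D_s)^beta.

    d_lambda: spectral distortion index, d_s: spatial distortion index;
    both lie in [0, 1], and QNR reaches its optimum of 1 only when both are 0.
    """
    return (1.0 - d_lambda) ** alpha * (1.0 - d_s) ** beta
```

Computing D_λ and D_S themselves involves inter-band and MS-to-PAN UIQI comparisons and is omitted here.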

Experiments on Pléiades Dataset
A set of fusion results on real data from Pléiades are shown in Figure 9. Figure 9a is the LRMS image upsampled using bicubic interpolation. Figure 9b displays the PAN image. Figure 9c is the LRMS image upsampled using the EXP method; it has good spectral quality but exhibits serious spatial distortions. As we can see in Figure 9d-o, the result of the AIHS method has some artifacts in the enlarged box. The BT and MTF_GLP_HPM methods yield serious spectral distortions. The results obtained by the ATWT, GSA, MTF_GLP_CBD, MMMT, GS, ASIM, and DRPNN methods have some spectral distortions because they are oversharpened in the process of pansharpening. The MSDCNN method produces slight spectral distortions, as seen in the enlarged box. Table 8 shows the quantitative assessment of the experimental results with real data in Figure 9. As can be observed, PCDRN obtains the best values in the QNR and D_λ metrics.
The average quantitative results on 165 groups of the real dataset from Pléiades are listed in Table 9. From Table 9, we can verify that PCDRN not only obtains the best results for QNR and D_λ but also obtains the second-best results for D_S among all the comparison methods.
Figure 10 displays a group of fusion results on real data from WorldView-3. The LRMS image upsampled using bicubic interpolation is presented in Figure 10a, and Figure 10b is the PAN image. The LRMS image upsampled using the EXP method is given in Figure 10c, from which we can observe that the EXP method yields obvious spatial distortions. From Figure 10d-o, we can observe that the AIHS method yields some artifacts in the enlarged box. The MTF_GLP_HPM, DRPNN, and MSDCNN methods produce obvious spectral distortions. The ATWT, GSA, BT, MTF_GLP_CBD, MMMT, GS, and ASIM methods yield some spectral distortions in the enlarged box.
For the corresponding quantitative evaluation of Figure 10 (see Table 10), PCDRN gains the best fusion values in the QNR and D_λ metrics. Table 11 tabulates the average quantitative results on 100 groups of the real dataset from WorldView-3. As we can see in Table 11, PCDRN obtains the optimal fusion results in all metrics.

Thus, from the above-mentioned experimental results on real data from the Pléiades and WorldView-3 satellites, we can conclude that PCDRN outperforms the other fusion methods in balancing spectral preservation and spatial enhancement.

Further Discussion
To further verify the performance of PCDRN, experiments were performed on 90 groups of simulated images selected from another scene of the Pléiades dataset. It should be noted that these images were not used to train PCDRN. An example of the simulated experiment performed on the Pléiades dataset is shown in Figure 11. Figure 11a,b show the reference image and the degraded PAN image, respectively. The LRMS image upsampled using the EXP method is shown in Figure 11c, which is blurred because the PAN image is not used. Figure 11d-o display the corresponding fused results. Among them, the results of the AIHS, ATWT, GSA, BT, MTF_GLP_CBD, MMMT, GS, MTF_GLP_HPM, and ASIM methods not only suffer from serious spectral distortion but also exhibit some artifacts. The DRPNN and MSDCNN methods produce some spectral distortion. As we can observe, the result of PCDRN is still the closest to the reference image among the 12 pansharpening methods by visual inspection. Table 12 tabulates the average quantitative results on the 90 groups of the simulated dataset from Pléiades. From Table 12, we can also find that PCDRN achieves the best values in all six image quality indexes, which again proves that the proposed PCDRN is effective.

Conclusions
In this paper, we presented a new deep learning-based approach for pansharpening, i.e., PCDRN. Different from other pansharpening approaches, we interpolated the LRMS images twice and extracted features from the source images using two residual subnetworks to inject details at two scales. To avoid the over-smoothing phenomenon, we designed a multitask loss function to train our network and achieve high-quality remote sensing image fusion. To eliminate checkerboard artifacts, a resize-convolution consisting of nearest-neighbor interpolation and a convolution layer was employed instead of transposed convolution for upsampling. Compared with other pansharpening methods, the experimental results demonstrated that PCDRN exhibits the best performance. In the future, we intend to develop an adaptive method for the multitask loss function to preserve additional spatial information.
