PCDRN: Progressive Cascade Deep Residual Network for Pansharpening

Yang, Yong; Tu, Wei; Huang, Shuying; Lu, Hangyuan

doi:10.3390/rs12040676

Open AccessArticle

PCDRN: Progressive Cascade Deep Residual Network for Pansharpening

¹

School of Information Technology, Jiangxi University of Finance and Economics, Nanchang 330032, China

²

School of Software and Communication Engineering, Jiangxi University of Finance and Economics, Nanchang 330032, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2020, 12(4), 676; https://doi.org/10.3390/rs12040676

Submission received: 14 January 2020 / Revised: 15 February 2020 / Accepted: 17 February 2020 / Published: 19 February 2020

(This article belongs to the Section Remote Sensing Image Processing)

Download

Browse Figures

Versions Notes

Abstract

Pansharpening is the process of fusing a low-resolution multispectral (LRMS) image with a high-resolution panchromatic (PAN) image. In the process of pansharpening, the LRMS image is often directly upsampled by a scale of 4, which may result in the loss of high-frequency details in the fused high-resolution multispectral (HRMS) image. To solve this problem, we put forward a novel progressive cascade deep residual network (PCDRN) with two residual subnetworks for pansharpening. The network adjusts the size of an MS image to the size of a PAN image twice and gradually fuses the LRMS image with the PAN image in a coarse-to-fine manner. To prevent an overly-smooth phenomenon and achieve high-quality fusion results, a multitask loss function is defined to train our network. Furthermore, to eliminate checkerboard artifacts in the fusion results, we employ a resize-convolution approach instead of transposed convolution for upsampling LRMS images. Experimental results on the Pléiades and WorldView-3 datasets prove that PCDRN exhibits superior performance compared to other popular pansharpening methods in terms of quantitative and visual assessments.

Keywords:

pansharpening; deep residual network; loss function

Graphical Abstract

1. Introduction

Remote sensing satellites such as Pléiades, WorldView, and GeoEye provide low spatial resolution multispectral (LRMS) and high spatial resolution panchromatic (PAN) images. To obtain a fused high-resolution multispectral (HRMS) image by fusing an LRMS image with a PAN image of the same scene, pansharpening is considered a powerful image fusion technique. The HRMS image can effectively integrate the spectral characteristics of the LRMS image with the spatial information of the PAN image [1,2].

In recent decades, numerous approaches have been put forward for pansharpening. The conventional methods for pansharpening can be classified into three major categories: component substitution (CS)-based methods, multiresolution analysis (MRA)-based methods, and model-based methods. The CS-based methods primarily include the intensity-hue-saturation (IHS) method [3,4], the principal component analysis method (PCA) [5], and the Gram–Schmidt (GS) transform-based method [6]. Although CS-based methods can usually be quickly and easily implemented [7], obvious spectral distortions may be produced in the spectral domain of the fused image [1]. The MRA-based methods primarily include the Laplacian Pyramid method [8], à trous wavelet transform (ATWT) method [9], discrete wavelet transform (DWT) [10], and non-subsampled Contourlet transform (NSCT) [11]. MRA-based methods generally outperform CS-based methods in spectral preservation. However, they often lead to the problem of spatial distortion [12]. There are several model-based methods that have been recently proposed [13,14,15,16]. Although model-based methods generally exhibit better spectral information preservation, they are characterized by high computational complexity and poor real-time performance.

Because deep learning can automatically learn additional features from various types of data, it has gained considerable attention in recent years [17,18,19,20,21]. Different from the conventional pansharpening methods, deep learning-based methods present more ideal solutions for improving the performance of pansharpening. Masi et al. [17] developed a pansharpening approach that is based on a three-layer convolutional neural network. PanNet was presented by Yang et al. [18], which achieved remote sensing image fusion by spectra-mapping and network parameters training. To obtain the high quality fused image, Wei et al. [19] presented a deep residual network (ResNet) for pansharpening. Shao et al. [20] developed a two-branch network that can separately obtain salient features from MS and PAN images. A multiscale and multidepth convolutional neural network (CNN) for pansharpening was introduced by Yuan et al. [12]. He et al. [21] proposed two convolutional neural network frameworks for pansharpening, i.e., DiCNN1 and DiCNN2, which can obtain a high quality fused image and converge quickly. In the above-mentioned pansharpening methods that are based on deep learning, the LRMS image is directly upsampled by a factor of 4 during the fusion process. However, this may result in loss of high-frequency details owing to the difficulty in learning nonlinear feature mapping. In addition, the mean squared error (MSE) loss function is often employed by many pansharpening methods through deep learning. However, it is difficult for MSE to capture the differences in high-frequency details between the fused image and the reference image. Consequently, this leads to the phenomenon of excessive smoothing and loss of high-quality details.

To overcome the above-mentioned limitations, a new progressive cascade deep residual network (PCDRN) for pansharpening was presented, which includes two residual subnetworks. Different from direct upsampling by a factor of 4 during the fusion process in other pansharpening methods, we first adopt two upsampling operations and then employ the two residual subnetworks to learn the nonlinear feature mapping from the source images to the ground truth in two scales. We finally realize the fusion of LRMS and PAN images in a step by step manner. To better train our network, a multitask loss function is designed to enable PCDRN to extract more precise features. Furthermore, to eliminate checkerboard artifacts in the fused image, the transposed convolution is replaced by the resize-convolution to upsample the LRMS image. Compared with several existing pansharpening methods, the experimental results demonstrate that PCDRN achieves better performance.

2. Related Work

2.1. Residual Network

He et al. [22] proposed a residual learning network architecture, which is substantially deeper than the plain network. Figure 1 illustrates the structure of a residual block. It not only makes the network deeper but also overcomes the vanishing gradient problem of the plain network.

Formally, the residual block is represented as:

y = F (x, {W_{i}}) + x

(1)

where x and y are the input and output vectors of the layers considered, respectively. The function

F (x, {W_{i}})

denotes the residual mapping to be learned. As shown in Figure 1,

F (x)

is defined as:

F (x) = W_{2} \otimes R (W_{1} \otimes x)

(2)

where

W_{1}

denotes the weight of the first layer,

W_{2}

denotes the weight of the second layer,

\otimes

denotes convolution,

R

denotes the activation function Relu, and

R (x) = \max (0, x)

.

2.2. Universal Image Quality Index

Wang et al. [23] presented a universal image quality index (UIQI), which is used to measure the structure distortion degree. It is composed of three factors: loss of correlation, luminance distortion, and contrast distortion.

The quality metrics is defined as:

U I Q I (F, R) = \frac{σ_{F R}}{σ_{F} σ_{R}} \cdot \frac{2 μ_{F} μ_{R}}{μ_{F}^{2} + μ_{R}^{2}} \cdot \frac{2 σ_{F} σ_{R}}{σ_{F}^{2} + σ_{R}^{2}}

(3)

where

σ_{F}

and

σ_{R}

denote the standard deviation of the fused image F and referenced image R, respectively.

σ_{F R}

denotes the covariance of the fused image F and referenced image R.

μ_{F}

and

μ_{R}

denote the mean of the fused image F and referenced image R, respectively. The optimum value of UIQI is 1.

3. Proposed Method

In this section, we put forward a new PCDRN for pansharpening, which exploits the two residual subnetworks called ResNet to extract accurate features and progressively inject the details of a PAN image into an MS image in a coarse-to-fine manner. A multitask loss function is proposed to prevent the over-smooth phenomenon from preserving spatial information. Furthermore, to address the problem of checkerboard artifacts in the fusion results, resize-convolution is adopted rather than transposed convolution in the upsampling process of LRMS images.

3.1. Flowchart of PCDRN

To inject additional spatial information of the PAN image into the LRMS image, we design the PCDRN for pansharpening. As shown in Figure 2, PCDRN consists of two residual subnetworks ResNet1 and ResNet2, which are progressively cascaded to learn nonlinear feature mapping from LRMS and PAN images to HRMS images.

In our experiments, the PCDRN was implemented through three stages.

Stage 1: The LRMS images are upsampled by a scale of 2 with nearest-neighbor interpolation, and the PAN images are downsampled by a scale of 2. The upsampled MS images are then concatenated with the downsampled PAN images to form the 5-band inputs.

Stage 2: The 5-band inputs are fed into ResNet1 to extract the coarse features. The ResNet includes 2 convolutional layers and 5 residual blocks. An element-wise sum is then performed on the feature maps of ResNet1 and the upsampled LRMS image in a channel by channel manner. Subsequently, 1 × 1 convolutional layers are utilized to reduce spectral dimensionality from 64 bands to 4 bands. The 4-band results are upsampled by a scale of 2 using nearest-neighbor interpolation and then concatenated with the PAN images to obtain new 5-band inputs.

Stage 3: The new 5-band inputs are entered into the ResNet2, which exhibits the same structure as that of the previous ResNet1, to extract finer features. After the LRMS image has been upsampled twice, an element-wise sum is performed on the results of ResNet2 and the upsampled LRMS image, channel by channel. The fused HRMS image is finally obtained by a 1 × 1 convolutional layer and an activation function tanh. In addition, 1 × 1 convolution was employed several times in the network to achieve the reduction of network parameters and boost the performance of PCDRN.

To validate the advantages of PCDRN, a performance comparison between the single ResNet and PCDRN is presented in Figure 3. We observe that the two indices obtained by PCDRN are significantly better than those obtained by the single ResNet.

3.2. Multitask Loss Function

A mean squared error (MSE) loss function is usually applied in deep learning-based pansharpening methods. However, the MSE loss function often loses high-frequency details, such as texture during the fusion process, which may result in poor perceptual quality and the over-smooth phenomenon. To address this problem, we design a novel multitask loss function comprising MSE loss and universal image quality index (UIQI) loss.

The UIQI is often used to measure the structure distortion degree in image quality evaluation. Based on the characteristics of UIQI, the UIQI loss in our network has been designed to preserve structure information. Therefore, our multitask loss function

L^{F u s i o n}

can improve the performance of the fusion network in preserving spatial details.

The

L^{F u s i o n}

is represented as:

L^{F u s i o n} = α \cdot L_{N M S E}^{F u s i o n} + (1 - α) \cdot L_{N U I Q I}^{F u s i o n}

(4)

where

L_{N M S E}^{F u s i o n}

is the normalized MSE loss,

L_{N U I Q I}^{F u s i o n}

is the normalized UIQI loss, and

α

denotes the weight coefficient.

The

L_{N M S E}^{F u s i o n}

and

L_{N U I Q I}^{F u s i o n}

are defined as:

L_{N M S E}^{F u s i o n} = β \cdot (\frac{1}{n} {\sum_{i = 1}^{n} ‖ F (x^{(i)}) - R^{(i)} ‖}^{2})

(5)

and

L_{N U I Q I}^{F u s i o n} = γ \cdot (1 - \frac{1}{n} \sum_{i = 1}^{n} U I Q I (F (x^{(i)}), R^{(i)}))

(6)

where

β

and

γ

denote the normalized coefficients,

n

denotes the number of train image groups,

x^{(i)}

denotes a low-resolution MS image,

F (x^{(i)})

denotes a fused MS image, and

R^{(i)}

denotes a reference MS image.

In order to roughly balance the contribution of the MSE and UIQI losses, a normalized algorithm is introduced as following Algorithm 1.

Algorithm 1. The Algorithm for Normalization method

Input: LRMS image (lrms), PAN image(pan),

the number of training epochs (max_train_epoch)

Output: the normalized coefficients

β

and

γ

Initialize:

1) c ← 0.00001

For i = 0 to max_train_epoch do

2) Input

l r m s (i)

and

p a n (i)

to compute

L_{M S E} (i)

by

L_{M S E}^{F u s i o n} = \frac{1}{n} {\sum_{i = 1}^{n} ‖ F (x^{(i)}) - R^{(i)} ‖}^{2}

and

L_{U I Q I} (i)

by

L_{U I Q I}^{F u s i o n} = 1 - \frac{1}{n} \sum_{i = 1}^{n} U I Q I (F (x^{(i)}), R^{(i)})

, respectively

If (converge)

3) Compute

\nabla L_{M S E} (i)

by

\nabla L_{M S E} (i) = | L_{M S E} (i) - L_{M S E} (i - 1) |

4) Compute

\nabla L_{U I Q I} (i)

by

\nabla L_{U I Q I} (i) = | L_{U I Q I} (i) - L_{U I Q I} (i - 1) |

5) Compute

{\bar{\nabla L}}_{M S E}

by averaging the

\nabla L_{M S E} (i)

6) Compute

{\bar{\nabla L}}_{U I Q I}

by averaging the

\nabla L_{U I Q I} (i)

Endif

Endfor

7) Compute the coefficient

β

by

\frac{c}{{\bar{\nabla L}}_{M S E}}

and the coefficient

γ

by

\frac{c}{{\bar{\nabla L}}_{U I Q I}}

Finally, the

L_{N M S E}^{F u s i o n}

and

L_{N U I Q I}^{F u s i o n}

are obtained by formula (5) and formula (6), respectively.

To illustrate the validity of the proposed loss function, we compare the performances of the MSE loss function and the MSE + UIQI loss function, as shown in Table 1. The results demonstrate that the PSNR value obtained by using the proposed MSE + UIQI loss function is higher than that obtained by only using the MSE loss function. Similarly, our network comprising the MSE + UIQI loss function also obtains UIQI value, which is higher than that obtained by using the MSE loss function.

3.3. Resize-Convolution

Unlike the traditional convolution, the transposed convolution forms the connectivity in the backward direction [25]. The transposed convolution is usually employed for upsampling images. In addition, different from the weights in a traditional filter, the weights in the transposed convolution are learnable without being predefined. However, transposed convolution can easily produce uneven overlap, which may result in checkerboard artifacts in the image. Therefore, the resize-convolution approach is employed to avoid the problem of uneven overlap, which has been known to be robust against checkerboard artifacts [26]. The approach involves resizing an image using nearest-neighbor interpolation and then executing a convolutional operation. To better illustrate this phenomenon, we carried out a group of experiments to compare the performance of the fused image that employs transposed convolution versus resize-convolution, which can be seen in Figure 4. As shown in Figure 4, a small region is enlarged and shown on the left bottom for better visualization. From the enlarged box in Figure 4a, we can observe the obvious checkerboard artifacts. From the enlarged box in Figure 4b, we can observe that the region is smoother than that in Figure 4a.

Table 2 shows the quantitative assessment of the results in Figure 4, in which the bold represents the best value. As we can observe from the table, the experimental results by using resize-convolution are better than those by using transposed convolution in most image quality indexes, including PSNR [24], the correlation coefficient (CC) [27], UIQI [23], the spectral angle mapper (SAM) [28], and the erreur relative global adimensionnelle de synthese (ERGAS) [29] except the Q2ⁿ [30] index.

To further demonstrate the superiority of the resize-convolution method, the polynomial (EXP) [31], transposed convolution, and resize-convolution methods were evaluated by performing experiments on 180 groups of the simulated dataset from Pléiades through a single ResNet. Figure 5 shows the average PSNR and UIQI of these three methods. From Figure 5, we can find that single ResNet with resize-convolution outperforms the other two methods on PSNR. In addition, it not only achieves higher UIQI values than the method with transposed convolution but also achieves similar UIQI values to the method with EXP. Thus, on the whole, the single ResNet with resize-convolution is superior to the methods with transposed convolution and EXP.

4. Experimental Results

4.1. Experimental Settings

4.1.1. Datasets

In the experiments, the MS images and PAN images were captured by the Pléiades and WorldView-3 satellites, respectively. The MS images of the Pléiades satellite contain four bands: red, green, blue, and near-infrared. We selected red, green, blue, and near-infrared1 bands from the MS image of the WorldView-3 satellite to comprise the new 4-band MS image. The spatial resolutions of Pleiades and WorldView-3 datasets are described in Table 3.

The source images were degraded to lower resolution by a factor of 4 using Wald’s protocol [32,33]. In the approach, the original MS images were degraded by using a low-pass filter matched with modulation transfer function (MTF) of the MS sensor, and the original PAN images were degraded by using the ‘bicubic’ method. The sizes of the original MS and PAN images are 256 × 256 and 1024 × 1024 pixels, respectively. In training, the sizes of the degraded MS and PAN images are 64 × 64 and 256 × 256 pixels, respectively. The original MS images are considered referenced images. To enhance the generalization of our network, data augmentation approaches in [34] were used in the training process. The 180 groups of simulated data from the Pléiades satellite and the 108 groups of simulated data from WorldView-3 satellite were used as test datasets, respectively. Each group of simulated data is composed of a 64 × 64 degraded MS image, a 256 × 256 degraded PAN image, and a 256 × 256 reference MS image. In addition, the 165 groups of real data from Pléiades satellite and 100 groups of real data from WorldView-3 satellite were used to assess the performance of PCDRN. It should be noted that the test datasets are not used to train PCDRN.

4.1.2. Training Details

The training of PCDRN is achieved in 200 epochs with a batch size of 9 for using adaptive moment estimation (Adam). The learning rate

ε

is initialized to 8 × 10⁻⁵, which is divided by a factor of 2 for every 50 epochs. The biases in the network are initialized to zero.

To further demonstrate the effectiveness of the multitask loss function, numerous experiments were performed using different

α

values on 180 sets of the simulated dataset from Pléiades. The average PSNR and UIQI of PCDRN using different α values on 180 groups of the simulated dataset from Pléiades are shown in Figure 6. As shown in Figure 6, we can observe that the best performances are obtained when

α

is set to 0.1 in formula (4). The PCDRN is implemented under TensorFlow 1.8 and TensorLayer 1.8. The experiments were performed on an NVIDIA GeForce GT 1080Ti GPU.

4.1.3. Compared Methods

In this paper, PCDRN is compared with one interpolated method and eleven popular pansharpening methods.

EXP: an interpolation method based on polynomial kernel [31];
AIHS: adaptive IHS [35];
ATWT: a Trous wavelet transform [9];
GSA: Gram Schmidt adaptive [36];
BT: Brovey transform [37];
MTF-GLP-CBD: generalized Laplacian Pyramid (GLP) [31] with MTF-matched filter [38] and regression based injection model [28];
MMMT: a matting model and multiscale transform [16];
GS: Gram Schmidt [6];
MTF-GLP-HPM: GLP [31] with MTF-matched filter [38] and multiplicative injection model [39];
ASIM: adaptive spectral-intensity modulation [40];
DRPNN: a deep ResNet for pansharpening [41];
MSDCNN: a multiscale and multidepth CNN for pansharpening [12].

The AIHS, GS, GSA, and BT are four typical CS-based methods. The ATWT, MTF-GLP-CBD, and MTF-GLP-HPM are three MRA-based methods. MMMT and ASIM are two model-based methods. Both DRPNN and MSDCNN are deep learning-based methods.

4.2. Experiments on Simulated Data

In this subsection, to demonstrate the validity of PCDRN, we compared the visual effects and quantitative assessments of simulated experimental results obtained from different methods. Six metrics were used to evaluate the performance of PCDRN on simulated data from the Pléiades and WorldView-3 datasets, which include PSNR [24], CC [27], UIQI [23], Q2ⁿ [30], SAM [28], and ERGAS [29].

To better observe the texture details in the fused images, there are two rectangle boxes marked in yellow, and the bigger box is the enlarged image of the smaller one. Note that the bold represents the best performance of quantitative assessment in experiments.

4.2.1. Experiments on Pléiades Dataset

A group of fusion results on simulated data from Pléiades are shown in Figure 7. Figure 7a is the reference image that was applied to assess the fused images. Figure 7b displays the degraded PAN. In Figure 7c, the upsampled image by the EXP has good spectral quality but exhibits serious spatial distortions. The corresponding fusion images obtained by twelve pansharpening methods are presented in Figure 7d–o, respectively. As shown in Figure 7, the fusion results of the BT and MTF_GLP_HPM methods suffer from serious spectral distortions. The fusion results of other compared methods have obvious spectral distortions, such as the roof of the building in the enlarged box. However, compared to all the comparison pansharpening methods, the fused result of PCDRN is closer to the reference image by observation. Furthermore, from the corresponding quantitative assessment of Figure 7 as shown in Table 4, PCDRN outperforms the other eleven pansharpening methods in six metrics.

Furthermore, we perform experiments on 180 sets of the simulated dataset from Pléiades. Table 5 tabulates the mean values of experimental results. As shown in Table 5, we can clearly observe that PCDRN produces the best quantitative assessment results of the most metrics among all pansharpening methods.

4.2.2. Experiments on WorldView-3 Dataset

Figure 8 shows an example of the simulated experiment performed on the WorldView-3 dataset. The reference and Pan images are shown in Figure 8a,b, respectively. The upsampled LRMS using EXP method is shown in Figure 8c, which is also blurred as can be seen from the enlarged box. The 12 fused results are given in Figure 8d–o. From Figure 8d–o, we can observe that the results of the BT, MMMT, and DRPNN methods have serious spectral distortions. The fused results of the ATWT, GSA, MTF_GLP_CBD, GS, MTF_GLP_HPM, and ASIM methods exhibit some spectral distortions on the roof of buildings in the enlarged box. The AIHS method yields obvious spectral distortion and some artifacts that do not exist in the original image. Although the results of MSDCNN method have good spatial quality, they show some spectral distortions in the enlarged box. In Figure 8, we can observe a red line between the white and red areas, which can be clearly observed in the results of PCDRN. However, in the results of other methods, the red line is very blurred or even nonexistent. We can also observe that PCDRN is superior to other methods in preserving the spectral fidelity in the enlarged box. For the corresponding quantitative evaluation of Figure 8 (see Table 6), PCDRN gains the best fusion values in most metrics.

Table 7 tabulates the average quantitative results on 108 groups of the simulated dataset from WorldView-3. As can be observed, PCDRN achieves the best fusion results in most metrics.

Therefore, according to the above experimental results on the simulated data from Pléiades and WorldView-3 satellites, we find that PCDRN is better than the other 11 pansharpening methods.

4.3. Experiments on Real Data

To demonstrate the effectiveness of our proposed method, we also compare the visual effects and quantitative evaluations of real data experimental results obtained from 12 pansharpening methods. The twelve pansharpening methods are evaluated on the real dataset from Pléiades and WorldView-3 in terms of quality with no reference (QNR) index [42]. The QNR metric is composed of the spectral distortion index D_λ and the spatial distortion index D_S. The optimum values of QNR, D_λ, and D_S are 1, 0, and 0 respectively.

Similar to what was demonstrated on simulated data, two yellow rectangle boxes are marked in each fused image to better observe the texture details. The bigger box is the enlarged image of the smaller one.

4.3.1. Experiments on Pléiades Dataset

A set of fusion results on real data from Pleiades are shown in Figure 9. Figure 9a is the upsampled LRMS using bicubic interpolation. Figure 9b displays the PAN image. Figure 9c is the upsampled LRMS using the EXP method. As shown in Figure 9c, the upsampled image by the EXP method has good spectral quality but exhibits serious spatial distortions. As we can see in Figure 9d–o, the result of AIHS method has some artifacts in the enlarged box. BT and MTF_GLP_HPM methods yield serious spectral distortions. The results obtained by ATWT, GSA, MTF_GLP_CBD, MMMT, GS, ASIM, and DRPNN methods have some spectral distortions because they are oversharpened in the process of pansharpening. The MSDCNN method produces slight spectral distortions as seen in the enlarged box. Table 8 also shows the quantitative assessment of the experimental results with real data in Figure 9. As it can be observed, PCDRN obtains the best values in QNR and D_λ metrics.

The average quantitative results on 165 groups of the real dataset from Pleiades are listed in Table 9. From Table 9, we can testify that PCDRN not only obtains the best results of QNR and D_λ but also obtains the second best results of D_S in all the comparison methods.

4.3.2. Experiments on WorldView-3 Dataset

Figure 10 displays a group of fusion results on real data from WorldView-3. The upsampled LRMS using bicubic interpolation is presented in Figure 10a, and Figure 10b is the PAN image. The upsampled LRMS using EXP method is given in Figure 10c, from which we can observe that the EXP method yields obvious spatial distortions. From Figure 10d–o, we can observe that the AIHS method yields some artifacts in the enlarged box. MTF_GLP_HPM, DRPNN and MSDCNN methods produce obvious spectral distortions. ATWT, GSA, BT, MTF_GLP_CBD, MMMT, GS, and ASIM methods yield some spectral distortions in the enlarged box. For the corresponding quantitative evaluation of Figure 10 (see Table 10), PCDRN gains the best fusion values in QNR and D_λ metrics.

Table 11 tabulates the average quantitative results on 100 groups of the real dataset from WorldView-3. As we can see in Table 11, PCDRN obtains the optimal fusion results in all metrics.

Thus, from the above-mentioned experimental results on real data from Pléiades and WorldView-3 satellites, we can conclude that PCDRN outperforms other fusion methods in balancing the spectral preservation and spatial enhancement.

5. Further Discussion

To further verify the performance of PCDRN, experiments were performed on 90 groups of the simulated images, which were selected from another scene of the Pléiades dataset. It should be noted that these images were not used to train the PCDRN. An example of the simulated experiment performed on the Pléiades dataset is shown in Figure 11. Figure 11a,b are the reference image and the degraded PAN image, respectively. The upsampled LRMS using EXP method is shown in Figure 11c, which is blurred because of not using the PAN image. Figure 11d–o displays the corresponding fused results. In Figure 11d–o, the results of AIHS, ATWT, GSA, BT, MTF_GLP_CBD, MMMT, GS, MTF_GLP_HPM, and ASIM methods not only suffer from serious spectral distortion but also exhibit some artifacts. DRPNN and MSDCNN methods produce some spectral distortion. As we can observe, the result of PCDRN is still closest to the reference image in 12 pansharpening methods by visual inspection.

Table 12 tabulates the average quantitative results on 90 groups of the simulated dataset from Pléiades. In Table 12, we can also find that PCDRN achieves the best values in 6 image quality indexes, which can again prove the proposed PCDRN is effective.

6. Conclusions

In this paper, we presented a new deep learning-based approach for pansharpening, i.e., PCDRN. Different from other pansharpening approaches, we interpolated LRMS images twice and extracted features from source images using two residual subnetworks to inject details in two sizes. To avoid the over-smooth phenomenon, we design a multitask loss function to train our network to achieve high-quality remote sensing image fusion. To eliminate checkerboard artifacts, a resize-convolution consisting of a nearest-neighbor interpolation and a convolution layer was employed instead of transposed convolution for upsampling. Compared with other pansharpening methods, the experimental results demonstrated that PCDRN exhibits the best performance. In the future, we intend to develop an adaptive method of the multitask loss function to preserve additional spatial information.

Author Contributions

Y.Y. and S.H. conceived and designed the study; Y.Y. and W.T. drafted the manuscript; W.T. and H.L. conducted the experiments; Y.Y. and S.H. performed manuscript review. All authors have read and agree to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (No. 61662026, No.61862030, and No.61562031), by the Natural Science Foundation of Jiangxi Province (No. 20182BCB22006, No. 20181BAB202010, No.20192ACB20002, and No.20192ACBL21008), and by the Project of the Education Department of Jiangxi Province (No. YC2018-B065).

Conflicts of Interest

The authors declare no conflict of interest.

References

Ghassemian, H. A review of remote sensing image fusion methods. Inf. Fusion 2016, 32, 75–89. [Google Scholar] [CrossRef]
Zhang, Y.; Hong, G. An IHS and wavelet integrated approach to improve pan-sharpening visual quality of natural colour IKONOS and QuickBird images. Inf. Fusion 2005, 6, 225–234. [Google Scholar] [CrossRef]
Tu, T.M.; Su, S.C.; Shyu, H.C.; Huang, P.S. A new look at IHS like image fusion methods. Inf. Fusion 2001, 2, 177–186. [Google Scholar] [CrossRef]
Vivone, G.; Alparone, L.; Chanussot, J.; Mura, M.D.; Garzelli, A.; Licciardi, G.A.; Restaino, R.; Wald, L. A Critical Comparison Among Pansharpening Algorithms. IEEE Trans. Geosci. Remote Sens. 2015, 53, 2565–2586. [Google Scholar] [CrossRef]
Wang, Z.; Ziou, D.; Armenakis, C.; Li, D.; Li, Q. A comparative analysis of image fusion methods. IEEE Trans. Geosci. Remote Sens. 2005, 43, 1391–1402. [Google Scholar] [CrossRef]
Laben, C.A.; Brower, B.V. Process for Enhancing the Spatial Resolution of Multispectral Imagery Using Pan-Sharpening. US Patent 6,011,875, 4 January 2000. [Google Scholar]
Loncan, L.; Almeida, L.B.; Bioucas-Dias, J.M.; Briottet, X.; Chanussot, J.; Dobigeon, N.; Fabre, S.; Liao, W.; Licciardi, G.A.; Simoes, M.; et al. Hyperspectral Pansharpening: A Review. IEEE Geosci. Remote Sens. Mag. 2015, 3, 27–46. [Google Scholar] [CrossRef]
Miao, Q.; Wang, B. Multi-sensor image fusion based on improved laplacian pyramid transform. Acta Opt. Sin. 2007, 27, 1605–1610. [Google Scholar]
Nunez, J.; Otazu, X.; Fors, O.; Prades, A.; Pala, V.; Arbiol, R. Multiresolution-based image fusion with additive wavelet decomposition. IEEE Trans. Geosci. Remote Sens. 1999, 37, 1204–1211. [Google Scholar] [CrossRef]
Li, S.; Kwok, J.T.; Wang, Y. Using the discrete wavelet frame transform to merge Landsat TM and SPOT panchromatic images. Inf. Fusion 2002, 3, 17–23. [Google Scholar] [CrossRef]
Yang, Y.; Tong, S.; Huang, S.; Lin, P. Multi-focus Image Fusion Based on NSCT and Focused Area Detection. IEEE Sens. J. 2015, 15, 2824–2838. [Google Scholar] [CrossRef]
Yuan, Q.; Wei, Y.; Meng, X.; Shen, H.; Zhang, L. A multiscale and multidepth convolutional neural network for remote sensing imagery pan-sharpening. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2018, 11, 978–989. [Google Scholar] [CrossRef]
Fasbender, D.; Radoux, J.; Bogaert, P. Bayesian data fusion for adaptable image pansharpening. IEEE Trans. Geosci. Remote Sens. 2008, 46, 1847–1857. [Google Scholar] [CrossRef]
Joshi, M.; Jalobeanu, A. MAP estimation for multiresolution fusion in remotely sensed images using an IGMRF prior model. IEEE Trans. Geosci. Remote Sens. 2010, 48, 1245–1255. [Google Scholar] [CrossRef]
Cheng, J.; Liu, H.; Liu, T.; Wang, F.; Li, H. Remote sensing image fusion via wavelet transform and sparse representation. ISPRS J. Photogramm. Remote Sens. 2015, 104, 158–173. [Google Scholar] [CrossRef]
Yang, Y.; Wan, W.; Huang, S.; Lin, P.; Que, Y. A Novel Pan-Sharpening Framework Based on Matting Model and Multiscale Transform. Remote Sens. 2017, 9, 391. [Google Scholar] [CrossRef]
Masi, G.; Cozzolino, D.; Verdoliva, L.; Scarpa, G. Pansharpening by convolutional neural networks. Remote Sens. 2016, 8, 594. [Google Scholar] [CrossRef]
Yang, J.; Fu, X.; Hu, Y.; Huang, Y.; Ding, X.; Paisley, J. PanNet: A deep network architecture for pan-sharpening. In Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
Wei, Y.; Yuan, Q. Deep residual learning for remote sensed imagery pansharpening. In Proceedings of the 2017 International Workshop on Remote Sensing with Intelligent Processing (RSIP), Shanghai, China, 18–21 May 2017. [Google Scholar]
Shao, Z.; Cai, J. Remote sensing image fusion with deep convolutional neural network. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2018, 11, 1656–1669. [Google Scholar] [CrossRef]
He, L.; Rao, Y.; Li, J.; Chanussot, J.; Plaza, A.; Zhu, J.; Li, B. Pansharpening via Detail Injection Based Convolutional Neural Networks. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2019, 12, 1188–1204. [Google Scholar] [CrossRef]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016. [Google Scholar]
Wang, Z.; Bovik, A.C. A universal image quality index. IEEE Signal Process. Lett. 2002, 9, 81–84. [Google Scholar] [CrossRef]
Nezhad, Z.H.; Karami, A.; Heylen, R.; Scheunders, P. Fusion of hyperspectral and multispectral images using spectral unmixing and sparsecoding. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2016, 9, 2377–2389. [Google Scholar] [CrossRef]
Dumoulin, V.; Visin, F. A Guide to Convolution Arithmetic for Deep Learning [Online]. Available online: https://arxiv.org/abs/1603.07285 (accessed on 5 February 2020).
Chong, N.; Wong, L.; See, J. GANmera: Reproducing aesthetically pleasing photographs using deep adversarial networks. In Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–20 June 2016. [Google Scholar]
Zhu, X.X.; Bamler, R. A sparse image fusion algorithm with application to pan-sharpening. IEEE Trans. Geosci. Remote Sens. 2013, 51, 2827–2836. [Google Scholar] [CrossRef]
Alparone, L.; Wald, L.; Chanussot, J.; Thomas, C.; Gamba, P.; Bruce, L.M. Comparison of pansharpening algorithms: Outcome of the 2006 GRSS data fusion contest. IEEE Trans. Geosci. Remote Sens. 2007, 45, 3012–3021. [Google Scholar] [CrossRef]
Wald, L. Quality of high resolution synthesised images: Is there a simple criterion? In Proceedings of the 3rd Conference Fusion Earth Data: Merging Point Measurement, Raster Maps and Remotely Sensed Images, Sophia Antipolis, France, 26–28 January 2000. [Google Scholar]
Garzelli, A.; Nencini, F. Hypercomplex quality assessment of multi-/hyper-spectral images. IEEE Geosci. Remote Sens. Lett. 2009, 6, 662–665. [Google Scholar] [CrossRef]
Aiazzi, B.; Alparone, L.; Baronti, S.; Garzelli, A. Context-driven fusion of high spatial and spectral resolution images based on oversampled multiresolution analysis. IEEE Trans. Geosci. Remote Sens. 2002, 40, 2300–2312. [Google Scholar] [CrossRef]
Wald, L.; Ranchin, T.; Mangolini, M. Fusion of satellite images of different spatial resolutions: Assessing the quality of resulting images. Photogramm. Eng. Remote Sens. 1997, 63, 691–699. [Google Scholar]
Selva, M.; Santurri, L.; Baronti, S. On the use of the expanded image in quality assessment of pansharpened images. IEEE Geosci. Remote Sens. Lett. 2018, 15, 320–324. [Google Scholar] [CrossRef]
Yang, Y.; Nie, Z.; Huang, S.; Lin, P.; Wu, J. Multilevel features convolutional neural network for multifocus image fusion. IEEE Trans. Comput. Imag. 2019, 5, 262–273. [Google Scholar] [CrossRef]
Rahmani, S.; Strait, M.; Merkurjev, D.; Moeller, M.; Wittman, T. An adaptive IHS pan-sharpening method. IEEE Geosci. Remote Sens. Lett. 2010, 7, 746–750. [Google Scholar] [CrossRef]
Aiazzi, B.; Baronti, S.; Selva, M. Improving component substitution pansharpening through multivariate regression of MS+Pan data. IEEE Trans. Geosci. Remote Sens. 2007, 45, 3230–3239. [Google Scholar] [CrossRef]
Gillespie, A.; Kahle, A.; Walker, R. Color enhancement of highly correlated images. II. Channel ratio and “Chromaticity” transformation techniques. Remote Sens. Environ. 1987, 22, 343–365. [Google Scholar] [CrossRef]
Aiazzi, B.; Alparone, L.; Baronti, S.; Garzelli, A.; Selva, M. MTF-tailored multiscale fusion of high-resolution MS and Pan imagery. Photogramm. Eng. Remote Sens. 2006, 72, 591–596. [Google Scholar] [CrossRef]
Aiazzi, B.; Alparone, L.; Baronti, S.; Garzelli, A.; Selva, M. An MTF-based spectral distortion minimizing model for pan-sharpening of very high resolution multispectral images of urban areas. In Proceedings of the 2nd GRSS/ISPRS Joint Workshop Remote Sens. Data Fusion URBAN Areas, Berlin, Germany, 22–23 May 2003. [Google Scholar]
Yang, Y.; Wu, L.; Huang, S.; Tang, Y.; Wan, W. Pansharpening for multiband images with adaptive spectral-intensity modulation. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2018, 11, 3196–3208. [Google Scholar] [CrossRef]
Wei, Y.; Yuan, Q.; Shen, H.; Zhang, L. Boosting the accuracy of multispectral image pansharpening by learning a deep residual network. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1795–1799. [Google Scholar] [CrossRef]
Alparone, L.; Aiazzi, B.; Baronti, S.; Garzelli, A.; Nencini, F.; Selva, M. Multispectral and panchromatic data fusion assessment without reference. Photogramm. Eng. Remote Sens. 2008, 74, 193–200. [Google Scholar] [CrossRef]

Figure 1. A residual block.

Figure 2. Flowchart of the proposed progressive cascade deep residual network (PCDRN).

Figure 3. Average peak signal-to-noise ratio (PSNR) [24] and universal image quality index (UIQI) [23] of the single ResNet and PCDRN on 180 groups of the simulated dataset from Pléiades.

Figure 4. The comparison between transposed convolution and resize-convolution from the WorldView-3 dataset. (a) Fused image using transposed convolution; and (b) fused image using resize-convolution.

Figure 5. Average PSNR and UIQI of 3 methods on 180 groups of the simulated dataset from Pléiades.

Figure 6. Average PSNR and UIQI of PCDRN using different α values on 180 groups of the simulated dataset from Pléiades.

Figure 7. Fusion results on simulated data from Pléiades. (a) Reference image; (b) PAN; (c) EXP; (d) AIHS; (e) ATWT; (f) GSA; (g) BT; (h) MTF_GLP_CBD; (i) MMMT; (j) GS; (k) MTF_GLP_HPM; (l) ASIM; (m) DRPNN; (n) MSDCNN; and (o) PCDRN.

Figure 8. Fusion results on simulated data from WorldView-3. (a) Reference image; (b) PAN; (c) EXP; (d) AIHS; (e) ATWT; (f) GSA; (g) BT; (h) MTF_GLP_CBD; (i) MMMT; (j) GS; (k) MTF_GLP_HPM; (l) ASIM; (m) DRPNN; (n) MSDCNN; and (o) PCDRN.

Figure 9. Fusion results on real data from Pleiades. (a) MS image; (b) PAN; (c) EXP; (d) AIHS; (e) ATWT; (f) GSA; (g) BT; (h) MTF_GLP_CBD; (i) MMMT; (j) GS; (k) MTF_GLP_HPM; (l) ASIM; (m) DRPNN; (n) MSDCNN; and (o) PCDRN.

Figure 10. Fusion results on real data from WorldView-3. (a) MS; (b) PAN; (c) EXP; (d) AIHS; (e) ATWT; (f) GSA; (g) BT; (h) MTF_GLP_CBD; (i) MMMT; (j) GS; (k) MTF_GLP_HPM; (l) ASIM; (m) DRPNN; (n) MSDCNN; and (o) PCDRN.

Figure 11. An example of the simulated experiment on Pléiades dataset. (a) Reference image; (b) PAN; (c) EXP; (d) AIHS; (e) ATWT; (f) GSA; (g) BT; (h) MTF_GLP_CBD; (i) MMMT; (j) GS; (k) MTF_GLP_HPM; (l) ASIM; (m) DRPNN; (n) MSDCNN; and (o) PCDRN.

Table 1. Average PSNR and UIQI of PCDRN using MSE and MSE+UIQI loss on 180 groups of the simulated dataset from Pléiades.

Methods	PSNR (↑)	UIQI (↑)
PCDRN using MSE loss	31.3716	0.9754
PCDRN using MSE+UIQI loss	31.5914	0.9767

Table 2. Quantitative assessment of the results in Figure 4.

Methods	PSNR (↑)	CC (↑)	UIQI (↑)	Q2ⁿ (↑)	SAM (↓)	ERGAS (↓)
Transposed convolution	26.3580	0.9705	0.9576	0.8460	5.0280	5.3059
Resize-convolution	26.6129	0.9760	0.9621	0.8412	4.4173	5.1850

Table 3. The spatial resolutions of Pleiades and WorldView-3 datasets.

	PAN	MS
Pleiades	0.5 m GSD (0.7 m GSD at nadir)	2 m GSD (2.8 m GSD at nadir)
WorldView-3	0.31 m GSD at nadir	1.24 m GSD at nadir

Table 4. Quantitative assessment of results in Figure 7.

Methods	PSNR (↑)	CC (↑)	UIQI (↑)	Q2ⁿ (↑)	SAM (↓)	ERGAS (↓)
EXP *	26.5538	0.9008	0.8900	0.8129	4.1067	5.0701
AIHS	27.4025	0.9231	0.9074	0.8491	4.5483	4.6003
ATWT	28.0016	0.9235	0.9279	0.8745	4.5283	4.2789
GSA	27.0666	0.9130	0.9265	0.8702	4.5274	4.7951
BT	20.6228	0.8936	0.8179	0.4855	4.3264	15.1748
MTF_GLP_CBD	27.2334	0.9129	0.9276	0.8712	4.4921	4.6934
MMMT	27.7634	0.9212	0.9225	0.8698	4.7816	4.3933
GS	26.7568	0.9046	0.8889	0.8366	4.5298	4.9587
MTF_GLP_HPM	22.3144	0.7899	0.8482	0.7274	4.7561	7.6598
ASIM	28.8910	0.9383	0.9448	0.9059	4.1985	3.8388
DRPNN	29.2947	0.9712	0.9500	0.9070	3.4037	3.9624
MSDCNN	28.0166	0.9608	0.9394	0.8660	3.4079	4.6946
PCDRN	30.2689	0.9846	0.9681	0.9171	2.9567	3.5617