A Distributed Fusion Framework of Multispectral and Panchromatic Images Based on Residual Network

Abstract: Remote sensing images have been widely applied in various industries; nevertheless, their resolution is relatively low. Panchromatic sharpening (pan-sharpening) is a research focus in the image fusion domain of remote sensing. Pan-sharpening generates high-resolution multispectral (HRMS) images by making full use of low-resolution multispectral (LRMS) images and panchromatic (PAN) images. Traditional pan-sharpening suffers from spectral distortion, ringing effects, and low resolution, and the convolutional neural network (CNN) is gradually being applied to pan-sharpening. Aiming at these problems, we propose a distributed fusion framework based on a residual CNN (RCNN), namely RDFNet, which realizes the data fusion of three channels. It can make the most of the spectral information and spatial information of LRMS and PAN images. The proposed fusion network employs a distributed fusion architecture to make the best of the fusion outcome of the previous step in the fusion channel, so that each subsequent fusion acquires much more spectral and spatial information. Moreover, two feature extraction channels are used to extract the features of MS and PAN images, respectively, using the residual module, and features of different scales are fed to the fusion channel. In this way, spectral distortion and spatial information loss are reduced. Comparative experiments employing data from four different satellites show that the proposed RDFNet has superior performance in improving spatial resolution and preserving spectral information, and has good robustness and generalization in improving fusion quality.


Introduction
For a long time, remote sensing images have been widely applied in various industries, such as agricultural yield prediction, plant disease and pest detection, disaster prediction, geological exploration, national defense, vegetation coverage and land use, environmental change detection, and so on [1,2]. However, due to the limitations of satellite sensor technology, it is impossible to obtain images with both high spatial resolution and high spectral resolution at the same time. Only PAN images with high spatial resolution and low spectral resolution, and MS images with low spatial resolution and high spectral resolution, can be obtained [3]. Nevertheless, a variety of fields need images with both high spatial resolution and high spectral resolution (HRHM), and even images with high temporal resolution.
HRHM images are obtained by exploiting the redundant and complementary information of high-spatial-resolution, low-spectral-resolution images and high-spectral-resolution, low-spatial-resolution (LRMS) images. At present, the major image processing technologies include image enhancement, super-resolution reconstruction, image fusion, and so on. One of the most widely used is image fusion, which generates an image of higher quality and richer information by making good use of the different spectral and spatial resolutions [28]. The online coupled dictionary learning method (OCDL) makes full use of the spatial information of the PAN image to reduce spectral distortion [29]. Sparse matrix decomposition [30] learns a spectral dictionary from LRMS images and then predicts HRHM images based on the learned spectral dictionary and high-spatial-resolution images. Although the spectral distortion is reduced, the spatial information of the high-spatial-resolution image cannot be fully utilized, so the spatial resolution of the fused image is inferior to that of the reference image. In addition, in order to make full use of the high-frequency information of PAN and intensity images, the low-rank decomposition method is used to decompose the high-frequency information into a low-rank component and a sparse component [31]. The two components are then fused by appropriate fusion rules, and the fused image is obtained by merging and inverse transformation. Although traditional methods keep improving, spectral distortion and loss of spatial detail remain common problems.
With the development of deep learning (DL) technology, more and more scholars use DL for remote sensing image fusion, and the CNN is the most-used model in fusion. The super-resolution convolutional neural network (SRCNN) [32][33][34] uses a fully convolutional network to build a nonlinear mapping from a low-resolution image to a high-resolution image. SRCNN is relatively shallow, with only three layers, and is relatively easy to implement; however, it can only be used for a single image. Based on SRCNN, pan-sharpening by convolutional neural network (PNN) [35][36][37] was proposed, which directly uses a relatively simple three-layer convolution to pan-sharpen, making the best of nonlinearity. Yet SRCNN and PNN may bring about overfitting. Then, residual learning [38] was introduced into the deep convolutional neural network, that is, residual-network-based panchromatic sharpening (DRPNN) [39]. DRPNN makes the most of the nonlinearity of the network to improve spatial resolution and retain spectral information; the RCNN uses the difference between the HRMS image and the LRMS image to pan-sharpen. In order to preserve spatial and spectral information, PanNet was proposed in [40], which adds the upsampled MS image to the output and trains the PanNet parameters in the high-pass filtering domain; PanNet is robust across various satellites. MSDCNN was proposed in [41] and consists of two branches to extract features: one is a fundamental three-layer CNN, the other a deeper multiscale feature extractor that employs skip connections. The use of multiscale features is more conducive to preserving spectral and spatial information, and skip connections are more conducive to the convergence of the network. However, MSDCNN extracts the features of MS and PAN images simultaneously. 3D-CNN [42] fuses MS images and hyperspectral (HS) images to generate high-resolution hyperspectral (HRHS) images; its idea is similar to that of a 2D-CNN, but in order to reduce the amount of computation, PCA is needed for dimensionality reduction of the HS images. The literature in [43] proposes a two-branch feature extraction network, namely RSIFNN, which extracts features from MS and PAN images, respectively, and then fuses the features; in order to preserve much more spectral information, RSIFNN introduces residual learning in the last layer. A-PNN was proposed in [44]; A-PNN can not only fuse registered images but also carry out unregistered-image and multisensor data fusion. A two-stream fusion network (TFNet) was proposed in [45] and comprises three parts: the first extracts features from MS and PAN images, respectively; the second concatenates the MS and PAN features and represents spectral and spatial information simultaneously; finally, the pan-sharpened image is reconstructed. Residual learning was further introduced to form ResTFNet, which gains better pan-sharpening performance than TFNet. The literature in [46] put forward PSGAN, the first to apply a generative adversarial network (GAN) to pan-sharpening, with the goal of generating more realistic images. PSGAN consists of a generator and a discriminator. The generator structure is very similar to TFNet, which can preserve more spectral information and spatial details, and the skip connections make the network train faster, though it remains prone to gradient vanishing. The discriminator employs the reference MS image as input to judge the pan-sharpening performance, making use of full convolution.
In order to solve the problems of shallow networks and detail loss, a residual encoder-decoder conditional generative adversarial network (RED-cGAN) was proposed in [47]. Different from PSGAN, the generator of RED-cGAN makes use of a residual encoder-decoder module to extract multiscale features and generate pan-sharpened images. The discriminator employs (pan-sharpened, PAN) and (reference, PAN) pairs as input to judge the pan-sharpened images; it is shown that RED-cGAN outperforms PSGAN. To deal with the issues that current CNNs need supervision and that spatial details are lost in the process of fusion, Pan-GAN [48] was proposed. Different from RED-cGAN, Pan-GAN consists of a generator and two discriminators: a spectral discriminator and a spatial discriminator. The generator is implemented based on super-resolution and skip connections, which is simpler and easier to train, and makes full use of complementary information. The spectral discriminator distinguishes pan-sharpened images from upsampled LRMS images in order to preserve spectral information, and the spatial discriminator distinguishes pan-sharpened images from PAN images in order to preserve spatial information. Pan-GAN is an unsupervised network and does not depend on ground truth during training. Although scholars have studied a variety of networks to improve the performance of pan-sharpening, most of them are one-channel or two-channel: either the network extracts the features of MS and PAN images simultaneously to fuse them, or it extracts the features first, respectively, and then concatenates them. Thus, the problems of spectral distortion and loss of spatial detail remain.
Although there are many CNN-based networks at present, the structure of CNN-based networks is changeable, leaving room to improve pan-sharpening performance. Consequently, in consideration of the limitations of the above methods, and motivated by the advantages of the distributed fusion structure and residual learning, we propose a novel distributed fusion framework based on RCNN, called RDFNet. RDFNet combines the characteristics of the distributed fusion structure, which can more effectively preserve spectral information and spatial details from MS and PAN images simultaneously. The main contributions are as follows:
• A new RDFNet pan-sharpening model with powerful robustness and improved generalization performance is proposed, motivated by the distributed framework and residual learning.
• A new three-branch pan-sharpening structure is proposed, two branches of which are used to extract MS and PAN image features, respectively. The most important is the third branch, realizing three-channel data fusion, which concatenates the outputs of the two feature branches and the previous layer's fusion result, layer by layer, yielding pan-sharpened images.
• A large number of experiments are carried out to verify the robustness and generalization of the proposed RDFNet, employing four different sensors and typical comparison methods, including traditional and DL methods.
The other parts of the paper are arranged as follows: In Section 2, we introduce the relevant theoretical background. In Section 3, we describe the composition of RDFNet in detail. In Section 4, we introduce the training and testing datasets used in the paper, and the evaluation metrics of experiments. We employ the mainstream methods to carry out comparative experiments at reduced and full resolution, respectively. We further analyze experimental results by subjective visual evaluation and objective metrics. In Section 5, we arrive at our conclusions.

Distributed Fusion Structure
The distributed fusion structure is a typical structure in track fusion, which takes two typical forms [49]: sensor-to-sensor track fusion and sensor-to-system track fusion [50]. The sensor-to-system track fusion structure is shown in Figure 1. During the process of generating the system track by track fusion, the track information of both sensor A and sensor B is applied. In the process of fusion, the known prior conditions are fully utilized to improve the accuracy of the fused track as much as possible [51].

Residual Network
He et al. [52] proposed a residual network consisting of a series of basic residual blocks. The residual network is very effective in solving the problems of gradient vanishing and gradient explosion, and can ensure good performance even as the depth of the network increases. The basic residual module is shown in Figure 2a and can be expressed as [53]

x_{l+1} = x_l + F(x_l, W_l).

The residual block consists of two parts: direct mapping and residual mapping. The left part of Figure 2a is the direct mapping; the right part of Figure 2a is the residual part F(x_l, W_l), which generally includes 2 or 3 convolutional layers. If the dimensions of input x_l and output x_{l+1} are different, it is necessary to use a 1 × 1 convolution operation to reduce or increase the dimension of the input, as shown in Figure 2b, which can be expressed as [53]

x_{l+1} = h(x_l) + F(x_l, W_l),

where h(x_l) = W'_l x_l is the skip connection part and W'_l is the 1 × 1 convolution kernel. If the network uses n residual modules, then the corresponding relationship between input x_l and output x_{l+n} can be expressed as [53]

x_{l+n} = x_l + \sum_{i=l}^{l+n-1} F(x_i, W_i).

The location of the activation function in the network also affects the performance of the residual network. He et al. [53] improved the residual network and proved that the residual module of the structure shown in Figure 2c has the best performance. This structure puts the batch normalization (BN) and ReLU activation function before the convolution operation; further, the activation function of the second layer moves from the addition operation to the residual part.
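To make the structure concrete, the following is a minimal sketch of the pre-activation residual module just described (BN and ReLU before each convolution, with a 1 × 1 convolution on the skip path when the dimensions change). The layer widths and function names are illustrative assumptions, not the paper's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def preact_residual_block(x, filters):
    # Pre-activation ordering from He et al. [53]: BN -> ReLU -> Conv.
    shortcut = x
    y = layers.BatchNormalization()(x)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    # h(x_l): identity skip, or a 1x1 convolution when the channel count differs.
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    return layers.Add()([shortcut, y])  # x_{l+1} = h(x_l) + F(x_l, W_l)
```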

Methods
Activated by the advantages of the distributed architecture and the residual module, we propose a new three-branch distributed fusion framework of MS and PAN images based on the residual module, RDFNet. The process of fusing LRMS and PAN images is roughly divided into four steps:
1. MS and PAN images fed into RDFNet need to be preprocessed. As researchers obtain remote sensing data at different processing levels, different preprocessing operations are needed, for example, radiometric correction (including radiometric calibration and atmospheric correction), registration, and so forth. For Landsat-8 and Landsat-7, the obtained data are at the L1T level, which has already undergone precise geometric correction and radiometric correction, so we only register the data following [54,55]. The GF-2 data are at the 2A level, which has undergone primary geometric correction and radiometric correction, so we carry out precise geometric correction using ENVI. The QuickBird data are a standard product, and we likewise carry out precise geometric correction using ENVI. The GF-2 and QuickBird data are registered with the same method as Landsat-8.

2. The LRMS and PAN images are fused to generate the HRMS images. In fact, there are no MS images with the same spatial resolution as the fused HRMS images. Consequently, according to Wald's protocol [56], the original-scale MS and PAN images are downsampled, denoted as DLMS and LPAN images, respectively; the specific process is shown in Figure 3 (a code sketch of this simulation is given after this list). The scaling factor is determined by the resolution ratio of the MS and PAN images. As the MS and PAN images fed into RDFNet must have the same size, it is necessary to interpolate the DLMS images to the size of the LPAN image. Therefore, the original MS images can be used as ground truth.

3. The DULMS and LPAN images are fed into RDFNet, and the original MS images are the expected output of RDFNet, as shown in Figure 4. The DULMS, LPAN, and MS images are randomly cropped into 64 × 64 subimages to form training samples. By adjusting the hyperparameters and structure of the network, and after sufficient training, the optimal network is obtained. As shown in Figures 5 and 6, the parameters of the well-trained network are then frozen, and the performance of the network is tested on reduced-resolution and full-resolution MS and PAN images, respectively.

4. Eventually, the pan-sharpened images at reduced resolution are evaluated subjectively and quantitatively against the original MS images, making use of the full-reference metrics mentioned in Section 4.2, as shown in Figure 5. Additionally, the full-resolution pan-sharpened images are evaluated subjectively and quantitatively, making use of the no-reference metrics mentioned in Section 4.2, as shown in Figure 6. Finally, the pan-sharpening performance of the proposed network is verified by analyzing the indicators and by subjective visual evaluation.
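As a concrete illustration of step 2, the following is a minimal sketch of the Wald-protocol pair simulation under stated assumptions (bicubic resampling per [58], a single image pair, channel-last arrays); the function and variable names are our own, not the paper's.

```python
import tensorflow as tf

def simulate_wald_pair(ms, pan, scale):
    """Build one reduced-resolution training pair following Wald's protocol [56].

    ms:  original MS image,  shape (H, W, B)
    pan: original PAN image, shape (scale*H, scale*W, 1)
    Returns (dulms, lpan, ms): the two network inputs and the ground truth.
    """
    h, w = ms.shape[0], ms.shape[1]
    # Degrade both images by the MS/PAN resolution ratio (bicubic, per [58]).
    dlms = tf.image.resize(ms, (h // scale, w // scale), method="bicubic")
    lpan = tf.image.resize(pan, (h, w), method="bicubic")
    # Interpolate the degraded MS back up so both inputs share the LPAN size.
    dulms = tf.image.resize(dlms, (h, w), method="bicubic")
    return dulms, lpan, ms  # the original MS serves as the reference
```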

Overall Structure
As stated above, motivated by the advantages of the distributed architecture and the residual module, RDFNet is a three-branch distributed fusion framework of MS and PAN images based on the residual module. In the pan-sharpening of remote sensing images, the information collected by the MS and PAN sensors can be used simultaneously: not only are the MS and PAN images at the current scale fused, but also the fused image of the previous scale. In this way, multiscale information is fully utilized to improve the accuracy of the generated HRMS image; the overall pan-sharpening framework of the proposed RDFNet is shown in Figure 7. From a mathematical point of view, the fusion process is as follows. The MS feature extraction branch can be defined as

MS_i = H_i(MS_{i-1}), i = 1, 2, 3, 4,

where MS_0 is the LRMS image and the MS input of the fusion network, MS_i is the representation of MS_0 after the ith residual module, and H_i denotes the residual module acting on MS_{i-1}. The MS_i represent MS features at different scales, that is, different levels of features of MS_0, where MS_1 is the lowest-level feature and MS_4 is the highest-level feature. Features of the spectral image at a different scale are fused at each layer, which makes the best of each scale of MS information; the method thus expresses more MS information so as to reduce spectral distortion. The PAN feature extraction branch can be defined as

PAN_i = G_i(PAN_{i-1}), i = 1, 2, 3, 4,

where PAN_0 is the high-spatial-resolution PAN image and the PAN input of the fusion network, PAN_i is the representation of PAN_0 after the ith residual module, and G_i denotes the residual module acting on PAN_{i-1}. The PAN_i denote PAN features at different scales, that is, different feature levels of PAN_0, where PAN_1 is the lowest-level feature and PAN_4 is the highest-level feature. Features of the PAN image at a different scale are fused at each layer, which makes the best of each scale of PAN information; as a result, the method expresses more spatial details so as to improve the spatial resolution of the fused image. The cross-scale fusion branch can be defined as

MSP_i = Fu_i(MS_i, PAN_i, MSP_{i-1}), i = 1, 2, 3, 4,

where MSP_1, MSP_2, MSP_3, and MSP_4 are the fusion results of the different levels, Fu_i denotes the fusion rule, and a final convolution layer produces the network output F_MSP, as detailed in the next subsection. Each MSP_i is the fusion result of the MS_i and PAN_i of the ith feature extraction layer and of the (i-1)th fusion layer's result MSP_{i-1}. In this way, cross-layer fusion is realized; it makes better use of the local information of the multisource images and reduces the information loss in the convolution process.
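The three branch equations compose as the schematic sketch below shows. It is a sketch only: in particular, the initial fusion input MSP_0 (here the concatenated MS_0 and PAN_0) is our assumption, since the extracted text does not define it, and `fmp5` stands for the final convolution layer described in the next subsection.

```python
import tensorflow as tf

def rdfnet_forward(ms0, pan0, H, G, Fu, fmp5):
    """Schematic cross-scale fusion pass; H, G, Fu are lists of the four
    residual/fusion modules, fmp5 the final convolution layer."""
    ms, pan = ms0, pan0
    msp = tf.concat([ms0, pan0], axis=-1)  # MSP_0 (our assumption)
    for i in range(4):
        ms = H[i](ms)              # MS_i  = H_i(MS_{i-1})
        pan = G[i](pan)            # PAN_i = G_i(PAN_{i-1})
        msp = Fu[i](ms, pan, msp)  # MSP_i = Fu_i(MS_i, PAN_i, MSP_{i-1})
    return fmp5(msp)               # FMP5 produces F_MSP
```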

Network Structure
The pan-sharpening model RDFNet proposed in this paper is composed of three branches; the RDFNet structure is shown in Figure 8. Two branches are used to extract the features of MS and PAN images, respectively, and the remaining branch fuses the features of the two branches with the fusion result of the previous step, layer by layer, until the last layer generates the pan-sharpened image. The first branch performs multiscale feature extraction of the MS images: four residual modules, REM1, REM2, REM3, and REM4, process the MS images to extract the multiscale features, as shown on the left of Figure 8. The third branch performs multiscale feature extraction of the PAN images: four residual modules, REP1, REP2, REP3, and REP4, process the PAN images to extract the multiscale features, as shown on the right of Figure 8. The second branch is used for fusion and is composed of the FMP1, FMP2, FMP3, FMP4, and FMP5 modules, where FMP5 is the last convolution layer. It fuses the two branches' multiscale MS features, PAN features, and the fusion result of the previous layer, as shown in the middle part of Figure 8. For the sake of preserving the spectral information of the MS images and the spatial details of the PAN images as much as possible, the network uses full convolution instead of pooling, since pooling discards information, which may cause spectral distortion as well as texture and detail loss. RDFNet is equivalent to a powerful fusion function, in which the LRMS image MS_0 and PAN image PAN_0 are the input and the HRMS image F_MSP is the output.
Each module of the MS feature extraction branch can be expressed as follows. REMi module:

MS_i = h(MS_{i-1}) + F(MS_{i-1}, W_{MS_{i-1}}),

where MS_{i-1} is the input of the REMi module; MS_i is the output of the REMi module; h(MS_{i-1}) = W'_{MS_{i-1}} * MS_{i-1} indicates the skip connection; * represents the convolution operation; and W'_{MS_{i-1}} is the skip-connection convolution kernel of the REMi module, the size of which is 1 × 1, with 32, 64, 128, and 256 kernels in the four modules, respectively. This operation is used to increase the dimension and transfer information across scales. F(MS_{i-1}, W_{MS_{i-1}}) represents the residual part, where W_{MS_{i-1}} is a 3 × 3 convolution kernel, with 32, 64, 128, and 256 kernels in the residual parts of the four modules, respectively. Then, the feature extraction branch of the MS images can be expressed as

MS_i = H_i(H_{i-1}(\cdots H_1(MS_0))),

where MS_i is the output of the ith residual module. It can be seen from the expression that cross-layer transmission of information is realized, which stems from the skip connection of the residual module. Each module of the PAN feature extraction branch can be expressed as follows. REPi module:

PAN_i = h(PAN_{i-1}) + F(PAN_{i-1}, W_{P_{i-1}}),

where PAN_{i-1} is the input of the REPi module; PAN_i is the output of the REPi module; and W'_{P_{i-1}} is the skip-connection convolution kernel of the REPi module, the size of which is 1 × 1, with 32, 64, 128, and 256 kernels in the four modules, respectively. This operation is used to increase the dimension and transfer information across scales. F(PAN_{i-1}, W_{P_{i-1}}) represents the residual part, where W_{P_{i-1}} is a 3 × 3 convolution kernel, with 32, 64, 128, and 256 kernels in the residual parts of the four modules, respectively. Then, the feature extraction branch of the PAN image can be expressed as

PAN_i = G_i(G_{i-1}(\cdots G_1(PAN_0))),

where PAN_i is the output of the ith residual module. The operation of each part of the fusion branch can be expressed as follows. FMPi module, given the 1 × 1 convolution and ReLU activation described below:

MSP_i = ReLU(W_{F_i} * [MS_i, PAN_i, MSP_{i-1}]),

where MSP_i is the result of the fusion module FMPi, [·] denotes channel concatenation, and MSP_5 = F_MSP is the fusion result of the FMP5 module and of the whole network. RDFNet can generate a powerful remote sensing image fusion model after sufficient training on samples. In order to improve the accuracy of RDFNet, the residual module uses the combination with higher accuracy: first, BN is performed; then, a nonlinear operation is carried out using the ReLU activation function; finally, the convolution operation is performed. Different from the conventional residual module, the last-layer ReLU activation function is moved from the addition operation to the residual part. In the fusion module, a 1 × 1 convolution layer is used to realize multichannel information fusion, and the ReLU activation function is used to increase nonlinearity and improve the ability of the fusion model. The fusion model of the whole RDFNet can be expressed as F_MSP = Fu(MS, PAN, W), where Fu is the fusion model.
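As one possible reading of the FMPi description above (channel concatenation, a 1 × 1 convolution, and ReLU), a minimal sketch follows; the filter count is an assumption mirroring the extraction branches, not a value stated in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def fmp_module(ms_i, pan_i, msp_prev, filters):
    # Concatenate the MS features, PAN features, and previous fusion result
    # along the channel axis, then mix them with a 1x1 convolution + ReLU.
    x = layers.Concatenate(axis=-1)([ms_i, pan_i, msp_prev])
    x = layers.Conv2D(filters, 1, padding="same")(x)
    return layers.ReLU()(x)
```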

Loss Function
Assuming that the input of the network is LM and the ideal fusion result (label) is HM, the training samples of the network can be expressed as {(LM^{(k)}, HM^{(k)})}_{k=1}^{N}, where N is the total number of training samples. The training process of the fusion network is to find the fusion model F_MSP = Fu(MS, PAN, W), where F_MSP is the prediction output, that is, the actual output of the fusion network. Training the fusion function is in essence a regression problem. The mean square error (MSE) L_F is chosen as the loss function of the network:

L_F = \frac{1}{m} \sum_{k=1}^{m} \| F_{MSP}^{(k)} - HM^{(k)} \|^2,

where m is the batch size, that is, the number of training samples used in each iteration.
In the process of training, the Adam optimizer [57] is used to minimize the loss function, that is, to find min L_F. During the optimization, the weights are updated as follows:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t,
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2,
\hat{m}_t = m_t / (1 - \beta_1^t), \hat{v}_t = v_t / (1 - \beta_2^t),
\theta_t = \theta_{t-1} - a \, \hat{m}_t / (\sqrt{\hat{v}_t} + \varepsilon),

where g_t is the gradient of L_F with respect to the weights \theta, m_t is the exponential moving average of the gradient, v_t is the exponential moving average of the squared gradient, \hat{m}_t and \hat{v}_t are the bias-corrected estimates, t - 1 represents the previous step, t represents the current step, and a is the learning rate, initially set to 10^{-3} and adjusted according to the number of iterations so that the learning rate in the later stage is not too low. The exponential decay rate of the first-order moment estimate \beta_1 is set to 0.9 and that of the second-order moment estimate \beta_2 is set to 0.999. \varepsilon is a very small value ensuring that the denominator is not 0, set to 10^{-8}.
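A minimal training-step sketch with the MSE loss and the Adam hyperparameters stated above; `model` is assumed to be the RDFNet network built from the modules described earlier, and the function names are illustrative.

```python
import tensorflow as tf

def make_train_step(model):
    # Adam configured with the hyperparameters stated above.
    optimizer = tf.keras.optimizers.Adam(
        learning_rate=1e-3, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
    mse = tf.keras.losses.MeanSquaredError()

    @tf.function
    def train_step(dulms, lpan, hm):
        with tf.GradientTape() as tape:
            f_msp = model([dulms, lpan], training=True)  # predicted HRMS
            loss = mse(hm, f_msp)                        # L_F over the batch
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    return train_step
```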

Study Area and Datasets
The datasets are divided into training datasets and testing datasets. The data of the Landsat-8 satellite are used as the training datasets. In order to verify the fusion performance of RDFNet, the data of the Landsat-8, Landsat-7, QuickBird, and GF-2 satellites are used as the testing datasets. The Landsat-8 and Landsat-7 data were downloaded from https://glovis.usgs.gov/.
Landsat-8 carries two sensors: the Operational Land Imager (OLI) and the Thermal Infrared Sensor (TIRS). OLI has nine bands, of which bands 1-7 and 9 have a spatial resolution of 30 m, and band 8 is the panchromatic band with a spatial resolution of 15 m. TIRS comprises bands 10 and 11. The band ranges are shown in Table 1. In this paper, bands 4, 3, and 2 are used as the R, G, and B channels, respectively. The Landsat-7 satellite carries the Enhanced Thematic Mapper Plus (ETM+) with a total of eight bands, among which the resolution of bands 1, 2, and 3 is 30 m, and the resolution of band 8 is 15 m. As shown in Table 1, we use multispectral bands 3, 2, and 1 as the R, G, and B channels, respectively.
The resolution of the PAN image of QuickBird satellite products is 0.61-0.72 m, and the spatial resolution of the MS image is 2.44-2.88 m. The resolution of the images we used is 0.7 m and 2.8 m, as shown in Table 1.
GF-2 satellite products have a total of five bands. The resolution of the PAN image is 1 m, and the spatial resolution of the MS image is 4 m. In this paper, multispectral bands 4, 3, and 2 are used as the R, G, and B channels, respectively, as shown in Table 1.

Training Datasets
We use two Landsat-8 image pairs in total; the first image and the left half of the second image are used as training data, and the right half of the second image is used as testing data. The sizes of the two MS images are 4584 × 4674 and 4566 × 4644, respectively, and the corresponding PAN image sizes are 9168 × 9348 and 9132 × 9288. One of the training datasets is shown in Figure 9 (for better typesetting, the MS and PAN images are shown at the same size even though they have different resolutions in actuality). The images were obtained by the Landsat-8 satellite sensors on 6 May 2020. The corresponding area is near the South Bay in Haikou City, Hainan Province. Figure 9a is the MS image with a resolution of 30 m and pixel size of 600 × 600; Figure 9b is the PAN image with a resolution of 15 m and pixel size of 1200 × 1200. As there are no MS images with a spatial resolution of 15 m in the actual collected data, in order to verify the fusion performance of the proposed network RDFNet, we follow Wald's criterion [56] and downsample the remote sensing images to obtain simulated images.
Using bicubic resampling [58], the PAN image with a spatial resolution of 15 m is downsampled to 30 m, and the MS images with a spatial resolution of 30 m are downsampled to 60 m. In this way, the downsampled 30 m PAN image and the 60 m MS images can be fused to obtain MS images with a spatial resolution of 30 m. The fused MS images obtained by RDFNet are compared with the 30 m MS images acquired in reality (the ideal output of the network), and the performance of the fusion network is thereby evaluated. As the inputs of the network need to have the same size, it is also necessary to upsample the 60 m MS images to the same size as the 30 m PAN image. As the preprocessed images differ in size from the network input, 64 × 64 subimages are randomly cropped from them as training datasets. We simulate 20,688 image pairs from the Landsat-8 images for training and 5172 image pairs for validating the fusion network.

Testing Datasets
In order to verify the generalization ability and fusion performance of the proposed RDFNet, data from four different satellites are used for the experiments. To better illustrate that the proposed network can fuse images of different sizes, the size of the testing data differs from that of the training data. Some testing datasets are shown in Figure 10 (for better typesetting, the MS and PAN images are shown at the same size even though they have different resolutions in actuality). The images cover four different regions from the different satellites.
The area 1 images were acquired by the Landsat-8 satellite sensors as part of the testing data. These testing data come from the right side of the second image in Section 4.1.1; there is no overlap between area 1 and the training datasets, as shown in Figure 10a,b. The remote sensing images were acquired on 15 June 2017 in an area near Bohai Bay in Cangzhou City, Hebei Province. Figure 10a shows the corresponding MS image with a spatial resolution of 30 m and pixel size of 600 × 600. Figure 10b shows the corresponding PAN image with a spatial resolution of 15 m and pixel size of 1200 × 1200. According to Wald's criterion, the 15 m PAN and 30 m MS images are downsampled by a factor of 2 to obtain 30 m PAN and 60 m MS simulation images, respectively. We simulated 55 testing image pairs for Landsat-8. The area 2 images are testing data obtained by the QuickBird satellite, located in the Inner Mongolia Autonomous Region, as shown in Figure 10c,d. Figure 10c is the MS image with a spatial resolution of 2.8 m and pixel size of 510 × 510. Figure 10d is the corresponding PAN image with a resolution of 0.7 m and pixel size of 2040 × 2040. According to Wald's criterion, the PAN and MS images are downsampled by a factor of 4 to obtain 2.8 m PAN and 11.2 m MS simulation images, respectively. We simulated 48 testing image pairs for QuickBird.
Area 3 comprises part of the image pairs of Haikou City, Hainan Province, near the South China Sea, acquired by the Landsat-7 satellite on 8 November 2000, as shown in Figure 10e,f. Figure 10e is the MS image with a spatial resolution of 30 m and pixel size of 600 × 600. Figure 10f is the corresponding PAN image with a spatial resolution of 15 m and pixel size of 1200 × 1200. According to Wald's criterion, the PAN and MS images are downsampled by a factor of 2 to obtain 30 m PAN and 60 m MS simulation images, respectively. We simulated 50 testing image pairs for Landsat-7.
Area 4 comprises partial images of Haikou City, Hainan Province, acquired by the GF-2 satellite sensors on 9 December 2016, as shown in Figure 10g,h. Figure 10g is the MS image with a spatial resolution of 4 m and pixel size of 785 × 822. Figure 10h is the corresponding PAN image with a resolution of 1 m and pixel size of 3140 × 3288. According to Wald's criterion, the PAN and MS images are downsampled by a factor of 4 to obtain 4 m PAN and 16 m MS simulation images, respectively. We simulated 45 testing image pairs for GF-2.

Fusion Quality Metrics
The final fusion results are evaluated by subjective visual perception and objective metrics. Subjective visual evaluation compares the fusion result, the reference image, and the original image with human vision, observing the clarity, color, outline, and some details, to judge whether the fusion effect is good or bad. Subjective visual evaluation varies from person to person and can only represent the judgment of certain people, which is somewhat one-sided. Reliable, quantitative, and objective quality metrics are therefore also needed to further analyze the fusion results. The following objective quality indicators, which can be divided into full-reference and no-reference indicators, are mainly used to evaluate the fusion results. The full-reference indexes we use include the correlation coefficient (CC) [56], root mean square error (RMSE) [59], structural similarity (SSIM) [60], spectral angle mapper (SAM) [61], Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS) [14], and the universal image quality index (UIQI) [62]. The no-reference indexes we use consist of D_λ, D_s, and QNR [63].
The CC [56] of images T and F is defined as

CC(T, F) = \frac{\sum_{i,j} (T_{i,j} - \mu_T)(F_{i,j} - \mu_F)}{\sqrt{\sum_{i,j} (T_{i,j} - \mu_T)^2} \sqrt{\sum_{i,j} (F_{i,j} - \mu_F)^2}},

where T is the reference image, F is the fused image, \mu_T represents the mean value of image T, and \mu_F represents the mean value of image F. CC represents the spatial similarity of images T and F, and its value lies in [-1, 1]. If and only if T = F, CC = 1; that is, the more similar the T and F images are, the closer the CC is to 1. Therefore, the closer the CC is to 1, the better the corresponding fusion effect.
RMSE [59] is defined as

RMSE(T, F) = \sqrt{\frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} (T_{i,j} - F_{i,j})^2},

where M × N is the image size. RMSE represents the degree of pixel difference between the fused image F and the reference image T, and it is an evaluation metric of spatial detail information. If and only if T = F, RMSE = 0, so the ideal value of RMSE is 0. The smaller the RMSE, the smaller the difference between the fused image and the reference image and the better the fusion effect.

SSIM [60] is expressed as

SSIM(T, F) = [l(T, F)]^{\alpha} [c(T, F)]^{\beta} [s(T, F)]^{\gamma},

with the luminance, contrast, and structure terms

l(T, F) = \frac{2\mu_T \mu_F + c_1}{\mu_T^2 + \mu_F^2 + c_1}, c(T, F) = \frac{2\sigma_T \sigma_F + c_2}{\sigma_T^2 + \sigma_F^2 + c_2}, s(T, F) = \frac{\sigma_{TF} + c_3}{\sigma_T \sigma_F + c_3}.

Here, \sigma_{TF} is the covariance of images T and F, and \sigma_T^2 and \sigma_F^2 are the variances of images T and F, respectively. For the sake of preventing the denominator from being 0, c_3 = c_2/2, c_1 = (k_1 L)^2, and c_2 = (k_2 L)^2 are constants, where L is the dynamic range of pixel values, k_1 = 0.01, and k_2 = 0.03. With \alpha = \beta = \gamma = 1, we obtain

SSIM(T, F) = \frac{(2\mu_T \mu_F + c_1)(2\sigma_{TF} + c_2)}{(\mu_T^2 + \mu_F^2 + c_1)(\sigma_T^2 + \sigma_F^2 + c_2)}.

In each calculation, we take a window from the image and slide the window continuously over it; finally, we take the average value as the global SSIM. For MS images, SSIM is calculated in each band and the average value is taken. SSIM is a number between 0 and 1: the larger the SSIM, the smaller the difference between the fused image and the reference image, that is, the better the image quality. When the two images are identical, SSIM = 1.
SAM [61] considers the spectrum of a pixel as a vector and measures the similarity of the spectra by calculating the angle between two vectors. The more similar the fused spectrum is to the reference spectrum, the better the corresponding fusion effect. It is defined as

SAM(T_V, F_V) = \arccos \left( \frac{\langle T_V, F_V \rangle}{\|T_V\| \, \|F_V\|} \right),

where T_V = {v_1, v_2, · · · , v_N} is the spectral vector of the reference image and F_V = {v'_1, v'_2, · · · , v'_N} is the spectral vector of the fused image, and the angles are averaged over all pixels. If and only if T_V = F_V, SAM = 0. The closer the SAM is to 0, the smaller the degree of spectral distortion and the better the fusion effect. ERGAS [14] is mainly used to measure the degree of spectral distortion, and is defined as

ERGAS = 100 \frac{T_P}{T_M} \sqrt{\frac{1}{B} \sum_{i=1}^{B} \left( \frac{RMSE(i)}{\mu_i} \right)^2},

where T_P represents the spatial resolution of the PAN image, T_M represents the spatial resolution of the MS image, B is the number of bands, RMSE(i) is the root mean square error between the ith band of the reference image and that of the pan-sharpened image, and \mu_i is the mean of the ith band of the reference image. The smaller the ERGAS value, the better the fusion effect.
UIQI [62] is used to estimate the similarity between images T and F, and is defined as

Q = \frac{\sigma_{TF}}{\sigma_T \sigma_F} \cdot \frac{2\mu_T \mu_F}{\mu_T^2 + \mu_F^2} \cdot \frac{2\sigma_T \sigma_F}{\sigma_T^2 + \sigma_F^2}.

If and only if T = F, Q = 1. Therefore, the more similar T and F are, the closer the Q value is to 1, and the better the spectral quality of the corresponding fused image.
The quality with no-reference (QNR) metric [63] is built on the basis of UIQI. For pan-sharpening evaluation, it does not need the ideal output as a reference. It includes two indexes: spectral distortion and spatial distortion.
The spectral distortion index D_\lambda is computed from the LRMS image and the fused image:

D_\lambda = \sqrt[p]{\frac{1}{B(B-1)} \sum_{l=1}^{B} \sum_{r=1, r \neq l}^{B} \left| Q(\hat{M}_l, \hat{M}_r) - Q(\hat{F}_l, \hat{F}_r) \right|^p},

where \hat{M}_l denotes the lth band of the LRMS image, \hat{F}_r denotes the rth band of the fused image, B is the number of bands, and p is a positive integer that amplifies the spectral difference (the default is 1).
The spatial distortion index D_s is computed from the LRMS image, the PAN image, and the fused image:

D_s = \sqrt[q]{\frac{1}{B} \sum_{l=1}^{B} \left| Q(\hat{F}_l, P) - Q(\hat{M}_l, \tilde{P}) \right|^q},

where P is the PAN image, \tilde{P} is the PAN image degraded to the LRMS resolution, and q is a positive integer (the default is 1). QNR is then defined as

QNR = (1 - D_\lambda)^{\alpha} (1 - D_s)^{\beta}.

The smaller the spectral and spatial distortion between the fused image, the LRMS image, and the PAN image, the larger the corresponding QNR value.
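For reference, the following are minimal NumPy sketches of two of the full-reference metrics above (SAM in radians, averaged over pixels, and ERGAS); the array shapes, function names, and the numerical guards are our assumptions.

```python
import numpy as np

def sam(t, f, eps=1e-12):
    # Spectral angle between corresponding pixel vectors, averaged (radians).
    # t, f: reference and fused images, shape (H, W, B).
    dot = np.sum(t * f, axis=-1)
    norms = np.linalg.norm(t, axis=-1) * np.linalg.norm(f, axis=-1) + eps
    return float(np.mean(np.arccos(np.clip(dot / norms, -1.0, 1.0))))

def ergas(t, f, ratio):
    # ratio = T_P / T_M, the PAN/MS spatial-resolution ratio from the text.
    band_rmse = np.sqrt(np.mean((t - f) ** 2, axis=(0, 1)))
    band_mean = np.mean(t, axis=(0, 1))
    return float(100.0 * ratio * np.sqrt(np.mean((band_rmse / band_mean) ** 2)))
```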

Implementation Details
The datasets are divided into a training dataset (Landsat-8) and testing datasets (Landsat-8, Landsat-7, QuickBird, and GF-2). According to Wald's protocol, the degraded LPAN and DULMS images of the Landsat-8 training dataset are randomly cropped into 64 × 64 subimages, which are augmented by rotation. Experiments on the RDFNet fusion model are performed in TensorFlow on a setup with an Intel Xeon CPU, an NVIDIA Tesla V100 PCIE GPU, and 16 GB RAM.
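A minimal sketch of the crop-and-rotate sample preparation just described, assuming channel-last tensors and spatially aligned image triples; the names and the choice of 90-degree rotations are illustrative assumptions.

```python
import tensorflow as tf

def random_patch(dulms, lpan, hm, size=64):
    # Stack the aligned images so one crop keeps the triple registered.
    b_ms, b_pan = dulms.shape[-1], lpan.shape[-1]
    stacked = tf.concat([dulms, lpan, hm], axis=-1)
    patch = tf.image.random_crop(stacked, [size, size, stacked.shape[-1]])
    # Rotation augmentation by a random multiple of 90 degrees.
    k = tf.random.uniform([], minval=0, maxval=4, dtype=tf.int32)
    patch = tf.image.rot90(patch, k)
    return (patch[..., :b_ms],
            patch[..., b_ms:b_ms + b_pan],
            patch[..., b_ms + b_pan:])
```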
The fusion results on the Landsat-8 training data are shown in Figure 11; the objective evaluation metrics of the fusion results are calculated, and the corresponding bar chart is shown in Figure 12. Figure 11a is the degraded MS image with a spatial resolution of 60 m (the LRMS image is shown at the same size as the PAN and fused images even though they have different resolutions in actuality). Figure 11b is the degraded PAN image with a spatial resolution of 30 m. Figure 11c is the MS image obtained by simply upsampling Figure 11a. Figure 11m is the MS image collected with a spatial resolution of 30 m, namely, the ground truth. Figure 11d-l are the fusion results of the comparison methods.

Through careful observation of the pan-sharpening results shown in Figures 11 and 12, all these methods can improve the spatial resolution, but to different degrees, and there is some spectral distortion. From Figure 12, we can see that the proposed RDFNet achieves the best performance on all indexes except SAM, where it is second only to SFIM. For CC and SSIM, the RDFNet method is the best, which shows that this method extracts more details. Although the SAM of RDFNet is a little larger than that of SFIM, it is smaller than that of any other method, and its ERGAS is the smallest among the comparison methods. The quality of pan-sharpening is further improved by building the network with three branches and residual modules, especially in reducing spectral distortion and preserving spatial details. Bicubic interpolation is represented by EXP [44,54,55], into which no details are injected. The numerical results show that the performance of IFCNN is the worst: as IFCNN is a general fusion model, it is not sensitive to the unique MS and PAN information of remote sensing images. We can clearly see that the performance of PNN is very different from that of DRPNN; the network using residual learning is obviously better than PNN in preserving both spectral information and spatial details. DRPNN and PanNet have similar performance. The performance of ResTFNet is better than that of DRPNN and PanNet, a success attributed to the feature extraction of MS and PAN images by two branches. Interestingly, the proposed RDFNet outperforms ResTFNet. This is because the proposed RDFNet, with its three-branch fusion structure, can make full use of the spectral information and spatial structure, so as to reduce the spectral distortion and the loss of spatial details.
From Figure 11, we can see that all the methods produce visually clearer pan-sharpened images than the LRMS image. The pan-sharpened image generated by the proposed method is very similar to the ground truth image in terms of vision; the spatial information is well preserved, and there is no noticeable ringing phenomenon or spectral distortion. The EXP image is very blurry. Brovey and GS can significantly improve the spatial resolution, but there are problems of oversharpening and spectral distortion. Although SFIM can suppress the oversharpening problem, it causes the fusion result to be blurred and the spectrum to be slightly distorted. The fusion result of IFCNN shows serious spectral distortion, and the image is vague. The spectral distortion of PNN is reduced, but there are problems of blurring and ringing effects. DRPNN improves the definition slightly over PNN. In fact, the residual-based pan-sharpening results of DRPNN, PanNet, ResTFNet, and RDFNet are all quite good; however, the RDFNet pan-sharpened image is closest to the ground truth. In more detail, the buildings in the red box of the PanNet result have the greatest color contrast, followed by ResTFNet and DRPNN, while RDFNet is the closest to the reference image. The RDFNet method ensures higher spatial resolution and, simultaneously, has the least spectral and spatial distortion, so the fusion performance of the network proposed in this paper is superior.
In order to better represent the difference between the pan-sharpened MS image and the ground-truth MS image, the average intensity difference maps and the average spectral difference maps [65] between them are given in Figure 13. A color map is used to represent the difference values of the comparison methods; the color bar is at the bottom of Figure 13, and the value increases gradually from left to right. The top row shows the average intensity difference maps of the whole image; the second row shows the average intensity difference maps of the enlarged area in Figure 11; the third row shows the average spectral difference maps of the whole image; and the bottom row shows the average spectral difference maps of the enlarged area in Figure 11. In the top row of Figure 13, it can be seen that the difference of the proposed method is the smallest, and the difference of Brovey (Figure 13a) is the greatest. In the second row of Figure 13, the difference shown by the proposed method is again the smallest, with great differences shown in Figure 13a. Figure 13c displays SFIM: although SFIM retains the spectral information, detail information is lost, so its spatial resolution is relatively low. It can be observed from Figure 13e-h (PNN, DRPNN, PanNet, and ResTFNet, respectively) that the spectral information of ResTFNet is well preserved and more details are extracted. However, the spectral distortion and spatial distortion of the proposed method, RDFNet (Figure 13i), are lower.

Figure 13. Average intensity difference maps and average spectral difference maps between a pan-sharpened MS image and the ground-truth MS image for the Landsat-8 training dataset with reduced resolution; the top row shows the average intensity difference maps of the whole image, the second row the average intensity difference maps of the enlarged area in Figure 11, the third row the average spectral difference maps of the whole image, and the bottom row the average spectral difference maps of the enlarged area in Figure 11.

In order to verify the performance of the proposed RDFNet, we carried out extensive experiments at reduced resolution and at full resolution, respectively, using MS and PAN images of four different satellites: Landsat-8, QuickBird, Landsat-7, and GF-2. The following presents the reduced-resolution fusion results.
The reduced-resolution experimental results on the Landsat-8 testing data are shown in Figure 14; the representation of each image is the same as in Figure 11. The average intensity difference maps and the average spectral difference maps are shown in Figure 15. It can be seen from Figure 14 that the proposed method is closest to the ground truth image, and Figure 15i also shows the smallest difference. Compared with the ground truth, the Brovey and GS fusion results (Figure 14d,e) are oversharpened and show spectral distortion; the corresponding images in Figure 15a,b show relatively large intensity and spectral differences. We also observe in Figure 14g that IFCNN apparently loses much spectral information, and its spectral difference and detail difference are relatively large in Figure 15d. In Figure 14f, although there is less spectral distortion in the SFIM fusion result, the spatial resolution improvement is lower, and the fused images look vague compared with the ground truth. In Figure 14h-l, comparing the fusion results of PNN, DRPNN, PanNet, ResTFNet, and RDFNet with the ground truth image, we can see that the spectral fidelity of these methods is fairly good. However, the difference color maps of Figure 15 show that RDFNet retains the best spectral and structural information, followed by ResTFNet.

Figure 15. Average intensity difference maps and average spectral difference maps between the pan-sharpened MS image and the ground-truth MS image for the Landsat-8 testing dataset with reduced resolution; the top row shows the average intensity difference maps of the whole image, the second row the average intensity difference maps of the enlarged area in Figure 14, the third row the average spectral difference maps of the whole image, and the bottom row the average spectral difference maps of the enlarged area in Figure 14.

The reduced-resolution experimental results on the QuickBird testing data are shown in Figure 16, and the average intensity difference maps and the average spectral difference maps are shown in Figure 17. QuickBird has only four bands, but its visible-light spectral range is very similar to that of Landsat-8. It can be seen from Figure 16 that the proposed method is closer to the reference image in terms of spectral information and clarity; Figure 17i also shows that the proposed RDFNet retains much more spectral and structural information. The spectral distortions of Brovey (Figure 17a), GS (Figure 17b), IFCNN (Figure 17d), and PNN (Figure 17e) are more serious, in decreasing order of severity. From Figure 17f-i, comparing the fusion results of DRPNN, PanNet, and ResTFNet with RDFNet, we observe that the spectral fidelity of RDFNet is fairly good and preserves more details.
The reduced-resolution experimental results on the Landsat-7 testing data are shown in Figure 18. The average intensity difference maps and the average spectral difference maps are shown in Figures 19 and 20. In Figure 19, the top row shows the average intensity difference maps of the whole image and the second row the average spectral difference maps of the whole image. In Figure 20, the top row shows the enlarged views of the yellow box in Figure 18, the middle row the intensity difference maps of the enlarged area, and the bottom row the spectral difference maps of the enlarged area. Obviously, it can be observed from Figures 18 and 19 that the proposed fusion model RDFNet is much better at both improving spatial resolution and retaining spectral information. Figure 20 shows an enlarged view of the river branch; from the details, we can clearly distinguish the differences between the comparison methods in retaining spectral and spatial information.

Figure 17. Average intensity difference maps and average spectral difference maps between the pan-sharpened MS image and the ground-truth MS image for the QuickBird testing dataset with reduced resolution; the top row shows the average intensity difference maps of the whole image, the second row the average intensity difference maps of the enlarged area in Figure 16, the third row the average spectral difference maps of the whole image, and the bottom row the average spectral difference maps of the enlarged area in Figure 16.

The reduced-resolution experimental results on the GF-2 testing data are shown in Figure 21, and the average intensity difference maps and the average spectral difference maps are shown in Figure 22. From Figures 21 and 22, the spectral distortion of IFCNN, Brovey, and GS is still serious; however, among these three methods, the structure of IFCNN is closest to the ground truth. In comparison, the spectral information of SFIM is better preserved, although its spatial information is a little oversharpened and there are artifacts compared with the proposed pan-sharpening model RDFNet. From Figure 21h-l, comparing the fusion results of PNN, DRPNN, PanNet, and ResTFNet with the ground truth image, we can see that the spectral fidelity of these methods is fairly good. However, the difference color maps in Figure 22 show that ResTFNet has less spectral distortion with a little sharpening. On the whole, the effect of the proposed model is better. In summary, compared with the aforementioned algorithms, the proposed RDFNet fusion result not only improves the spatial resolution, but also has less spectral distortion and almost no oversharpening.

Figure 20. Enlarged views of the yellow box in Figure 18 with the corresponding average intensity difference maps and average spectral difference maps; the top row shows the enlarged views of the yellow box in Figure 18, the middle row the intensity difference maps of the enlarged area, and the bottom row the spectral difference maps of the enlarged area.

The objective evaluation indexes of the fusion results on the testing datasets are calculated, and the corresponding bar charts are shown in Figures 23-26, respectively. From the value of each metric, the effect of the RDFNet proposed in this paper is the best. It can be seen from the bar chart in Figure 23 that the CC, SSIM, and UIQI indexes of the proposed method are the largest, while its RMSE, SAM, and ERGAS are the smallest. This shows that the proposed method retains more spectral information while preserving spatial structure information, which is more conducive to providing better HRMS images.
From the numerical indicators in Figure 23, Brovey and GS show little difference, which is consistent with the visual perception (Figure 14). However, the RMSE, SAM, and ERGAS values of IFCNN are relatively large, and its fusion effect is not good. The fusion results of DRPNN and PanNet have similar numerical indicators, which are better than those of PNN. Further, the numerical indexes of the ResTFNet fusion results are better than those of DRPNN and PanNet. In short, the numerical indexes of the proposed RDFNet method are optimal. The indexes of the other testing datasets are shown in Figures 24-26. Similarly, it can be seen from the bar charts (Figures 24-26) that each evaluation index of the proposed method is optimal; consequently, the fusion performance of the proposed RDFNet is the most outstanding. This shows that the proposed method can simultaneously improve the spatial resolution and better retain the spectral information. Although the margin over the existing methods is small, RDFNet improves the spatial resolution and retains more spectral information on top of them, which is of great significance to applications that require higher spatial resolution or higher spectral resolution, and has practical value. In order to comprehensively analyze the fusion performance of the network, Figure 27 shows the fusion time on the testing datasets at reduced resolution. In the future, we will combine high-frequency information extraction with deep networks to explore lighter fusion models.

Full-Resolution Datasets' Experimental Results
In this section, remote sensing data collected from Landsat-8, QuickBird, Landsat-7, and GF-2 are used for pan-sharpening. The visual comparison of the mainstream methods' fusion results is shown in Figures 28-31, where ULMS is the MS image upsampled from the original MS image. From Figures 28-31, we observe that the spatial resolution of all the fusion results is improved compared with ULMS. However, it can be seen from Figure 28 that the fusion results of Brovey, IFCNN, and PanNet have obvious spectral distortion. By observing the fusion results of the other methods in the two yellow regions of the ULMS image, we find that the result of RDFNet is closest to the spectrum of the ULMS image; in addition, ResTFNet is blurrier than RDFNet. From Figure 29, the resolution of all fusion results is improved. It can be seen that the SFIM, PNN, and DRPNN methods have problems of blurring and artifacts. Compared with ULMS, the Brovey, GS, IFCNN, and PanNet methods show obvious spectral distortion. Visually, compared with the luminance of the ULMS image, the fusion result of ResTFNet is darker, while the fusion result of RDFNet is more consistent with the ULMS image. In Figure 30, the resolution of the fusion results of all methods is higher than that of ULMS. The fusion results of IFCNN and PanNet show serious spectral distortion. The fusion results of Brovey, GS, SFIM, PNN, and DRPNN also show spectral distortion: the colors of Brovey, SFIM, PNN, and DRPNN are darker than that of ULMS, and the color of GS is lighter. The fusion result of the ResTFNet method is better than the above methods, but compared with it, the result of the proposed RDFNet method has higher resolution and preserves more spectral information. For Figure 31, compared with ULMS, the resolution of all fusion results is improved. However, Brovey, GS, and IFCNN all produce severe spectral distortion, and the SFIM, PNN, and PanNet fusion results are blurry and produce ringing artifacts. Relatively speaking, the fusion results generated by DRPNN, ResTFNet, and RDFNet are better; however, by careful observation, the fusion result of RDFNet is more accurate than those of ResTFNet and DRPNN. All in all, compared with these methods, the RDFNet proposed in this paper achieves higher spatial resolution with the least spatial and spectral distortion.

The objective evaluation indicators for the fusion results of Landsat-8, QuickBird, Landsat-7, and GF-2 at full resolution are shown in Tables 2 and 3, respectively; the best results are displayed in bold. As shown in Table 2, on the Landsat-8 testing dataset the QNR value of RDFNet is the best, followed by ResTFNet, DRPNN, PNN, and PanNet. Compared with DRPNN and PNN, the performance of PanNet is worse; combined with the reduced-resolution experimental results, the full-resolution fusion results of PanNet show overfitting. From the QNR values on the QuickBird testing dataset in Table 2, it can be observed that although the D_λ of RDFNet ranks third (with little difference from the other two values), its QNR value is the best. For PanNet, because the value of D_λ is relatively large, there is serious spectral distortion, so its QNR value is relatively small. For IFCNN, the values of D_λ and D_s are the largest, and the spectral and spatial distortion is serious. The differences among the QNR values of the other methods are small.
For the numerical indexes of the fusion results on the Landsat-7 testing dataset, RDFNet is the best in D_s and QNR, while its D_λ ranks third, with only 0.0018 and 0.0008 difference from first and second, respectively. On GF-2, the index values of RDFNet are the best. From the above analysis, the IFCNN method is not sensitive to remote sensing data and its fusion effect is poor, while PanNet shows a serious overfitting phenomenon on our datasets. From an overall point of view, the proposed RDFNet minimizes spectral distortion and spatial distortion, and preserves more spatial details and spectral information.

Conclusions
In this paper, we propose a distributed fusion framework based on a residual CNN (RCNN), namely RDFNet, which realizes the data fusion of three channels. It can make the most of the spectral information and spatial information of LRMS and PAN images. The proposed fusion network employs a distributed fusion architecture to make the best of the fusion outcome of the previous step in the fusion channel, so that each subsequent fusion acquires much more spectral and spatial information. Moreover, two feature extraction channels are used to extract the features of MS and PAN images, respectively, using the residual module, and features of different scales are fed to the fusion channel. In this way, spectral distortion and spatial information loss are reduced. We employ data from four different satellites, namely Landsat-8, Landsat-7, QuickBird, and GF-2, to evaluate the proposed RDFNet, carrying out comparative experiments at reduced resolution and full resolution, respectively. The experimental results demonstrate that the proposed RDFNet has superior performance in improving spatial resolution and preserving spectral information, and has good robustness and generalization in improving fusion quality.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available because they are still being used for further research in this field.