MDECNN: A Multiscale Perception Dense Encoding Convolutional Neural Network for Multispectral Pan-Sharpening

Abstract: With the rapid development of deep neural networks in the field of remote sensing image fusion, pan-sharpening methods based on convolutional neural networks have achieved remarkable results. However, because remote sensing images contain complex features, existing methods cannot fully extract spatial features while maintaining spectral quality, resulting in insufficient reconstruction capabilities. To produce high-quality pan-sharpened images, a multiscale perception dense coding convolutional neural network (MDECNN) is proposed. The network is based on dual-stream input. It uses multiscale blocks to separately extract the rich spatial information contained in panchromatic (PAN) images, uses feature enhancement blocks and dense coding structures to fully learn the feature mapping relationship, and applies a comprehensive loss as the training constraint. Spectral mapping is used to maintain spectral quality and obtain high-quality fused images. Experiments on different satellite datasets show that this method is superior to the existing methods in both subjective and objective evaluations.


Introduction
Since the 1960s, satellite technology has developed rapidly, and remote sensing technology has been widely used, for example, in environmental monitoring, geological exploration, map navigation, precision agriculture, and national defense security [1]. Remote sensing data are collected by satellite sensors with different imaging modes. The image information contained in these different data has both redundant parts and complementary parts in space. Remote sensing can provide the visible light images that we are familiar with as well as multispectral and hyperspectral images with more abundant spectral information. All objects on Earth emit or reflect electromagnetic waves and absorb energy. Because of the essential differences between objects, the electromagnetic characteristics they exhibit are also different. Remote sensing images capture these characteristics and provide a snapshot of different aspects of objects on the Earth's surface. Combining different vision technologies with remote sensing technologies makes it easier to accomplish high-level vision tasks.
Limited by different satellite sensors, remote sensing imaging technology can acquire only panchromatic (PAN) images with high spatial resolution and multispectral (MS) images with high spectral resolution. For example, although Earth observation satellites such as QuickBird, GeoEye, Ikonos, and WorldView-3 can capture both types of remote sensing images, their sensors cannot acquire MS images with high spatial resolution because of the trade-off between spectral and spatial resolution, which does not meet current research needs. This problem has led to the rapid development of multisource information fusion technology. Therefore, a large number of studies are currently devoted [...] structure has become more and more complex. Wei et al. [24] proposed a deep residual network (DRPNN) to extract more abundant image information. Yang et al. [25] proposed a deep network architecture for pan-sharpening (PanNet), a residual structure trained in the high-pass domain that builds on earlier deep learning networks and better retains spectral information by means of spectral mapping. By learning the high-frequency components of images, a correlation mapping relationship was obtained, and better fusion results were achieved.
PanNet also has certain limitations. First, PanNet performs feature extraction by directly superimposing PAN and MS images, resulting in the network's inability to fully utilize the different features of PAN and MS images and its insufficient utilization of different spatial information and spectral information. Second, PanNet only uses a simple residual structure, which cannot fully extract image features of different scales and lacks the ability to recover details. Finally, the network directly outputs the fusion result through a single-layer convolutional layer, failing to make full use of all the features extracted by the network, which affects the final fusion effect.
In response to the above problems, we propose a multiscale perception dense coding convolutional neural network (MDECNN) to improve the learning ability and reconstruction ability of the model. The vanishing gradient problem caused by a large number of network layers is avoided by means of skip connections. The different features of MS and PAN images are extracted by dual-stream network input. At the same time, multiscale blocks are designed to extract features from PAN images, which carry richer spatial information. Two different multiscale feature extraction blocks are used to enhance the features of the network, and then the spectral and spatial features of the image are reconstructed with complete detailed information through dense coding blocks. Finally, the fusion image reconstruction is completed through a three-layer super-resolution network.
The main contributions of this paper are as follows:
1. In view of the limitations of image spatial information acquisition, a single multiscale feature extraction block is used for feature extraction of PAN images with high spatial resolution, which enriches the spatial information extracted by the network.
2. We propose a feature extraction block composed of two multiscale blocks with different receptive fields to enhance the image details in the training network and reduce the loss of details in the network training process.
3. We design a dense coding structure block to reconstruct the spectral and spatial features of the image and improve the spectral quality and detail recovery capabilities of the fused image.
4. We propose a comprehensive spectral loss, adding spatial constraints on the basis of the common $L_2$ loss, reducing the loss of edge information during training, and enhancing the spatial quality of fused images.
The rest of this paper is arranged as follows: in Section 2, the background of image fusion and related work are introduced, and the CNN-based pan-sharpening approach is briefly reviewed; in Section 3, we describe our proposed multiscale dense network structure in detail; in Section 4, we present the experimental results and compare them with other methods; in Section 5, we discuss the structure of multiscale dense networks; and finally, in Section 6, we provide conclusions.

Traditionally Based Pan-Sharpening
Remote sensing image fusion combines multiple registered images of the same scene into a single image. The resulting composite image is easier to interpret and has a better visual effect than the remote sensing image obtained by a single sensor, which is more favorable for subsequent processing. In the process of image fusion, the following three conditions must be met: preserve as much of the relevant information as possible, eliminate irrelevant information and noise, and minimize distortion and inconsistency in the merged image.
Remote Sens. 2021, 13, 535
In the IHS fusion algorithm proposed by Xu, the three bands of the multispectral image are converted from the red, green, and blue color space to the IHS color space [26]. Intensity describes the luminance based on the amount of illumination, hue is the actual color, and saturation describes the purity of the color, measured as a percentage. The IHS fusion algorithm replaces the intensity component with the PAN image to sharpen the image and finally obtains the fused image through the inverse transformation. The GS method, proposed by Laben, is based on the general algorithm of Gram-Schmidt vector orthogonalization [27]. Each band corresponds to a high-dimensional vector, and the core of the algorithm is to make the non-orthogonal input vectors orthogonal by rotation. First, the MS bands are weighted to calculate a low-resolution PAN band. Then, each band vector is processed using GS orthogonalization. Finally, the low-resolution vector is replaced with the PAN image, and the fusion result is obtained by inverse transformation [28].
The High-Pass Filter (HPF) method, proposed by Gangkofner et al. [29], injects high-frequency components of the PAN image into the MS image, which can effectively reduce spectral distortion. HPF first calculates the spatial resolution ratio of the PAN and MS images. On the basis of this ratio, a high-pass convolution filter is built and applied to the high-resolution (HR) input. Then, the HPF image is added to each band, weighted according to the global standard deviation of the MS band, with the weight factors calculated according to the resolution ratio. Finally, linear stretching is applied to the fused image.
Although traditional algorithms achieve relatively good results, they make insufficient use of the features of the image itself, and their detail recovery is weak. Deep learning algorithms handle these problems better; therefore, deep learning methods are more commonly used in the field of remote sensing image fusion. The next section introduces remote sensing image fusion based on deep learning.

CNN-Based Pan-Sharpening
In recent years, the use of deep learning technology in the field of remote sensing image fusion has become increasingly extensive. Through the learning of image features, corresponding losses and mapping relationships, HRMS images can be reconstructed.
Recently, Yang et al. [25] proposed a pan-sharpening method for MS images with deep network structures. PanNet uses high-pass filtering to obtain high-frequency information of MS and PAN images as the input of the network, and then improves the spectral information of image fusion through spectral mapping. The experimental results show that the deep residual network structure through high-pass domain training and spectral mapping can make the image fusion algorithm show better results, and it also provides more ideas for the research of remote sensing image fusion. Later, to further improve the network quality, the Fu team [30] proposed a deep multiscale image sharpening method using the dilated convolution block to extract the information of different scales of the image, and then obtained better fusion results through the learning of the residual network. A large number of network structure examples verify that the depth of the network and the size of the receptive field have a significant influence on the quality of image fusion [31][32][33][34][35][36][37][38][39][40][41][42][43][44][45].
Deep learning is a parametric, training-based approach. As shown in Figure 1, we represent the PAN image as $g_{pan}$ (size: $(H \times scale) \times (W \times scale)$) and the MS image of $N$ bands as $g_{ms}$ (size: $H \times W \times N$). According to the Wald protocol [46], the MS and PAN images are up-sampled and down-sampled, respectively, to obtain the degraded images $\tilde{g}_{ms}$ ($H \times W \times N$) and $\tilde{g}_{pan}$ ($H \times W$), which form the input data $f(\tilde{g}_{ms}, \tilde{g}_{pan})$. Through deep network training, the prediction loss between the generated image $\hat{G} = T(f(\tilde{g}_{ms}, \tilde{g}_{pan}))$ and the reference image $g_{ms}$ is minimized, and the final model is obtained. Finally, in the testing phase, the trained model is applied to the real input data $f(g_{ms}, g_{pan})$ to generate HRMS images $\hat{G}$.
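The reduced-resolution training setup described above can be illustrated with a small sketch. It is not the authors' pipeline: it assumes block averaging for down-sampling and nearest-neighbour replication for up-sampling (the text does not name the filters), it assumes the MS image is first down-sampled and then up-sampled back to its original size, and the helper names `downsample`, `upsample`, and `make_training_pair` are hypothetical. Each band is a plain nested list.

```python
def downsample(band, scale):
    """Block-average a 2-D band (list of rows) by an integer factor (assumed filter)."""
    h, w = len(band) // scale, len(band[0]) // scale
    return [
        [
            sum(band[i * scale + di][j * scale + dj]
                for di in range(scale) for dj in range(scale)) / scale ** 2
            for j in range(w)
        ]
        for i in range(h)
    ]


def upsample(band, scale):
    """Nearest-neighbour up-sampling of a 2-D band by an integer factor (assumed filter)."""
    return [
        [band[i // scale][j // scale] for j in range(len(band[0]) * scale)]
        for i in range(len(band) * scale)
    ]


def make_training_pair(ms_bands, pan, scale):
    """Return (degraded MS, degraded PAN, reference MS) for reduced-resolution training."""
    ms_degraded = [upsample(downsample(b, scale), scale) for b in ms_bands]
    pan_degraded = downsample(pan, scale)
    # The original MS image serves as the reference (ground truth).
    return ms_degraded, pan_degraded, ms_bands


# Toy sizes: H = W = 4, N = 2 bands, scale = 2, so the PAN image is 8 x 8.
scale = 2
ms = [[[float(r * 4 + c + n) for c in range(4)] for r in range(4)] for n in range(2)]
pan = [[float(r + c) for c in range(8)] for r in range(8)]
ms_in, pan_in, reference = make_training_pair(ms, pan, scale)
```

In this toy case the degraded MS input and the degraded PAN input both have spatial size 4 x 4, matching the $H \times W$ sizes given in the text, while the untouched MS image plays the role of the reference.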

The Proposed Network
In this section, we introduce the proposed MDECNN, as shown in Figure 2. The dual-stream network extracts the remote sensing image information contained in the MS image (size: $H \times W \times N$) and the PAN image (size: $H \times W$), and the fused image $\hat{G}$ (size: $H \times W \times N$) is obtained after feature processing and image reconstruction by the fusion network. In this formulation, $H$ denotes the high-pass information, $\delta(\cdot)$ denotes the convolution operation, and $f_{mb}(\cdot)$ denotes the multiscale feature extraction. We then concatenate the resulting features $F_{MS}$ and $F_{PAN}$ to form the fusion features, where $\otimes$ refers to the concatenation operation. Next, the output $F_{em}$ is obtained through the feature enhancement module, and the output $F_{dc}$ is obtained through the dense coding structure, where $f_{em}(\cdot)$ denotes the feature enhancement operation and $f_{dc}(\cdot)$ denotes the dense coding operation. The final prediction $\hat{G}$ is then produced by the reconstruction stage.
We use deep neural networks to learn the mapping between the inputs (MS and PAN) and the output $\hat{G}$. We use $G$ to represent the reference target and a norm-based loss as the training objective to measure the magnitude of the error between $\hat{G}$ and $G$, where $\|\cdot\|_1$ denotes the matrix norm and $\|\cdot\|_2^2$ is the square of the Frobenius norm.
To obtain more spatial features, a multiscale feature extraction module was designed to extract the feature information of the PAN image. The feature maps extracted by the dual-stream network are stacked and fed into the trunk network, and the features of different receptive fields are obtained by using the parallel dilated convolution block. After the two-layer convolution, the features are enhanced through a skip connection with the feature enhancement module. Then, the features are sent to a dense coding structure, similar to the U-Net structure, for feature fusion and reconstruction. Finally, the fused image is obtained by enhancing the spectral information of the image through spectral mapping. The weight parameters of the whole network are learned from the nonlinear relationships in the simulated data and do not need to be set manually. The details of our proposed network architecture are described below.

Multiscale Feature Extraction Block
The depth and width of the network have a significant influence on the image fusion results. With a deeper network structure, the network can learn richer feature information and context-dependent mapping. However, as the network deepens, problems such as exploding gradients, vanishing gradients, and training difficulties often occur. To address these problems, He et al. [47] proposed a residual network structure, ResNet. By means of skip connections, the training process is optimized while the network depth is preserved. In terms of the width of the network, Szegedy et al. [48] proposed the inception structure, which fully expands the width of the network and enables the network to obtain more feature information.
Inspired by GoogLeNet, a multiscale feature extraction block was designed to extract the rich spatial features contained in PAN images. Figure 3 shows the multiscale blocks we designed for feature extraction of PAN images. Convolution kernels with sizes of 7 × 7, 5 × 5, 3 × 3, and 1 × 1 are used for feature extraction of PAN images after two convolution layers. The first three kernel sizes extract features with receptive fields of different sizes, while the 1 × 1 convolution reduces the dimensionality of the feature maps, integrates features across channels, and simplifies the model. Through the multiscale feature extraction block, we obtain the rich spatial information contained in PAN images.
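The idea of parallel kernels followed by a 1 × 1 convolution can be sketched on a single-channel toy image. This is only an illustration under stated assumptions: mean (box) filters stand in for learned kernels, zero padding keeps every branch at the input size, the 1 × 1 convolution is placed after the three parallel branches (the exact wiring is not given in the text), and the names `conv2d_same`, `pointwise_conv`, and `multiscale_block` are hypothetical.

```python
def conv2d_same(image, k):
    """Apply a k x k mean filter with zero padding so the output keeps the input size."""
    h, w = len(image), len(image[0])
    pad = (k - 1) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in range(-pad, pad + 1):
                for dj in range(-pad, pad + 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:  # zero padding outside the image
                        total += image[y][x]
            out[i][j] = total / (k * k)
    return out


def pointwise_conv(channels, weights):
    """1 x 1 convolution: mix the channels pixel by pixel, reducing them to one map."""
    h, w = len(channels[0]), len(channels[0][0])
    return [
        [sum(wt * ch[i][j] for wt, ch in zip(weights, channels)) for j in range(w)]
        for i in range(h)
    ]


def multiscale_block(image, kernel_sizes=(7, 5, 3), weights=(1 / 3, 1 / 3, 1 / 3)):
    """Run parallel branches with different receptive fields, then fuse them."""
    branches = [conv2d_same(image, k) for k in kernel_sizes]
    return branches, pointwise_conv(branches, weights)


pan_patch = [[1.0] * 9 for _ in range(9)]
branches, fused = multiscale_block(pan_patch)
```

Because every branch keeps the 9 × 9 spatial size, the branch outputs can be stacked as channels, and the 1 × 1 step then reduces three channels to one without touching the spatial layout.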

Feature Enhancement (FE) Block
The feature enhancement module is shown in Figure 4. Remote sensing images contain many large-scale objects, such as buildings, vegetation, mountains, and water, as well as relatively small-scale target objects such as vehicles, ships, and roads. A traditional convolutional neural network uses a convolution kernel of fixed size, and its receptive field is relatively small, so the context information of the image is not sufficiently learned. To solve this problem, the feature enhancement block is proposed in this paper. As shown in Figure 4, we select three receptive fields of 5 × 5, 3 × 3, and 1 × 1 and do not use an activation function in the first convolutional layer, so as to retain image information. After the first convolutional layer, we use a 3 × 3 convolutional layer to enlarge the receptive field and obtain more contextual information.
To enhance each feature detail in the remote sensing image, a dilated convolution block, as shown in Figure 5, is designed in the trunk network to extract the multiscale details of the image, and then the features extracted by a skip connection and parallel feature extraction block are stacked to achieve the effect of feature enhancement.
We follow the experimental setting of Fu et al. [30] and set the dilation rates of the dilated convolution block to 1, 2, 3, and 4. The receptive field of a dilated convolution kernel is d × (k − 1) + 1, where d represents the dilation rate and k represents the size of the convolution kernel. In the parameter setting, the standard convolution kernel and the dilated convolution kernel have the same size.
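The receptive field formula d × (k − 1) + 1 is easy to check numerically. The sketch below assumes a kernel size of k = 3 purely for illustration (the kernel size is not stated at this point in the text), and the helper names are hypothetical. It also lists which positions a dilated kernel actually samples along one axis.

```python
def dilated_receptive_field(k, d):
    """Extent covered by a k x k kernel with dilation rate d: d * (k - 1) + 1."""
    return d * (k - 1) + 1


def tap_offsets(k, d):
    """1-D positions, relative to the centre, sampled by a dilated kernel."""
    half = (k - 1) // 2
    return [d * t for t in range(-half, half + 1)]


KERNEL_SIZE = 3  # assumed for illustration only
fields = {d: dilated_receptive_field(KERNEL_SIZE, d) for d in (1, 2, 3, 4)}
offsets = {d: tap_offsets(KERNEL_SIZE, d) for d in (1, 2, 3, 4)}
# fields  -> {1: 3, 2: 5, 3: 7, 4: 9}
# offsets -> {1: [-1, 0, 1], 2: [-2, 0, 2], 3: [-3, 0, 3], 4: [-4, 0, 4]}
```

Under this assumption, the four dilation rates cover extents of 3, 5, 7, and 9 pixels with the same number of weights, while the offsets show that larger rates skip the pixels between the sampled positions.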

Although dilated convolution can increase the receptive field of the convolution kernel by increasing the dilation rate, it suffers from the "gridding" problem [49]. Remote sensing images contain a large number of buildings, vegetation, vehicles, and other objects. These feature-rich objects tend to cluster in the same area, so there is a strong similarity in their spectral and spatial structure. Therefore, the use of dilated convolution will lead to the loss of local information from remote sensing images.
To solve this problem, the feature enhancement method mentioned above is used: multiscale features are extracted through parallel feature extraction blocks and are fused with the features extracted by the dilated convolution blocks in the trunk network, which improves the robustness of feature extraction in various complex remote sensing images. In addition, this feature extraction method allows us to extract more complete spectral and spatial information and to reduce the feature loss in the fusion process.

Dense Coding (DC) Structure
Considering the abundance of features in remote sensing images, a common network structure cannot fully extract deep image features and easily loses information during convolution. Therefore, we designed a dense coding block to fully extract the deep features of the image and to avoid the loss of useful information across layers that occurs in common encoding networks. As shown in Figure 6, in the dense coding network, the feature maps obtained at each layer are concatenated with the input of the next layer, and the information of the intermediate layers is retained to the greatest extent by adding channels. At the same time, the feature reuse of dense connections does not introduce redundant parameters and does not increase the computational cost. A densely connected block architecture has three advantages: (1) it retains as much information as possible; (2) it improves the information flow and gradient flow in the network, making the network easier to train; and (3) dense connections have a regularizing effect, which reduces overfitting [50].
In the decoding stage, to avoid the information loss caused by a sharp drop in the number of channels, a U-Net-like decoding structure that mirrors the encoder is used. At each step, it reduces the number of channels to match the corresponding encoder layer, which facilitates the full extraction and fusion of features. Its structure is shown in Figure 7. In the encoding and decoding process, the convolution kernel size is set to 3 × 3 and the number of channels is 64, so the number of channels is reduced by 64 at each layer of the decoding process. Through the proposed dense coding structure, the deep features of remote sensing images can be fully extracted, and the feature details can be fully recovered in the subsequent image reconstruction process. As the network deepens, spectral information is often seriously lost. Inspired by PanNet, the spectral mapping method is used in the final image reconstruction part of the network to enhance the spectral details of the image and ensure the spectral quality of the fused image.
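The channel bookkeeping of such a dense encoder and its mirrored decoder can be sketched as follows. The 64 channels per layer come from the text; the number of encoder layers (three) and the input width (64) are assumptions made only for this example, and the function names are hypothetical.

```python
GROWTH = 64  # channels produced by each 3 x 3 convolution, as stated in the text


def dense_encoder_widths(in_channels, num_layers, growth=GROWTH):
    """Input width of each densely connected layer, plus the final concatenated width."""
    # With concatenation, layer i sees the block input plus all i earlier outputs.
    inputs = [in_channels + i * growth for i in range(num_layers)]
    return inputs, in_channels + num_layers * growth


def decoder_widths(start_channels, step=GROWTH, floor=GROWTH):
    """Channel counts when each decoding layer removes `step` channels."""
    widths = [start_channels]
    while widths[-1] - step >= floor:
        widths.append(widths[-1] - step)
    return widths


# Assumed configuration: 64 input channels and three dense encoding layers.
encoder_inputs, encoded = dense_encoder_widths(in_channels=64, num_layers=3)
decoded = decoder_widths(encoded)
# encoder_inputs -> [64, 128, 192], encoded -> 256, decoded -> [256, 192, 128, 64]
```

In this configuration the decoder steps back through the same widths the encoder built up, which is the gradual reduction the paragraph describes as an alternative to collapsing the channels in one layer.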

Loss Function
In addition to the network architecture, the loss function is another important factor affecting the quality of the fused image. Training optimizes the parameters by minimizing the loss between the reconstructed image and the corresponding ground-truth HR image.
Thus, we are given a training set $\{X^{(i)}\}_{i=1}^{N}$, where $N$ is the number of samples in a mini-batch and $i$ indexes the $i$th image. We select the Adam optimizer to carry out backpropagation and iteratively optimize all the parameters of the network.
Training a sharpening network with the $L_2$ loss smooths the image: it penalizes large outliers but is less sensitive to small ones, so the learning process slows significantly as the output approaches the target. To improve on this, small outliers must also be handled so that image edge information is retained. The $L_1$ norm does this better, and the smooth $L_1$ loss better still.
Using the $L_2$ norm alone makes the image too smooth and loses edge information. Using the $L_1$ norm alone leads to insufficient training convergence and serious spectral noise. To address this, we designed a mixed loss function that combines the $L_2$ loss and the smooth $L_1$ loss: the loss of spectral information is constrained by the $L_2$ loss, and the smooth $L_1$ loss is used as the spatial loss constraint, with the coefficient $\lambda$ balancing the two terms. Through experience, the value of $\lambda$ is set to 0.3.
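A minimal sketch of such a mixed loss is given below. It uses the commonly used definition of the smooth $L_1$ loss (quadratic for absolute errors below 1, linear otherwise), which is assumed to match the one intended here. The way the two terms are combined, $L_2 + \lambda \cdot$ smooth $L_1$, is also an assumption, since only the value $\lambda = 0.3$ is stated; both terms are averaged over the same flattened pixel values, and the function names are hypothetical.

```python
def l2_loss(pred, target):
    """Mean squared error over all elements."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def smooth_l1_loss(pred, target):
    """Common smooth L1 form: 0.5 * x^2 if |x| < 1, otherwise |x| - 0.5."""
    total = 0.0
    for p, t in zip(pred, target):
        x = abs(p - t)
        total += 0.5 * x * x if x < 1.0 else x - 0.5
    return total / len(pred)


def mixed_loss(pred, target, lam=0.3):
    """Assumed combination: L2 term plus lam times the smooth L1 term."""
    return l2_loss(pred, target) + lam * smooth_l1_loss(pred, target)


pred = [0.0, 0.5, 3.0]
target = [0.0, 0.0, 0.0]
loss_value = mixed_loss(pred, target)
```

For the small error (0.5), the smooth $L_1$ term behaves quadratically, while for the large error (3.0) it grows only linearly, so it is less dominated by outliers than the $L_2$ term.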

Experimental Analysis
In this section, we will demonstrate the superiority of the proposed method through experimental results on multiple datasets. By comparing and evaluating the training and test results of the models with different network parameters, the best model is selected for the experiment. Finally, the visual and objective indicators of our best model are compared with several other existing methods to prove the superior performance of the proposed method.

Dataset and Model Training
To evaluate the performance of the proposed multiscale perception dense coding network, we conducted model training and testing on datasets collected by three different satellite sensors: GeoEye-1, QuickBird, and WorldView-3. The band number and spatial resolution of the different satellite sensors are shown in Table 1. For convenience of training, the input images of each dataset are uniformly cut into 64 × 64 image patches, and the training batch size is 4. The training set is used for network training, while the test set is used to evaluate network performance. The spatial resolution of the training set and the test set is shown in Table 2. The maximum number of training iterations is set to 350,000. For the Adam optimizer, we set the learning rate to 0.001 and the exponential decay factor to 0.9. We set the weight decay to $10^{-6}$. We use the proposed comprehensive loss as the loss function to minimize the prediction error of the model, and the total training time is approximately 26 h 54 min.
The network is implemented in the TensorFlow deep learning framework and trained on an NVIDIA Tesla V100-SXM2-32GB, and the results are presented with ENVI Classic 5.3.
To facilitate visual observation, the red, green, and blue bands of the multispectral image are used to form RGB color images. However, all bands of the image are included in the calculation of the objective indicators.
In real application scenes of remote sensing images, HRMS images are often lacking. Therefore, in the comparison algorithm, we use the following two kinds of experiments for comparison: one is the simulation experiment with HRMS images as a reference, and the other is the real experiment without HRMS images. The evaluation criteria of the reference images are as follows: the spectral angle mapper (SAM), the relative average spectral error (RASE), the root mean squared error (RMSE), the universal image quality index (QAVE), the relative dimensionless global error in synthesis (ERGAS), the correlation coefficient (CC), and the structural similarity (SSIM). The other assessments are based on the quality with no reference index (QNR) and the spectral and spatial components (D λ and D S ). Figure 8 shows a set of fusion results on WorldView-3 satellite data; the data are 8-band data. Figure 8a,b show the HRMS and PAN images with resolution, respectively.   Figure 8 shows that seven methods of non-deep learning are accompanied by relatively obvious spectral deviation. Among these methods, DWT and SIRF exhibit obvious spectral distortion, while the edge details of the image are blurred. The IHS fusion image shows partial detail loss in some spectral distortion areas and fuzzy artefacts in road vehicle areas. The HPF, GS, GLP, and PRACS methods show good performance in the overall spatial structure, but they are distorted and blurred in both spectrum and detail. For the fusion method of deep learning, the image texture information performs well, but in terms of spectral information, the fusion method of PSGan shows obvious changes in partial regional spectra, while other differences are not obvious. To further distinguish the image quality, we use the objective evaluation index mentioned before for further comparison. The results are shown in Table 3.  Figure 8 shows that seven methods of non-deep learning are accompanied by relatively obvious spectral deviation. 
Table 3. Quantitative assessment of the WorldView-3 dataset shown in Figure 8. The best performance is shown in bold.
As shown in Table 3, in terms of the reference indexes on the WorldView-3 dataset, the deep learning pan-sharpening methods are clearly better than the non-deep learning fusion methods. Among the non-deep learning methods, GLP is superior in overall effect, and the spectral information of the fusion results obtained by HPF and GLP is better preserved than that obtained by the others. GLP and PRACS preserve spatial information more completely than the other non-deep learning methods. The results obtained by the PRACS, HPF, and GLP methods show no significant difference in image quality. For the deep learning pan-sharpening methods, the effectiveness of the network structure directly affects the fusion effect. The method proposed in this paper is clearly superior to the existing fusion methods in Table 3, which demonstrates its effectiveness.
In Figure 9, the non-deep learning methods show obvious spectral distortion.
As Figure 9c-i show, the traditional fusion methods all exhibit spectral distortion over the whole image to some degree. Among these methods, DWT, IHS, and SIRF present the most severe spectral distortion. GLP and GS present obvious edge blurring in the spectrally distorted areas, and the PRACS method presents artefacts at the image edges. The deep learning methods have good fidelity in both spectral and spatial information, and the result of our proposed method is the most similar to the original image in both respects. Table 4 objectively analyses each method in terms of index values.
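Four of the reference-based indexes named earlier (SAM, RMSE, ERGAS, and CC) have compact standard definitions. The following is a minimal NumPy sketch of those definitions, not the evaluation code used for the tables. It assumes images stored as (H, W, B) arrays and a PAN/MS resolution ratio of 4; the function names and the `ratio` parameter are illustrative.

```python
import numpy as np

def _pixels(img):
    """Flatten an (H, W, B) image to an (H*W, B) float64 matrix of spectra."""
    arr = np.asarray(img, dtype=np.float64)
    return arr.reshape(-1, arr.shape[-1])

def sam_degrees(ref, fused, eps=1e-12):
    """Mean spectral angle, in degrees, between reference and fused spectra."""
    r, f = _pixels(ref), _pixels(fused)
    denom = np.maximum(np.linalg.norm(r, axis=1) * np.linalg.norm(f, axis=1), eps)
    cos = np.clip(np.sum(r * f, axis=1) / denom, -1.0, 1.0)
    return float(np.degrees(np.mean(np.arccos(cos))))

def rmse(ref, fused):
    """Root mean squared error over all pixels and bands."""
    return float(np.sqrt(np.mean((_pixels(ref) - _pixels(fused)) ** 2)))

def ergas(ref, fused, ratio=4):
    """ERGAS = (100 / ratio) * sqrt(mean over bands of MSE_b / mean_b^2).

    `ratio` is the assumed MS-to-PAN pixel-size ratio.
    """
    r, f = _pixels(ref), _pixels(fused)
    band_mse = np.mean((r - f) ** 2, axis=0)
    band_mean = np.mean(r, axis=0)
    return float(100.0 / ratio * np.sqrt(np.mean(band_mse / band_mean ** 2)))

def cc(ref, fused):
    """Pearson correlation coefficient, averaged over bands.

    Assumes no band is constant (otherwise the denominator is zero).
    """
    r, f = _pixels(ref), _pixels(fused)
    rc, fc = r - r.mean(axis=0), f - f.mean(axis=0)
    num = np.sum(rc * fc, axis=0)
    den = np.sqrt(np.sum(rc ** 2, axis=0) * np.sum(fc ** 2, axis=0))
    return float(np.mean(num / den))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hrms = rng.random((16, 16, 4)) + 0.5                   # stand-in reference image
    fused = hrms + 0.01 * rng.standard_normal(hrms.shape)  # stand-in fusion result
    print(sam_degrees(hrms, fused), rmse(hrms, fused), ergas(hrms, fused), cc(hrms, fused))
```

Lower SAM, RMSE, and ERGAS values and a CC closer to 1 indicate a fused image closer to the reference. SAM is insensitive to a uniform brightness scaling of the spectra, which is why it is usually read together with an error-magnitude index such as RMSE or ERGAS.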

Experiment with QuickBird Dataset
As shown in Table 4, the QuickBird experimental assessment results show that the deep learning pan-sharpening methods perform significantly better than the traditional methods on the 4-band dataset. On these data, HPF achieves the best overall performance among the traditional methods. Although the HPF and GLP methods are not significantly different in the other indicators, the HPF method is clearly superior to the GLP method in maintaining spectral information. PanNet and PSGan perform well among the deep learning methods, but the method proposed in this paper is the best among all the compared methods on all the indicators.
Table 4. Quantitative assessment of the QuickBird dataset shown in Figure 9. The best performance is shown in bold.

Experiment with GeoEye-1 Dataset
In this section, experiments were performed using a 4-band dataset from GeoEye-1, and the image size is 256 × 256. Figure 10 shows the experimental results of a set of images. Obvious spectral distortion occurs in the DWT, GS, IHS, and SIRF results, and blurring or loss of edge details occurs in all seven traditional methods. The PRACS method retains good spectral information, but the spatial structure is too smooth, the edge information is severely lost, and there are many artefacts. Compared with GLP and HPF methods, the overall effect is better. Among the deep learning methods, the PSGan method exhibits spectral distortion in local areas, and the overall effect of deep learning is better than that of the traditional methods. The image from our proposed method is the closest to the original image. The index values in Table 5 give an objective comparison of the methods.
Combined with Table 5, the experimental evaluation indexes of GeoEye-1, QuickBird, and WorldView-3 are roughly the same, which supports the robustness of the proposed network structure. The numerical values in the above experimental results clearly support the proposed solution, indicating that it achieves a significant performance improvement on the same satellite or different satellites, and on 8-band or 4-band datasets.
Table 5. Quantitative assessment of the GeoEye-1 dataset shown in Figure 10. The best performance is shown in bold.
Figure 11 shows the pan-sharpening results on real GeoEye-1 data, for which no reference images are available. Figure 11a,b show the MS and PAN images, respectively. Figure 11c-l show the fusion results of DWT, GLP, GS, HPF, IHS, PRACS, SIRF, PanNet, PSGan, and the proposed method. In the fusion images, DWT, IHS, and SIRF all show obvious spectral distortion, and the edge information of SIRF appears fuzzy. Although the overall spatial structure information is well preserved in the GS and GLP methods, local information is lost. The merged image in the PRACS method is too smooth, resulting in severe loss of edge details.
PanNet, PSGan, and our proposed method have the best overall performance, but spectral distortion appears in some regions of the PSGan result. Table 6 shows that the fusion method proposed by us is the most effective on the real dataset without reference images.
Table 6. Quantitative assessment of the GeoEye-1 real dataset shown in Figure 11. The best performance is shown in bold.
Figure 12 shows the convergence of the L2 loss function and of the loss function proposed in this paper on the training set. Table 7 shows the objective evaluation indexes of the fused images obtained with the different loss functions. Both the network structure proposed in this paper and the PanNet network structure are used to test the convergence of the loss functions. Figure 12a shows the convergence of the proposed loss function on MDECNN, and Figure 12b shows its convergence on the PanNet network structure. The convergence positions in Figure 12a,b indicate that a network trained with the new loss function converges faster and reaches a better final value than one trained with the L2 loss function. At the same time, Table 7 shows that the fused images obtained with the new loss function are more in line with expectations. Comparing the convergence curves in Figure 12a,b also shows that the error fluctuation in Figure 12a is small, indicating that our network structure is more stable and converges better than PanNet. Combined, the results in Figure 12 and Table 7 show that the proposed comprehensive loss function is clearly superior to the general spectral loss L2.
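The no-reference indexes reported for the real GeoEye-1 data (D_λ, D_S, and QNR) follow a standard definition: QNR = (1 − D_λ)^α (1 − D_S)^β, where D_λ compares inter-band similarity before and after fusion and D_S compares each band's similarity to the PAN image at the two resolutions. The sketch below illustrates that definition under loudly simplifying assumptions: the Q index is computed globally rather than over sliding windows, the low-resolution PAN image is obtained by plain block averaging rather than a sensor-matched filter, and the exponents and the resolution ratio default to 1 and 4. All function names are illustrative.

```python
import numpy as np

def uiqi(x, y):
    """Global universal image quality index Q of two single-band images.

    Assumes neither image is constant (otherwise the denominator is zero).
    """
    x = np.asarray(x, dtype=np.float64).ravel()
    y = np.asarray(y, dtype=np.float64).ravel()
    mx, my = x.mean(), y.mean()
    cov = np.mean((x - mx) * (y - my))
    return 4.0 * cov * mx * my / ((x.var() + y.var()) * (mx ** 2 + my ** 2))

def block_mean(img, ratio):
    """Downsample a 2-D image by averaging non-overlapping ratio x ratio blocks."""
    h, w = img.shape[0] // ratio, img.shape[1] // ratio
    return img[:h * ratio, :w * ratio].reshape(h, ratio, w, ratio).mean(axis=(1, 3))

def d_lambda(fused, ms, p=1):
    """Spectral distortion: change in inter-band Q between MS and fused images."""
    n = fused.shape[-1]
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                q_fused = uiqi(fused[..., i], fused[..., j])
                q_ms = uiqi(ms[..., i], ms[..., j])
                total += abs(q_fused - q_ms) ** p
    return (total / (n * (n - 1))) ** (1.0 / p)

def d_s(fused, ms, pan, ratio=4, q=1):
    """Spatial distortion: change in band-to-PAN Q between the two resolutions."""
    pan_lr = block_mean(pan, ratio)  # assumed degradation model
    n = fused.shape[-1]
    total = sum(
        abs(uiqi(fused[..., i], pan) - uiqi(ms[..., i], pan_lr)) ** q
        for i in range(n)
    )
    return (total / n) ** (1.0 / q)

def qnr(fused, ms, pan, ratio=4, alpha=1, beta=1):
    """Quality with no reference: 1 is ideal, lower values mean more distortion."""
    return (1.0 - d_lambda(fused, ms)) ** alpha * (1.0 - d_s(fused, ms, pan, ratio)) ** beta
```

Here `fused` and `pan` are at the PAN resolution and `ms` is the original low-resolution multispectral image. Because D_λ and D_S are built from absolute differences, both are non-negative, so lower D_λ and D_S and a QNR closer to 1 indicate better fusion.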

Remote Sens. 2021, 13, 535

Ablation Study
We provide an ablation study to explore the impact of each part of our model as follows.
Influence of the multiscale spatial information extraction module. This paper focuses on the extraction of rich spatial information and proposes the multiscale spatial information extraction module to independently extract rich spatial information from PAN images. To verify the effectiveness of the proposed module and the influence of different receptive field parameters on the fusion results, several convolution blocks with different receptive field sizes are cascaded to form a multiscale feature extraction module. We compare multiscale blocks built from different scale combinations to test their effect. Specifically, we select the best multiscale block by using convolution kernel combinations with different receptive fields, where the convolution kernel size K = {1, 3, 5, 7}. These convolution kernels of different sizes are combined in different ways to obtain the multiscale blocks required by the experiment. To avoid the "meshwork" (gridding) problem caused by dilated convolution, we use ordinary convolution kernels with different receptive fields to extract feature maps. To make a fair comparison, we adjust the different multiscale blocks so that their parameter numbers are close to each other. The experimental results are shown in Table 8. The quantitative evaluation results show that the feature information obtained by using a richer set of receptive fields is more expressive. As shown in Table 8, our proposed combination is superior to the other receptive field sets at different orders of magnitude. Therefore, to balance performance and computing speed, we use four multiscale perception modules with different receptive fields, namely, 1, 3, 5, and 7.
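The fairness constraint above (different kernel combinations with similar parameter counts) can be made concrete with a short calculation. The sketch below is illustrative only: it assumes each kernel size forms one convolution applied to a single-band PAN input with the same number of output channels per kernel size, and that the width is the quantity being adjusted. The paper does not state the channel widths, so the value 16 is hypothetical.

```python
def conv_params(kernel, c_in, c_out, bias=True):
    """Weights (plus optional biases) of one kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)

def block_params(kernel_sizes, c_in, width):
    """Total parameters of a block with one convolution per kernel size."""
    return sum(conv_params(k, c_in, width) for k in kernel_sizes)

def match_width(kernel_sizes, c_in, budget):
    """Largest per-convolution width that keeps the block within `budget`."""
    width = 0
    while block_params(kernel_sizes, c_in, width + 1) <= budget:
        width += 1
    return width

if __name__ == "__main__":
    # Hypothetical reference block: K = {1, 3, 5, 7}, PAN input, 16 channels each.
    budget = block_params((1, 3, 5, 7), c_in=1, width=16)
    print("reference parameters:", budget)  # 1408
    for kernels in [(1, 3), (3, 5), (5, 7)]:
        width = match_width(kernels, c_in=1, budget=budget)
        print(kernels, "width", width, "parameters", block_params(kernels, 1, width))
```

With fewer or smaller kernels, a block needs more channels to reach the same budget, so a comparison at matched parameter counts isolates the effect of the receptive field set from the effect of model size.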
Influence of the feature enhancement module. Inspired by the Inception module, we propose the structure of the feature enhancement module. To validate its impact, we compare networks with fewer and with more of these modules: we experiment on the trunk network without the feature enhancement module and on the double-branch network with two feature enhancement modules cascaded. Fusion results are obtained and compared. The quantitative results are shown in Table 9. As seen from the quantitative evaluation results, using a feature enhancement module to broaden the width of the network enables the network to extract richer feature information and learn mapping relationships more in line with expectations. Without the feature enhancement module, the model learns multiscale features insufficiently, learns details inadequately, and reconstructs images less well. However, using too many feature enhancement modules leads to convergence difficulty or feature explosion, increasing the computing consumption and also degrading the convergence of the network. Therefore, based on the results of the experiment, we choose to use one feature enhancement module in our network.
Setting of encoding network parameters. We also test the influence of the encoding network depth. Specifically, we fix the other module parameters and set the encoding network depth to L = {3, 6, 12, 14, 16} for verification. The model is trained with coding networks of 3, 6, 12, 14, and 16 dense-coded layers and decoding networks of the corresponding depth to obtain the corresponding fusion images. Objective evaluation indexes are computed on the results, and the quantitative results are shown in Figure 13.
The objective evaluation indexes show that increasing the depth of the dense coding network can significantly improve the performance of the network. The reason lies in the increase in network depth and width, which enhances the ability of the network to extract and reconstruct high-dimensional features. However, when the network depth is greater than 16 layers, the network is too deep and wide: the redundancy of feature extraction increases and features are also lost, which makes the network difficult to converge. Figure 13 shows that the depth of the coding network affects the performance of the network. After testing, the depth of the dense coding network was finally set to 14.
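One way to see why depth increases both capacity and redundancy is to count channels and weights as layers are added. The sketch below assumes DenseNet-style connectivity, in which every layer receives the concatenation of all earlier feature maps and adds a fixed number of new channels; the text here does not confirm this wiring, and the input width (64), growth rate (32), and 3 × 3 kernels are hypothetical values chosen for illustration.

```python
def dense_encoder_stats(depth, c_in=64, growth=32, kernel=3):
    """Per-layer input widths, final width, and parameter count of a dense stack.

    Assumes layer i sees c_in + i * growth input channels (all earlier
    outputs concatenated) and emits `growth` new channels.
    """
    in_channels = [c_in + i * growth for i in range(depth)]
    params = sum(kernel * kernel * c * growth + growth for c in in_channels)
    out_channels = c_in + depth * growth
    return in_channels, out_channels, params

if __name__ == "__main__":
    # Depths tested in the ablation study: L = {3, 6, 12, 14, 16}.
    for depth in (3, 6, 12, 14, 16):
        _, out_channels, params = dense_encoder_stats(depth)
        print(f"L={depth:2d}  final channels={out_channels:4d}  parameters={params}")
```

Under these assumptions the final feature width grows linearly with depth while the parameter count grows roughly quadratically, since each new layer convolves over every earlier feature map. This is consistent with the qualitative observation above that very deep dense stacks become redundant and harder to train.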

Conclusions
In this paper, we propose a deep learning-based method to solve the pan-sharpening problem by combining convolutional neural network technology and domain-specific knowledge. Building on the existing pan-sharpening solutions, multiscale feature blocks are designed to process PAN images separately to extract richer and more complete spatial information, feature enhancement blocks and dense coding networks are used to learn more accurate mapping relationships, and a comprehensive loss function is designed to constrain image loss. Better fusion images can be obtained with full consideration of different spectral and spatial characteristics. In remote sensing images, regional spatial structure, land cover, and development characteristics are diverse. Because the method proposed in this paper is, in theory, more sensitive to multiscale features, MDECNN can achieve better results on different types of remote sensing images, such as areas with different sizes of seeding sites, diverse structures in densely built areas, and different urban greening proportions. This is significant for the fusion of remote sensing images with complex image information. At the same time, in remote sensing images with relatively uniform image features, the improvement in fusion effect from the proposed method is relatively limited, which reflects the limitations of multiscale feature image fusion. The experimental results on three kinds of satellite datasets show that the proposed method performs better than the existing methods in the pan-sharpening of a wide range of satellite data, which indicates the potential value of our network for different tasks. Next, we will take a loss function constrained by objective indexes as the starting point to further improve the network performance while ensuring spectral and spatial quality.
Data Availability Statement: Data sharing is not applicable to this article.