Detail Information Prior Net for Remote Sensing Image Pansharpening

Abstract: Pansharpening, which fuses the panchromatic (PAN) band with multispectral (MS) bands to obtain an MS image with the spatial resolution of the PAN image, has been a popular topic in remote sensing applications in recent years. Although deep-learning-based pansharpening algorithms have achieved better performance than traditional methods, they extract insufficient spatial information from the PAN image, producing low-quality pansharpened images. To address this problem, this paper proposes a novel progressive PAN-injected fusion method based on superresolution (SR). The network extracts the detail features of a PAN image by using two-stream PAN input; uses a feature fusion unit (FFU) to gradually inject low-frequency PAN features, with high-frequency PAN features added after subpixel convolution; uses a plain autoencoder to inject the extracted PAN features; and applies a structural similarity index measure (SSIM) loss to focus on the structural quality. Experiments performed on different datasets indicate that the proposed method outperforms several state-of-the-art pansharpening methods in both visual appearance and objective indexes, and that the SSIM loss can help improve the pansharpened quality on the original dataset.


Introduction
The sensors onboard satellite platforms record the digital number of land surfaces in different spectral channels. The acquired images have formed the basis for mapping different land surfaces. Thus, the spectral parameters of an image, such as the number of spectral channels, channel width, and mid-bandwidth, are important for evaluating the quality of remote sensing imagery. The spatial resolution, which is the area of the land surface represented by a pixel in remote sensing imagery, is another important parameter. High-resolution remote sensing imagery can distinctly describe the distribution and structure in a land surface, which forms the basis for fine surface mapping. Therefore, obtaining imagery with high spatial and spectral resolutions will enrich the information content in imagery and enhance the capacity for identifying various land surfaces.
Due to the limitations imposed by the data volume collected by the sensor, the data transmission between a satellite and Earth, and the incoming radiation energy into sensors within surface units [1], it is exceedingly difficult to obtain imagery with high spatial and spectral resolution. To address these problems, one panchromatic (PAN) band and multiple multispectral (MS) bands can be used when installing several different spectrum monitors for a sensor. Pansharpening, which can overcome the shortcomings of sensors, increases the spectral resolution of a PAN band by integrating it with MS bands. This process can also be viewed as an enhancement of the spatial resolution of the MS bands, with the optimization objective of maintaining their spectral features while increasing their spatial resolution. To date, pansharpening has become an important technique for processing remote sensing data. Based on whether deep learning (DL) is used, pansharpening methods can be categorized into conventional methods and DL-based methods.
Component substitution (CS)-based methods are one type of conventional pansharpening method. Considering that a PAN band receives a relatively broad range of wavelengths, generally covering the wavelength range of visible light, it is strongly correlated with the luminous intensity (I) component of the intensity-hue-saturation (IHS) color space. Based on this assumption, the IHS replacement method [2] first transforms an MS image into the IHS space and subsequently replaces the I component with the PAN band, thereby imparting the high-resolution information carried by the PAN band to the MS image and realizing pansharpening. As the formation mechanism of the reflectance varies among land surfaces, the IHS replacement method is prone to causing color distortions in some land surface features. The principal component analysis (PCA) method [3] converts an MS image to multiple independent components that contain the main land surface information and subsequently replaces the first principal component with the PAN band to produce a sharpened MS image. However, high computational costs and poor real-time performance pose challenges for the application of the PCA method to image pansharpening. The Gram-Schmidt adaptive (GSA) method [4] and the partial replacement adaptive component substitution (PRACS) method [5] are two improved CS-based pansharpening methods.
Another commonly employed pansharpening method is the multiresolution analysis (MRA)-based method. First, a PAN image and a low-resolution MS image are decomposed into one group of low-frequency images and one group of high-frequency images; second, the images at each corresponding scale are fused by using a combining algorithm; and last, the images are fused again at the original scale to produce a sharpened image. MRA-based combining methods require pyramid processing algorithms, such as the Laplacian pyramid transform and wavelet transforms. Two such methods are the generalized Laplacian pyramid with modulation transfer function-matched filter and regression-based injection model (MTF-GLP-CBD) [6] and the à Trous wavelet transform (ATWT) [7].
Because the differences among the regions, resolutions, spectral channels, and resolution conversion relationships for the channels are not adequately described by simple linear equations, spectral distortions appear in the pansharpened images. Characterized by nonlinear activation functions and multilayer convolution operations, a DL-based method, the convolutional neural network (CNN), has been extensively applied in areas that require the establishment of complex nonlinear relationships (e.g., pansharpening) in recent years. The CNN is capable of adaptively establishing complex relationships by supervised learning.
DL-based pansharpening methods have produced good results in applications. However, available methods upscale an MS image to the size of a PAN image simply by interpolation in a preprocessing procedure and fuse the interpolated MS images with PAN images (e.g., CS-based and DL-based methods). The information content in features fused using multiscale fusion methods (e.g., MRA-based methods) is limited, which generates some distorted results. Frontier research [8] is also exploring when to fuse or extract multiresolution features. In view of these two problems, this study presents a new progressive PAN-injected fusion method based on SR for remote sensing imagery, referred to as detail information prior net (DIPNet). The main contributions of this study are summarized as follows:
(1) We use two-stream PAN input to extract PAN features by using a convolution network.
(2) We use the feature fusion unit (FFU) to gradually inject low-frequency PAN features, and high-frequency PAN features are added after subpixel convolution to refine the upsampled MS image.
(3) We use a plain autoencoder to inject the extracted PAN features.
(4) We use the structural similarity index measure (SSIM) [9] loss to guide the network during training, focusing on the structural quality.
The remainder of this paper is organized as follows: Section 2 presents related works in this study. Section 3 details the pansharpening method proposed in this study. Section 4 introduces the experimental data used in this study as well as the methods applied to evaluate the results. In Section 5, experimental results and comparisons are presented. Section 6 focuses on the discussion and evaluation of the results. Section 7 concludes the paper.

Image Upscaling and Pyramid Processing Algorithm
Interpolation-based (e.g., nearest neighbor and bilinear) image upscaling algorithms are prone to blurring images after increasing their size. This phenomenon becomes more pronounced as the upscaling factor increases, mainly due to a lack of high-frequency, detailed spatial information after image upscaling. This blurring phenomenon is similarly associated with the CNN-based SR method. However, the CNN-based SR method is capable of adaptively adjusting the SR equation based on the image content and, consequently, inhibiting blurring to a certain extent. Nevertheless, the CNN-based SR method is unable to completely eliminate blurring.
The Gaussian-Laplacian pyramid-based processing algorithm integrates high- and low-frequency information at multiple scales and has shown relatively good performance in the merging and fusion of images. Similarly, for pansharpening, detailed features can be restored by adding the high-frequency information contained in a PAN image to an SR-upscaled MS image at multiple scales. This study presents a pansharpening method referred to as DIPNet that uses high-frequency, detailed information and the fusion of low-frequency PAN information.

Deep-Learning-Based Pansharpening
Based on the selected model, DL-based pansharpening methods can be categorized into four main types, namely image-feature-based methods, autoencoders, SR methods, and GANs.
Image-feature-based methods: Image-feature-based pansharpening methods have effective network architectures designed to correspond to the features of fused images. An MSDCNN [10], which involves the use of convolution kernels of varying sizes to extract multiscale features to improve the fusion performance, was designed to take full advantage of the multiscale information contained in image features. A network for pansharpening, referred to as PanNet [11], was proposed to improve the fusion of high-resolution satellite imagery. In the PanNet architecture, high-frequency image information is employed to train a residual network (ResNet) [12] to obtain the details missing in low-resolution images. Based on the general idea of PanNet, a deep multiscale detail network (DMDNet) [13] was designed by superseding the conventional residual module with a grouped, multiscale dilated convolutional residual module. The performance of DMDNet is superior to that of PanNet in migration, fusion, and reconstruction. Moreover, in the field of image restoration, You Only Look Yourself (YOLY) [14] uses image features to design an unsupervised image dehaze model. Therefore, the design of the network, which is based on image features, can achieve improved performance.
Autoencoders: An autoencoder converts an input image to deep features through a series of nonlinear mapping operations and subsequently restores the original image by decoding. Sparse [15] and convolutional [16] autoencoders encode the PAN and MS bands into sparse matrix features and subsequently enhance the spatial resolution of the MS image with the PAN image by decoding. TFNet [17] is a pansharpening network based on a convolutional autoencoder. Through a two-stream architecture, TFNet extracts features from an MS image and a PAN image and ultimately reconstructs a high-spatial-resolution, high-spectral-resolution image using a decoder.
SR methods: SR-based pansharpening methods view pansharpening as an SR problem in MS bands under the constraints of a PAN band. Using the SRCNN [18] architecture, the PNN method [19] integrates the spatial information in a PAN band during the SR process by introducing upsampled MS and PAN information and produces results superior to those produced by conventional methods. To further enhance the spatial resolution of imagery, a deep residual PNN [20] method was introduced by improving the PNN method with the ResNet architecture. A bidirectional pyramid network [21] extracts features from a PAN image by convolution operations and produces good pansharpening results by subpixel convolutional SR fusion of MS and PAN image features at corresponding scales. The PCDRN [22] method progressively fuses images through ResNets and interpolation based on the scale relationship between MS images and PAN images. The PCDRN method has shown good fusion performance in high-resolution satellite imagery. The SR-guided progressive pansharpening based on a deep CNN (SRPPNN) [23] method upscales a low-resolution MS image by progressive SR and integrates it with a multiscale, high-frequency PAN image. This method has yielded good results in the pansharpening of remote sensing imagery.
GANs: The GAN architecture contains a generator coupled with a discriminator and achieves collaborative optimization through adversarial training. This architecture has achieved good results in areas such as image generation and style transfer. As pansharpening can be viewed as an image generation problem, deep networks based on the GAN architecture can be also employed in pansharpening. For example, the pansharpening GAN (PSGAN) [24] reconstructs high-spatial-resolution multiband images with TFNet as a generator and a conditional discriminator. Similarly, through the improvement of the PSGAN architecture, a residual encoder-decoder conditional GAN [25] was designed to further enhance the capacity to fuse remote sensing imagery. GAN-based pansharpening methods can help to describe the nonlinear mapping relationships among remote sensing images and produce relatively good results.

Framework of the Method
The core ideas of DIPNet are described as follows:
(1) A PAN band contains potential information in the MS bands. Low-frequency PAN information can reflect the main MS information. High-frequency PAN information can reflect the details in the PAN band.
(2) In this study, pansharpening is viewed as a PAN band-guided SR problem. High-frequency PAN information is added to ameliorate the SR-induced blurring problem.
(3) Multiple SR processes are required to obtain an MS image with the same spatial resolution as the PAN band. In conventional methods, single-scale fusion is inordinately simple, while features fused at multiple scales have limited information content. Multiscale high- and low-frequency deep PAN and MS features can be combined to better describe the mapping relationship between PAN bands and MS bands and to achieve higher-accuracy pansharpening.
(4) A multiscale auxiliary encoder with detailed PAN information and potential MS information in the PAN band is used to further reconstruct spatial information for the MS image.
For clarity, Figure 1 shows the workflow of the proposed method, and Figure 2 shows the network architecture designed in this study based on the abovementioned ideas. To facilitate the description of the problem, let h and l be the spatial resolutions of the PAN image P and the MS image M, respectively, so that the PAN image and MS image are denoted $P^h$ and $M^l$, respectively. Pansharpening fuses these two images into an MS image $M^h$ with a spatial resolution of h. DIPNet involves four main steps:
(1) Frequency decomposition. In this step, $P^h$ is decomposed into a high-pass component $P^h_H$ and a low-pass component $P^h_L$. $P^h_H$ reflects the high-frequency details (e.g., boundaries) of $P^h$. $P^h_L$ reflects the complete spectral features (e.g., color features in a relatively large local area) of $P^h$. Frequency decomposition is achieved by Gaussian filtering. First, a Gaussian filter matrix with a window size of $W_r$ is established. Second, $P^h$ is filtered, and the result is treated as $P^h_L$. The difference between $P^h$ and $P^h_L$ is treated as $P^h_H$.
(2) Feature Extraction. Features are extracted from $P^h_H$ and $P^h_L$ using a 3 × 3 convolution operation followed by the ResNet module. Features $F(P^h_H)$ and $F(P^h_L)$, each with a spatial resolution of h and a total of K channels, are thus obtained. In addition, features are extracted from $M^l$ using a 3 × 3 convolution operation, which yields MS image features $F(M^l)$ with a total of K channels and a resolution of l.
In many cases, the spatial-resolution ratio between a PAN image and an MS image varies (e.g., two, four, or eight). In each convolutional downsampling operation, the output feature size is half the input feature size, so the number of downsampling iterations required to bring the PAN image to the spatial resolution of the MS bands also varies. For ease of description, let m be the intermediate resolution. For example, when the resolution ratio of an MS image to a PAN image is 4, l is 4, m is 2, and h is 1; when the resolution ratio of an MS image to a PAN image is 6, l is 6, m is 2, and h is 1. This paper discusses the situation in which the resolution ratio of an MS image to a PAN image is 4, which is suitable for most high-resolution satellite images. $F(P^h_H)$ and $F(P^h_L)$ are downsampled by a 3 × 3 convolution operation with a step size of 2. Features are further extracted using the ResNet module. A high-pass component and a low-pass component, each with a spatial resolution of m, are thus obtained; they are denoted as $F(P^m_H)$ and $F(P^m_L)$, respectively. Similarly, a high-pass component and a low-pass component, each with a spatial resolution of l, can be obtained; they are denoted as $F(P^l_H)$ and $F(P^l_L)$, respectively. From this process, a low-frequency PAN information feature group $F(P^{h,m,l}_L)$ and a high-frequency PAN information feature group $F(P^{h,m,l}_H)$ are obtained:

$$F(P^{h,m,l}_L) = \{F(P^h_L),\, F(P^m_L),\, F(P^l_L)\}, \qquad F(P^{h,m,l}_H) = \{F(P^h_H),\, F(P^m_H),\, F(P^l_H)\}$$

(3) Feature Fusion (FF). $F(P^l_L)$ and $F(M^l)$ are fused using an FFU. Features $F(M^m_F)$ with a resolution of m are obtained by SR and subsequently added to $F(P^m_H)$ pixel by pixel. This process is repeated, and ultimately, MS features $F(M^h_F)$ with a resolution of h are obtained.
In this process, MS features are fused with low- and high-frequency PAN features at multiple scales. Thus, progressively fused MS features have more information content than features extracted from an interpolation-upscaled MS image.
(4) Image Reconstruction. An autoencoder is used to reconstruct the image structure based on $F(P^{h,m,l}_L)$, $F(P^{h,m,l}_H)$, and $F(M^h_F)$ (the fused features obtained by FF-based SR). An MS image with an enhanced spatial resolution is thus obtained. In this process, multiscale PAN features are injected into the decoder to further increase the information content of the MS image.
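Step (1) above, the Gaussian frequency decomposition, can be sketched in a few lines. This is a minimal NumPy illustration rather than the authors' implementation: the function names, the reflect padding at the borders, and the random test image are assumptions, while the window size of 11 and variance of 1 are the values reported later in the experimental settings.

```python
import numpy as np

def gaussian_kernel(size=11, sigma=1.0):
    """Return a normalized 2-D Gaussian kernel of shape (size, size)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()

def decompose(pan, size=11, sigma=1.0):
    """Split a 2-D PAN array into (low, high) so that low + high == pan."""
    pan = pan.astype(np.float64)
    pad = size // 2
    # Reflect padding is an assumption; the paper does not state the border rule.
    padded = np.pad(pan, pad, mode="reflect")
    kernel = gaussian_kernel(size, sigma)
    low = np.zeros(pan.shape, dtype=np.float64)
    # Direct (unoptimized) filtering; the kernel is symmetric, so
    # correlation and convolution coincide.
    for dy in range(size):
        for dx in range(size):
            low += kernel[dy, dx] * padded[dy:dy + pan.shape[0],
                                           dx:dx + pan.shape[1]]
    high = pan - low  # high-pass component is the residual
    return low, high

rng = np.random.default_rng(0)
pan = rng.uniform(0, 2047, size=(64, 64))  # synthetic 11-bit PAN patch
low, high = decompose(pan)
```

Because the high-pass component is defined as the residual, the two components always sum back to the input, and a constant image yields a zero high-pass component.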
Regarding the network activation function, a leaky rectified linear unit with a parameter of 0.2 is set as the activation function for all the convolutional layers, except for the ResNet module and the subpixel and output convolutional layers for SR.
The following sections introduce the FF-based SR module and the image reconstruction module into which high- and low-frequency PAN information is injected.

FF-Based SR Module (Step 3)
Prior to SR, the FFU is used to fuse $F(P^l_L)$ and $F(M^l)$ in the following manner:

$$F(M^l_F) = \mathrm{Conv}_{1\times1}\big(F(P^l_L) \;©\; F(M^l) \;©\; (F(P^l_L) \oplus F(M^l))\big)$$

where © represents an operation that concatenates feature images along the channel dimension, ⊕ denotes an operation that adds feature images pixel by pixel, $\mathrm{Conv}_{1\times1}$ is a convolution function with a convolution kernel size of 1, and $F(M^l_F)$ denotes the fused MS features (the subscript F indicates fused features). The FFU produces combined features with a total of 3K channels through the concatenation operation and subsequently performs a 1 × 1 convolution operation on the combined features to produce fused features with a total of K channels. Thus, the extracted features are linearly fused together with the pre-added features.
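The FFU described above can be sketched as a concatenation followed by a 1 × 1 convolution. The sketch below is an illustration under assumptions: the three concatenated inputs (low-frequency PAN features, MS features, and their pixel-wise sum) are inferred from the stated 3K channels and the pre-added features, and the weights are random placeholders, not trained parameters.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1x1 convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def feature_fusion_unit(f_pan_low, f_ms, weight, bias):
    """Concatenate [PAN-low, MS, PAN-low + MS] (3K channels), project to K."""
    pre_added = f_pan_low + f_ms                     # pixel-wise addition
    stacked = np.concatenate([f_pan_low, f_ms, pre_added], axis=0)
    return conv1x1(stacked, weight, bias)            # back to K channels

K, H, W = 4, 8, 8                                    # small hypothetical sizes
rng = np.random.default_rng(0)
f_pan_low = rng.standard_normal((K, H, W))
f_ms = rng.standard_normal((K, H, W))
weight = rng.standard_normal((K, 3 * K)) * 0.1       # untrained placeholder
bias = np.zeros(K)
fused = feature_fusion_unit(f_pan_low, f_ms, weight, bias)
```

Since the 1 × 1 convolution is linear in its input, the unit can learn any per-pixel linear mix of the two feature sets and their sum.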
Subsequently, the ResNet module is used to extract features from $F(M^l_F)$:

$$F_{res}(M^l_F) = \mathrm{RBs}(F(M^l_F)) \oplus F(M^l_F)$$

where RBs represents the extraction operations by a total of L ResNet modules. The input features $F(M^l_F)$ are added for residual learning. Thus, $F(M^l_F)$-based deep features $F_{res}(M^l_F)$ are obtained. For the extraction of PAN features in Step 2, the residual module similarly consists of a total of L ResNet modules.
Subpixel convolution [26] is an upsampling method based on conventional convolution and pixel rearrangement in feature images and can be used to achieve image SR. Let r be the upscaling factor and c × h × w (c, h, and w are the number of channels, height, and width, respectively) be the size of the initial input feature image $F_{res}(M^l_F)$. First, $F_{res}(M^l_F)$ is convolved with a total of $r^2c$ convolution kernels, and an output feature image with a size of $r^2c \times h \times w$ (i.e., a total of h × w vectors, each with a length of $r^2c$) is obtained. Second, each vector with a length of $r^2c$ is rearranged into a c × r × r pixel block. Thus, a feature image with a size of c × hr × wr is obtained. As the current resolution of this image is m, it is denoted by $F(M^m_\uparrow)$. In conventional image SR, due to a lack of sufficient information for predicting the postupscaling pixel values, the post-SR image lacks detailed spatial features, i.e., the post-SR image is blurry. To address this problem, high-frequency features are fused to sharpen the blurry areas. The previously obtained $F(M^m_\uparrow)$ and the high-frequency information image $F(P^m_H)$ of the corresponding size extracted by convolution operations are added to restore the feature image that has become blurry after upscaling (the result is denoted $F(M^m)$), as shown in the following equation:

$$F(M^m) = F(M^m_\uparrow) \oplus F(P^m_H)$$

Based on these steps, the resolution of the MS image features is improved from l to m. Similarly, the image features can be improved from m to h. Ultimately, fused MS and PAN image features $F(M^h)$ are obtained.
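The pixel rearrangement step of subpixel convolution, followed by the addition of high-frequency PAN features, can be sketched as follows. This is a minimal NumPy illustration: the preceding convolution that produces the $r^2c$ channels is assumed to have been applied already, the channel ordering follows the common convention in which output pixel (y·r + i, x·r + j) of channel k comes from input channel k·r² + i·r + j, and the function names are hypothetical.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange features of shape (c*r*r, h, w) into (c, h*r, w*r)."""
    crr, h, w = x.shape
    if crr % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c = crr // (r * r)
    x = x.reshape(c, r, r, h, w)      # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)    # reorder to (c, y, i, x, j)
    return x.reshape(c, h * r, w * r)

def upsample_and_sharpen(f_res, f_pan_high, r=2):
    """Upscale MS features, then add high-frequency PAN features pixel-wise."""
    return pixel_shuffle(f_res, r) + f_pan_high

# Four 2x2 channels become one 4x4 channel when r = 2.
demo = pixel_shuffle(np.arange(16.0).reshape(4, 2, 2), r=2)
```

Each 2 × 2 block of `demo` gathers the values that the four input channels held at one spatial position, which is why the upscaled image contains no information beyond what the convolution predicted and benefits from the added high-frequency PAN term.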

Image Reconstruction Module into Which High- and Low-Frequency PAN Information Is Injected (Step 4)
A convolutional autoencoder can effectively encode an image to produce high-dimensional coded information and decode deep information by reversing the encoding process to reconstruct the input image. Thus, an autoencoder is employed to reconstruct the image based on $F(M^h)$:

$$F(e^h),\, F(e^m),\, F(e^l) = E(F(M^h))$$

where E represents a three-layer convolutional encoding operation (the first layer is a convolutional operation with a step size of 1 performed to produce coded features with a total of K channels and a resolution of h; the last two layers are convolutional operations with a step size of 2 performed to produce coded features with a total of 2K channels and a resolution of m and coded features with a total of 4K channels and a resolution of l), and $F(e^h)$, $F(e^m)$, and $F(e^l)$ are coded features with scales of h, m, and l, respectively. Conventional convolution operations produce feature images with specific sizes based on the convolution kernel size, padding, and step size of the sliding window. Generally, convolution operations reduce the feature size. To preserve the feature size, it is possible to pad the boundaries of the feature image with numerical values.
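The effect of kernel size, step size, and boundary padding on the feature size can be checked with the standard output-size formula. The sketch below is illustrative only: the 3 × 3 kernel and padding of 1 for the encoder layers and the 128-pixel input are assumptions, since the text specifies only the step sizes and channel counts.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial size along one axis after a convolution."""
    return (n + 2 * padding - kernel) // stride + 1

def same_padding(kernel):
    """Padding that preserves the size for an odd kernel with stride 1."""
    return (kernel - 1) // 2

h = 128  # hypothetical feature size at resolution h
e_h = conv_output_size(h, kernel=3, stride=1, padding=same_padding(3))
e_m = conv_output_size(e_h, kernel=3, stride=2, padding=1)
e_l = conv_output_size(e_m, kernel=3, stride=2, padding=1)
print(e_h, e_m, e_l)  # 128 64 32
```

Without padding, the same 3 × 3 convolution with a step size of 1 would shrink a 128-pixel axis to 126 pixels, which is the size reduction mentioned above.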
The encoder applied in this study encodes each feature image by taking advantage of the properties of convolution to recover multiscale feature information and thus facilitate the injection of multiscale PAN features.
To utilize important multiscale PAN information, the high-frequency PAN information feature group $F(P^{h,m,l}_H)$ and the low-frequency PAN information feature group $F(P^{h,m,l}_L)$ are injected into the features that require decoding through the decoder architecture, where © represents an operation that concatenates feature images along the channel dimension, and DeConv represents a deconvolution (transposed convolution) operation, which reverses the spatial downsampling of convolution and can upscale and output feature images with specific numbers of channels. Both $\mathrm{DeConv}_1$ and $\mathrm{DeConv}_2$ are 2 × 2 deconvolution operations with a step size of 2; $\mathrm{Conv}_{3\times3}$ represents a conventional 3 × 3 convolution operation; and $F(d^m)$, $F(d^h)$, and $F(d)$ are fused high-frequency features, fused low-frequency features, and fused coded features, respectively, with resolutions of l, m, and h and 3K, 3K, and K channels, respectively. The decoder used in this study upscales and decodes coded features by taking advantage of the properties of deconvolution.

Loss Function
Based on the abovementioned architecture, the whole pansharpening network can be represented by a single equation in which θ denotes the network parameters, $f_E$ represents the extraction of multiscale features from the high- and low-pass components of the PAN image, $f_{SR}$ represents FF-based SR, and $f_{AE}$ is the function of the autoencoder structure. The SSIM function can quantitatively reflect the differences in brightness, contrast, and structure between two images. This function can make the network focus on the structural information of the image rather than on the pixel-wise distance between the result and the ground truth (e.g., MAE and MSE). The SSIM loss function is used to train the model in this study, where $M^l_i$, $P^h_i$, and $M^h_i$ represent the ith training sample. As it is impossible to obtain true high-resolution MS images, the training data are preprocessed according to Wald's protocol [27]. Specifically, the downsampled MS and PAN images are input into the network model; the original MS image is treated as the true-value image; and Equation (13) is applied to calculate the loss and update the network.
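An SSIM-based loss of the kind described above can be sketched as one minus the SSIM averaged over bands. This is a simplified illustration, not the paper's Equation (13): the form 1 − SSIM is an assumption, the statistics are computed over the whole image instead of the local sliding windows used by the original SSIM, and the constants 0.01 and 0.03 are the conventional SSIM defaults.

```python
import numpy as np

def ssim_global(x, y, data_range=2047.0):
    """SSIM from whole-image statistics (no sliding window)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def ssim_loss(pred, target, data_range=2047.0):
    """Mean of (1 - SSIM) over bands; inputs have shape (bands, H, W)."""
    scores = [ssim_global(p, t, data_range) for p, t in zip(pred, target)]
    return 1.0 - float(np.mean(scores))

rng = np.random.default_rng(0)
target = rng.uniform(0, 2047, size=(4, 32, 32))       # synthetic ground truth
noisy = target + rng.normal(0, 200, size=target.shape)
loss_same = ssim_loss(target, target)
loss_noisy = ssim_loss(noisy, target)
```

The loss is zero only when the prediction matches the reference in mean, variance, and covariance, which is how it steers training toward structural agreement.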

Datasets
Three datasets produced by different satellites were selected for evaluating DIPNet and the comparison methods. The following subsection details information (i.e., sensors, wavelength, spatial resolution, and number of bands) about the datasets.

QuickBird Dataset
This dataset, taken from [23], contains imagery for six regions in different geographic locations. The surface cover types in these regions include forests, farmlands, buildings, and rivers. The MS images contain the visible-light bands (RGB channels) and the near-infrared (NIR) band. The PAN images cover the RGB and NIR bands of the MS images. The spatial resolution of the PAN images (0.7 m) is four times finer than that of the MS images (2.8 m).

Dataset Preprocessing
The images have an 11-bit radiometric resolution, ranging from 0 to 2047. In this study, the images were not subjected to any relevant radiation corrections. The abovementioned images differ in size. To facilitate testing and training, the MS images and PAN images for the corresponding areas were cropped to 256 × 256 image blocks and 1024 × 1024 image blocks, respectively, which were then randomly divided into a training set and testing set. Table 1 summarizes the number of image blocks obtained. In this study, labels were prepared according to Wald's protocol for model training. The procedure is detailed as follows: first, the MS image and PAN image were downsampled fourfold based on the MTF low-pass filter of the corresponding sensor to a 64 × 64 image $M_{\downarrow 4}$ and a 256 × 256 image $P_{\downarrow 4}$, respectively. Thus, a simulated image pair ($M_{\downarrow 4}$, $P_{\downarrow 4}$, M) was obtained to allow for the use of the original MS image as a supervision objective for training. For the training set, each original MS-PAN image pair was similarly downsampled to obtain a simulated image pair ($M_{\downarrow 4}$, $P_{\downarrow 4}$, M). The results were evaluated.
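The construction of a simulated training sample under Wald's protocol can be sketched as follows. This is an illustration under a loud assumption: a plain 4 × 4 block average stands in for the sensor-specific MTF low-pass filter used in the paper, and the arrays are synthetic.

```python
import numpy as np

def box_downsample(img, factor=4):
    """Average non-overlapping factor x factor blocks of a (..., H, W) array."""
    *lead, h, w = img.shape
    if h % factor or w % factor:
        raise ValueError("height and width must be divisible by factor")
    blocks = img.reshape(*lead, h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(-3, -1))

def make_training_sample(ms, pan, factor=4):
    """Return (degraded MS, degraded PAN, original MS used as the target)."""
    # Block averaging is a stand-in for the MTF-matched filter, not the
    # filter used in the paper.
    return box_downsample(ms, factor), box_downsample(pan, factor), ms

rng = np.random.default_rng(0)
ms = rng.uniform(0, 2047, size=(4, 256, 256))    # 4-band MS block
pan = rng.uniform(0, 2047, size=(1024, 1024))    # PAN block
ms_lr, pan_lr, target = make_training_sample(ms, pan)
```

With these sizes the degraded MS block is 64 × 64 and the degraded PAN block is 256 × 256, so the network output has the size of the original MS block and can be compared with it directly.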

Experimental and Comparison Methods
During the training process, an Adam optimizer with an initial learning rate of 0.0001, a weight decay parameter of $10^{-8}$, and other parameters set to their respective default values was employed for 1000 epochs to compare DIPNet with other methods. The training parameters are detailed as follows: the training batch size was set to 16. Prior to training, all the weights of the neural network were initialized using a normal distribution with a mean of 0 and a variance of 0.02. All the other parameters were set to their respective default values. During the training process, several data augmentation techniques, including random horizontal flipping, random vertical flipping, random rotation by 90°, and random cropping, were used. In the random cropping process, each simulated image pair ($M_{\downarrow 4}$, $P_{\downarrow 4}$, M) was cropped to a 32 × 32 $M_{\downarrow 4}$, a 128 × 128 $P_{\downarrow 4}$, and a 128 × 128 M.
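Random cropping must keep the three images spatially aligned, which means the crop window in the degraded MS image has to be scaled by the resolution ratio for the other two images. The following sketch shows one way to do this; the function name and the way the window is sampled are assumptions, while the crop sizes are those given above.

```python
import numpy as np

def random_aligned_crop(ms_lr, pan_lr, ms_ref, rng, ms_size=32, ratio=4):
    """Crop matching windows from (bands,h,w), (H,W), and (bands,H,W) arrays."""
    _, h, w = ms_lr.shape
    top = int(rng.integers(0, h - ms_size + 1))
    left = int(rng.integers(0, w - ms_size + 1))
    # The same window, expressed on the finer grid.
    big, t, l = ms_size * ratio, top * ratio, left * ratio
    return (ms_lr[:, top:top + ms_size, left:left + ms_size],
            pan_lr[t:t + big, l:l + big],
            ms_ref[:, t:t + big, l:l + big])

rng = np.random.default_rng(0)
ms_lr = rng.uniform(0, 2047, size=(4, 64, 64))
pan_lr = rng.uniform(0, 2047, size=(256, 256))
ms_ref = rng.uniform(0, 2047, size=(4, 256, 256))
ms_c, pan_c, ref_c = random_aligned_crop(ms_lr, pan_lr, ms_ref, rng)
```

Flips and 90° rotations would likewise have to be applied identically to all three arrays so that the supervision target stays registered with the inputs.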
With respect to the parameters of the experimental method, the size $W_r$ and variance of the Gaussian filter kernel were set to 11 and 1, respectively, and the number of convolution kernels K and number of residual blocks L were set to 64 and 2, respectively. To prevent randomness from affecting the experimental results, the same seed was set for deterministic calculations to ensure that the experimental results were reproducible.
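Fixing the seed of every random number generator in use is the usual way to make such runs repeatable. The helper below is a generic sketch, not the authors' code: its name and the seed value are hypothetical, and only the Python and NumPy generators are seeded in executable code.

```python
import random
import numpy as np

def seed_everything(seed=0):
    """Seed the Python and NumPy random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    # In a PyTorch setup one would typically also call
    # torch.manual_seed(seed) and enable deterministic cuDNN kernels.

seed_everything(42)
first = (random.random(), float(np.random.rand()))
seed_everything(42)
second = (random.random(), float(np.random.rand()))
```

After reseeding with the same value, the two draws are identical, which is the property the reproducibility claim relies on.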
Four conventional methods (GSA, PRACS, ATWT, and MTF-GLP-CBD) and five DL methods (PNN, MSDCNN, PanNet, TFNet, and SRPPNN) were selected for comparison in this study. The MATLAB code for pansharpening provided by Vivone et al. [29] was used for the conventional pansharpening methods and comparison calculations. The experiment was conducted on a computer with an AMD Ryzen 5 3600 3.6 GHz processor, 32 GB of memory, and an NVIDIA RTX 2070 Super graphics card. The coding environment involved Windows 10 (64 bit), MATLAB (R2013a), Python 3.7.4, and PyTorch 1.6.0.

Quantitative Evaluation Indices
Several quantitative indices, including the relative dimensionless global error in synthesis (ERGAS) [30], spectral angle mapper (SAM) [31], universal image quality index (UIQI) [32] and its extended index $Q2^n$ [33], spatial correlation coefficient (SCC) [34], and quality without reference (QNR) [35], were employed in the experiment. According to the types of indicators, we divided them into three parts to provide a detailed description.
(1) Indices for spectrum: The ERGAS and SAM primarily reflect the spectral distortions in an enhanced image compared to a reference image. Lower values of ERGAS and SAM indicate that the spectral distribution of an enhanced image is similar to that of the reference image. The details are provided as follows:

$$\mathrm{ERGAS}(x, y) = 100\,\frac{h}{l}\sqrt{\frac{1}{N}\sum_{i=1}^{N}\frac{\mathrm{RMSE}(x_i, y_i)^2}{\mathrm{MEAN}(y_i)^2}}$$

$$\mathrm{SAM}(x, y) = \arccos\left(\frac{x \cdot y}{\lVert x \rVert\,\lVert y \rVert}\right) \qquad (16)$$

where x and y are the pansharpened image and ground truth, respectively; $\mathrm{RMSE}(x_i, y_i)$ is the root-mean-square error of the ith band computed over the m pixels in the images; h and l are the spatial resolutions of the PAN image and MS image, respectively; and $\mathrm{MEAN}(y_i)$ is the mean of the ith band of the ground truth, which has a total of N bands.
(2) Indices for structure: UIQI and $Q2^n$ represent the quality of each band and the quality of all the bands, respectively. High values of the UIQI and $Q2^n$ suggest that the quality of the resultant and reference images is similar. Their equations are expressed as follows:

$$\mathrm{UIQI}(x, y) = \frac{4\,\sigma_{xy}\,\mu_x\,\mu_y}{(\sigma_x^2 + \sigma_y^2)(\mu_x^2 + \mu_y^2)}$$

$$Q2^n(X, Y) = \frac{4\,|\sigma_{XY}|\,|\mu_X|\,|\mu_Y|}{(\sigma_X^2 + \sigma_Y^2)(|\mu_X|^2 + |\mu_Y|^2)}$$

For UIQI, $\mu_x$ and $\mu_y$ are the means of x and y, respectively; $\sigma_x^2$ and $\sigma_y^2$ are the variances of x and y, respectively; and $\sigma_{xy}$ denotes the covariance between x and y. Generally, the index is calculated within a sliding kernel.
For $Q2^n$, X and Y are the hypercomplex representations of x and y, respectively; $\mu_X$ and $\mu_Y$ are the means of X and Y, respectively; $\sigma_X^2$ and $\sigma_Y^2$ are the variances of X and Y, respectively; and $\sigma_{XY}$ denotes the covariance between X and Y.
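The SCC is a spatial evaluation index that primarily reflects the difference in high-frequency details between two images, and a value of SCC near 1 indicates good spatial quality of the resultant image, as follows:

$$\mathrm{SCC}(x, y) = \frac{\sum_{i=1}^{w}\sum_{j=1}^{h}\big(\mathrm{Filter}(x)_{i,j} - \mu_{\mathrm{Filter}(x)}\big)\big(\mathrm{Filter}(y)_{i,j} - \mu_{\mathrm{Filter}(y)}\big)}{\sqrt{\sum_{i=1}^{w}\sum_{j=1}^{h}\big(\mathrm{Filter}(x)_{i,j} - \mu_{\mathrm{Filter}(x)}\big)^2\,\sum_{i=1}^{w}\sum_{j=1}^{h}\big(\mathrm{Filter}(y)_{i,j} - \mu_{\mathrm{Filter}(y)}\big)^2}}$$

where Filter is a high-frequency kernel, which is used to process the images; $\mu_{\mathrm{Filter}(x)}$ and $\mu_{\mathrm{Filter}(y)}$ are the means of Filter(x) and Filter(y), respectively; and w and h are the width and height, respectively, of an image.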
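(3) Indices for no reference: The QNR, which consists of $D_s$ and $D_\lambda$, mainly reflects the fusion performance in the absence of true reference values. A value of $D_s$ near 0 indicates low spatial (structural) distortion; a value of $D_\lambda$ near 0 indicates low spectral distortion; and a value of QNR near 1 indicates good pansharpening performance on the original data.

$$D_\lambda = \sqrt[p]{\frac{1}{C(C-1)}\sum_{c=1}^{C}\sum_{\substack{r=1\\ r\neq c}}^{C}\big|\mathrm{UIQI}(M_c, M_r) - \mathrm{UIQI}(x_c, x_r)\big|^{p}}$$

$$D_s = \sqrt[q]{\frac{1}{C}\sum_{c=1}^{C}\big|\mathrm{UIQI}(x_c, P) - \mathrm{UIQI}(M_c, P_\downarrow)\big|^{q}}$$

$$\mathrm{QNR} = (1 - D_\lambda)^{i}\,(1 - D_s)^{j}$$

where p and q denote positive integer exponents; M and P are the MS image and PAN image, respectively; x is the pansharpened image; $P_\downarrow$ is the PAN image degraded to the resolution of the MS image; i and j are the weighted parameters to quantify the spectral distortion and spatial distortion, respectively; and C is the number of bands in an MS image. In our experiment, p, q, i, and j are set to 1.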
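The reference-based indices in items (1) and (2) above can be sketched compactly. The code below follows the standard definitions of ERGAS, SAM, and UIQI and is not taken from the evaluation toolbox used in the paper: UIQI is computed from whole-band statistics instead of a sliding kernel, SAM is returned in radians and averaged over pixels, and the test arrays are synthetic.

```python
import numpy as np

def ergas(x, y, ratio=4):
    """ERGAS for (bands, H, W) arrays; ratio = l / h (MS over PAN pixel size)."""
    rmse = np.sqrt(((x - y) ** 2).mean(axis=(1, 2)))   # per-band RMSE
    means = y.mean(axis=(1, 2))                        # per-band reference mean
    return float(100.0 / ratio * np.sqrt(np.mean((rmse / means) ** 2)))

def sam(x, y, eps=1e-12):
    """Mean spectral angle (radians) between per-pixel spectra."""
    dot = (x * y).sum(axis=0)
    norms = np.linalg.norm(x, axis=0) * np.linalg.norm(y, axis=0)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)
    return float(np.arccos(cos).mean())

def uiqi(x, y):
    """UIQI of one band from global statistics (no sliding kernel)."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return float(4 * cov * mu_x * mu_y /
                 ((var_x + var_y) * (mu_x ** 2 + mu_y ** 2)))

rng = np.random.default_rng(0)
y = rng.uniform(100, 2047, size=(4, 16, 16))   # synthetic ground truth
x = y + rng.normal(0, 10, size=y.shape)        # synthetic fused result
scores = {"ERGAS": ergas(x, y), "SAM": sam(x, y), "UIQI": uiqi(x[0], y[0])}
```

A perfect reconstruction gives ERGAS and SAM of 0 and a UIQI of 1, which matches the interpretation of the indices given above.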

Experimental Results
This section presents a visual comparison of DIPNet and the comparison methods. To facilitate visualization, the RGB bands of each image were clipped and linearly stretched at 2% to produce an 8-bit color image. To allow a clear visual comparison of the reconstructed images, the absolute difference between the true-value and fused images was multiplied by factors of 10, 4, and 4 for the QuickBird dataset, WorldView 2 dataset, and IKONOS dataset, respectively. Figure 3 shows the performance of each method on the QuickBird dataset. With respect to the original data, as shown in Figure 3I,IV, DIPNet performed the best in preserving both spectral information and structural information, whereas PanNet produced bright color spots at the edges of the buildings, and TFNet distorted the spectral information.
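A 2% linear stretch of the kind used for display can be sketched as follows. The interpretation is an assumption: the lowest and highest 2% of pixel values in each band are clipped, and the remaining range is mapped linearly to 0 to 255. The function name is hypothetical.

```python
import numpy as np

def stretch_to_uint8(band, clip_percent=2.0):
    """Clip the tails of a band at clip_percent and rescale to 8 bits."""
    lo, hi = np.percentile(band, [clip_percent, 100.0 - clip_percent])
    if hi <= lo:  # constant band: nothing to stretch
        return np.zeros(band.shape, dtype=np.uint8)
    scaled = (band.astype(np.float64) - lo) / (hi - lo)
    return np.round(np.clip(scaled, 0.0, 1.0) * 255.0).astype(np.uint8)

band = np.arange(2048.0).reshape(32, 64)   # synthetic 11-bit ramp
display = stretch_to_uint8(band)
```

Applying the function to each of the R, G, and B bands and stacking the results gives an 8-bit image suitable for visual comparison.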
With respect to the simulated data, as shown in Figure 3II,V, the result produced by DIPNet was the closest to the true-value image, while the four DL-based methods, namely PNN, MSDCNN, TFNet, and SRPPNN, also performed considerably well. However, as shown by the residual images in Figure 3III,VI, the performance of PanNet was inferior to that of the other DL methods in data reconstruction on the QuickBird dataset.  Figure 4 shows the performance of each method on the WorldView 2 dataset. With respect to the original data, as shown in Figure 4I,IV, DIPNet notably outperformed the other methods in fusion and reconstruction (evidenced, for example, by the structural edges of the trees and the swimming pool to the right of the building within the red box). With respect to the simulated data, as shown in Figure 4II,V, the results produced by DIPNet were the closest to the true-value image. In addition, DIPNet far outperformed the other methods in representing the edge information for the swimming pool within the red box. These findings, with the residual images in Figure 4III,VI, show that DIPNet outperformed the other methods in the reconstruction of the structural details on the WorldView 2 dataset.  Figure 5 shows the performance of each method on the IKONOS dataset. With respect to the original data, as shown in Figure 5I,IV, the edges of the buildings within the red box in the image produced by DIPNet were the smoothest and consistent with those in the original PAN image, whereas the edges of the buildings within the red box in the image produced by each of the other DL-based methods were distorted. While the buildings in the images produced by the conventional methods were structurally distinguishable, their colors differed from those in the true-value image. The simulated data in Figure 5II,V and the residual images in Figure 5III,VI show that DIPNet outperformed the other methods in the reconstruction of structural details on the IKONOS dataset.
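The display preparation described at the start of this section can be sketched as follows. This is a hedged illustration, not the authors' script: it assumes that the 2% stretch clips the lowest and highest 2% of pixel values before linear rescaling, it treats a band as a flat list of values, and the function names are hypothetical. The amplification factor would be 10 for QuickBird and 4 for WorldView 2 and IKONOS, as stated above.

```python
def stretch_to_uint8(band, clip_percent=2.0):
    """Linearly stretch pixel values to 0..255 after clipping both tails."""
    ordered = sorted(band)
    n = len(ordered)
    lo = ordered[int((n - 1) * clip_percent / 100.0)]
    hi = ordered[int((n - 1) * (100.0 - clip_percent) / 100.0)]
    if hi == lo:
        return [0 for _ in band]  # flat band: nothing to stretch
    out = []
    for v in band:
        t = (min(max(v, lo), hi) - lo) / (hi - lo)  # clip, then rescale to 0..1
        out.append(int(round(t * 255)))
    return out


def amplified_residual(reference, fused, factor):
    """Absolute difference scaled by `factor` and clipped to the 8-bit range."""
    return [min(255, abs(r - f) * factor) for r, f in zip(reference, fused)]
```

Amplifying the residual only changes how visible the errors are in the figure; it does not affect any of the quantitative indices.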

Comparison of the Quantitative Indices
This section presents a numeric assessment of DIPNet and the comparison methods. To facilitate a numeric comparison, the best, second-best, and third-best performances in Tables 2-4 are shown in red, green, and blue, respectively.
Table 2 summarizes a comparison of the values of the indices for the methods on 200 images from the QuickBird dataset. As demonstrated by the values of the first five indices, DIPNet outperformed the other methods in the preservation of the spectral and structural information. Regarding the last three indices, the $D_\lambda$ value for PRACS was the lowest, indicating that the PRACS method outperformed the other methods in terms of preserving the spectral information at the original scale. The $D_s$ value for DIPNet was the lowest, suggesting that DIPNet far outperformed the other methods in preservation of the structural information and took full advantage of the multiscale high- and low-frequency feature information contained in the PAN image. When evaluated by the total index QNR, DIPNet was second only to SRPPNN. This result is attributed to DIPNet not adequately considering spectral information: its $D_\lambda$ value is higher than that of SRPPNN.
Table 3 summarizes a comparison of the values of the indices for the methods on 150 images from the WorldView 2 dataset. As demonstrated by the first five indices, DIPNet outperformed the other methods in preservation of the spectral and structural information. Regarding the last three indices, PRACS again performed the best in preservation of spectral information, followed by DIPNet, suggesting good performance of DIPNet in preservation of the original MS information on the WorldView 2 dataset. The $D_s$ value for PanNet was the lowest, indicating good performance in the fusion of data acquired by the high-resolution satellites of the WorldView series. When evaluated by the total index QNR, DIPNet outperformed PanNet because DIPNet optimizes the nonlinear relationships between low-frequency PAN and MS information and high-frequency PAN and MS information at multiple scales.
Table 4 summarizes a comparison of the values of the indices for the methods on 50 images from the IKONOS dataset. As demonstrated by the first five indices, DIPNet outperformed the other methods in preservation of the spectral and structural information. With respect to the last three indices, the $D_\lambda$ value for DIPNet was the lowest, suggesting that DIPNet performed the best in preserving the MS information for a low-resolution satellite. The $D_s$ value for DIPNet was also the lowest. Thus, DIPNet performed the best in preservation of the structural information and in the total index QNR.

Ablation Experiment
In DIPNet, the high- and low-frequency PAN feature groups and MS features are progressively fused through means such as an SR module, an FFU module, and feature addition. An ablation experiment was conducted to examine the effectiveness of DIPNet. The SR of the MS features, fusion with low-frequency PAN information, fusion with high-frequency PAN information, low-frequency PAN information autoencoder, and high-frequency PAN information autoencoder are denoted as MSR, PL, PH, AEL, and AEH, respectively. The network parameters for the ablation experiment were set to L = 2 and K = 64. To facilitate numeric comparison, the best, second-best, and third-best performances in Tables 5-7 are shown in red, green, and blue, respectively.
To verify the performance improvement resulting from the integration of the high- and low-frequency PAN feature groups with MSR, the architecture of DIPNet was split into the following: MSR; FFU-based SR of the MS and low-frequency PAN information (MSR+PL); MSR combined with high-frequency PAN information (MSR+PH); FFU-based SR of the MS and low-frequency PAN information combined with high-frequency PAN information (MSR+PL+PH); FFU-based SR of the MS and low-frequency PAN information combined with high-frequency PAN information and low-frequency PAN information autoencoder (MSR+PL+PH+AEL); and FFU-based SR of the MS and low-frequency PAN information combined with high-frequency PAN information and high-frequency PAN information autoencoder (MSR+PL+PH+AEH). Rows 1, 2, 3, 4, 5, 6, and 8 in each of Tables 5-7 show the results produced by these six components and the complete DIPNet under the same training conditions.
As demonstrated in these three tables, MSR alone could not produce relatively good quantitative results due to a lack of PAN information. Adding PH to MSR slightly improved the indices due to the addition of some high-frequency information after upsampling of the network. Adding PL to MSR improved the indices to a far greater extent than adding PH, due to the inclusion of MS-band information in the PAN image. The improvement in the indices from integrating a combination of PL and PH with MSR differed insignificantly from that from integrating PL alone with MSR. Introducing the features into the autoencoder for reconstruction further improved the indices compared to those with the integration of MSR and PL, suggesting that an autoencoder with multiscale high- and low-frequency PAN features can improve the robustness of the network. On the IKONOS dataset, however, using the low-frequency PAN information autoencoder caused more of the indices to deteriorate.

Function of the FFU
Equation (2) details the fusion method for the FFU. We believe that a simple feature addition damages the detailed outline features at the edges that can be potentially extracted.
A comparison of rows 7 and 8 in Tables 5-7 under the same conditions shows that the proposed FFU significantly improved the fusion performance on the IKONOS dataset. Although the FFU did not improve the fusion performance on the images acquired by the other satellites, it also did not significantly degrade it. Figure 6 visualizes the effects of the FFU on the network. A comparison of Figure 6a,c with Figure 6d,f reveals that the fused feature image produced by the network with the FFU showed no distortion compared to the extracted MS features and that the network with the FFU exhibited good stability, whereas the fused feature image produced by simply adding the feature images pixel by pixel was overexposed, which affected network learning.

Loss Function
Equations (12) and (13) were used to optimize and fit DIPNet. A comparison of rows 8, 9, and 10 in each of Tables 5-7 shows that the $D_s$ value for DIPNet was far smaller than those for the methods that use the MSE and MAE losses in image reconstruction on the QuickBird, WorldView 2, and IKONOS datasets and that the SSIM loss enhanced the image fusion performance at the original size. A comparison of the first five indices in rows 9 and 10 in Table 6 reveals that the use of the MAE loss was superior to that of the SSIM loss on the simulated WorldView 2 data. However, a comparison of the last three indices indicates that the use of the SSIM loss led to better fusion performance on the original data. Thus, based on the evaluation of the simulated and original data and by taking into account the ultimate fusion application needs and performance, the SSIM loss-based DIPNet was selected as the final method proposed in this study.
The effects of K on the fusion performance were investigated. While the architecture of the network was maintained and the SSIM loss function was used in each case under the same experimental conditions, K was set to 16, 32, 48, and 64 and L was set to 2 on all three datasets. In addition, the effects of L on the fusion performance were examined. While the architecture of the network was maintained and the SSIM loss function was used in each case under the same experimental conditions, L was set to 0, 1, 2, and 3 and K was set to 64 on all three datasets. To facilitate a numeric comparison, the best, second-best, and third-best performances in Table 8 are shown in red, green, and blue, respectively. Table 8 summarizes the quantitative results. Clearly, increasing K could effectively improve the fusion performance. However, if L was too high or too low, the performance deteriorated. Thus, by comprehensively considering the experimental results, computational expenditure, and performance on different datasets, a K of 64 and an L of 2 were selected as the parameter settings for the proposed method. Table 9 compares the number of parameters of the proposed network with those of prior networks. As shown in Tables 2-4 and 8, the configuration with K = 32 and L = 2 has fewer parameters than prior networks but achieves the same performance.
Due to the limitation of our computing resources, in the prediction and pansharpening stage, we divide the high-resolution image into small pieces of a certain size for pansharpening and then combine them into the whole image.
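The divide-and-combine step can be sketched as below. This is only an illustration under assumptions: the tile size is a free parameter (the text does not state the size used), the tiles do not overlap, the image is a 2-D list, and `fn` is a hypothetical stand-in for the trained network that returns a block of the same shape as its input.

```python
def process_in_tiles(image, tile, fn):
    """Apply fn to non-overlapping tile x tile blocks and stitch the results."""
    h, w = len(image), len(image[0])
    out = [[None] * w for _ in range(h)]
    for r0 in range(0, h, tile):
        for c0 in range(0, w, tile):
            # Blocks at the right and bottom borders may be smaller than `tile`.
            block = [row[c0:c0 + tile] for row in image[r0:r0 + tile]]
            result = fn(block)  # assumed to return a block of the same shape
            for dr, row in enumerate(result):
                out[r0 + dr][c0:c0 + len(row)] = row
    return out
```

In practice the MS tile is smaller than the PAN tile by the resolution ratio, and overlapping tiles are sometimes used to avoid visible seams; neither detail is specified in the text, so both are left out of this sketch.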
In the original-resolution evaluation experiment for QuickBird, we recorded the average running time of the different DL methods. The corresponding results are summarized in Table 10. As shown in Tables 2-4 and 8, the configuration with K = 32 and L = 2 is as fast as prior networks. Although our proposed method, which has deeper features, runs more slowly than the other DL methods because its larger number of parameters reduces the efficiency of the network, it outperforms them.
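To make the loss comparison in this section more concrete, the sketch below contrasts an SSIM-based loss with the MSE and MAE losses. It does not reproduce Equations (12) and (13): it assumes the standard SSIM definition with the usual constants (0.01 and 0.03 times the data range, squared), computes the statistics globally over a flat list of pixel values rather than over local windows, and defines the loss as 1 minus SSIM. All function names are hypothetical.

```python
def ssim_global(x, y, data_range=1.0):
    """Standard SSIM computed from global means, variances, and covariance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1 = (0.01 * data_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizes the contrast/structure term
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))


def ssim_loss(pred, target, data_range=1.0):
    """Loss that is 0 for identical images and grows as structure diverges."""
    return 1.0 - ssim_global(pred, target, data_range)


def mse_loss(pred, target):
    """Mean squared error between two equal-length pixel lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mae_loss(pred, target):
    """Mean absolute error between two equal-length pixel lists."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)
```

Unlike MSE and MAE, which penalize per-pixel differences independently, the SSIM term depends on means, variances, and covariance, which is consistent with the section's observation that it favors structural quality.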

Conclusions
This study presents a new DL-based pansharpening method referred to as DIPNet. DIPNet addresses two difficult problems, namely the need for upsampling and serial fusion at a single scale and the limited information content in multiscale fused features. In regard to preprocessing, unlike other methods that fuse upsampled MS images, DIPNet separates the frequency information contained in a PAN image and then obtains the corresponding features by convolution operations as prior information. To achieve improved fusion performance, DIPNet uses an SR module to fuse the prior PAN information and the MS features and learns nonlinear mapping relationships through the conventional encoder-decoder architecture to produce an enhanced remote sensing image. To enable the network to focus on the structural quality, the SSIM loss function is applied instead of the conventional MSE loss function to train the network, facilitating the high-quality fusion of remote sensing images. The experimental results demonstrate the superiority of DIPNet over the other methods.
Although we have achieved gratifying results, the frequency decomposition method, for which we simply use a Gaussian filter, is a common one. We did not examine how other backbones for extracting PAN features (this paper uses ResNet) affect network performance and efficiency, nor did we design a better module for pansharpening. In the near future, we will focus on a novel way to pre-extract PAN image features and on the design of a novel panchromatic image feature extraction network. For reconstruction, we will develop a new method to reconstruct an image to further improve the quality. As an application, we will also apply this method to other low-resolution satellites, such as Landsat 8 and Sentinel 2.