From Regression Based on Dynamic Filter Network to Pansharpening by Pixel-Dependent Spatial-Detail Injection

Abstract: Compared with hardware upgrading, pansharpening is a low-cost way to acquire high-quality images, which usually combines multispectral images (MS) in low spatial resolution with panchromatic images (PAN) in high spatial resolution. This paper proposes a pixel-dependent spatial-detail injection network (PDSDNet). Based on a dynamic filter network, PDSDNet constructs a nonlinear mapping of the simulated panchromatic band from low-resolution multispectral bands through filtering convolution regression. PDSDNet reduces the possibility of spectral distortion and enriches spatial details by improving the similarity between the simulated panchromatic band and the real panchromatic band. Moreover, PDSDNet assumes that if an ideal multispectral image with the same resolution as the panchromatic image exists, each band of it should have the same spatial details as the panchromatic image. Thus, the details we fill into each multispectral band are the same and they can be extracted effectively in one pass. Experimental results demonstrate that PDSDNet can generate high-quality fusion images from multispectral images and panchromatic images. Compared with BDSD, MTF-GLP-HPM-PP, and PanNet, which are widely applied on IKONOS, QuickBird, and WorldView-3 datasets, pansharpened images of the proposed method have rich spatial details and present superior visual effects without noticeable spectral and spatial distortion.


Introduction
Remote sensing sensors generate images by capturing information of electromagnetic waves reflected off the Earth's surface. However, it is arduous to obtain images with both high spatial resolution and high spectral resolution simultaneously. The energy received by the sensor is a double integral of the electromagnetic wave over space and wavelength. Generating images with higher spatial and spectral resolution means that the energy is integrated over narrower wavelength ranges and smaller areas. Consequently, the energy is weaker, resulting in poorer image quality, so only one of the two resolutions can be enhanced at a time. Thus, it is challenging to acquire high-quality images with high spectral and spatial resolution, limited by the equipment on remote sensing platforms. Compared with hardware upgrading, pansharpening is a low-cost way to make full use of the data to obtain images with high spectral and spatial resolution. Pansharpening combines multispectral images (MS) with low spatial resolution and panchromatic images (PAN) with high spatial resolution.
A promising pansharpening method should produce results that meet the following requirements:
• Spectral fidelity: The spectral information of the fusion result should be as close as possible to the spectral information of the original MS. Chromatic aberration and spectral distortion should be avoided.
• Exact spatial details: The spatial details of the fusion result should be as close as possible to the details of the original PAN. Blur, lack, and distortion of details should be avoided.
Existing pansharpening methods can be broadly grouped into component substitution (CS)-based, multi-resolution analysis (MRA)-based, and deep-learning (DL)-based methods:
• The CS-based method is a class of methods that decompose MS into spectral information and structural information, then substitute the structural information with PAN, such as intensity-hue-saturation transform (IHS) [1][2][3][4], Brovey transform (BT) [5], Gram-Schmidt transform (GS) [6,7], principal component analysis (PCA) [8,9], band-dependent spatial-detail (BDSD) [10], partial replacement adaptive component substitution (PRACS) [11], etc. A higher correlation between the PAN and the component being replaced reduces the distortion of the fused image.
• The MRA-based method is a class of methods that apply a multi-resolution decomposition to the PAN to obtain its low-frequency component, and then inject the details, i.e., the differences between the PAN and its low-frequency component, into MS. The decomposition can be based on wavelets [12], for instance, undecimated wavelet transform (UDWT) [13], decimated wavelet transform (DWT) [14,15], "à trous" wavelet transform (ATWT) [16][17][18], or not, such as Laplacian pyramid (LP) [19]. The key is to find a filter that extracts the low-frequency component; the most common choice is a filter matched to the modulation transfer function (MTF) [20][21][22].
• The deep-learning-based method [23] is a class of pansharpening methods that has developed rapidly in recent years. Deep-learning-based methods commonly build on the structure of super-resolution methods [24], such as PNN [25], DRPNN [26], and MSDCNN [27]. Some methods combine component substitution and nonlinear mapping, for example, PanNet [28], Target-PNN [29], the cross-scale learning model based on Target-PNN [30], RSIFNN [31], etc. These methods do not simply regard the output of the deep convolution network as the fusion result, but apply the deep network to learn the details that MS lacks, then add the details to the upsampled MS to generate the fusion image.
In addition, deep-learning-based methods have another branch based on generative adversarial network (GAN) [32] that combines the theory of reinforcement learning (RL), for instance, PSGAN [33], RED-cGAN [34], Pan-GAN [35], PanColorGAN [36], etc. With a two-stream structure model, PSGAN [33], which is based on TFNet [37], accomplishes fusion in the feature domain. PanColorGAN [36] is based on CS and regards pansharpening as a guided colorization task rather than a super-resolution task.
In terms of the processing pipeline, the CS-based and MRA-based methods are dedicated to solving two sub-problems, i.e., details extraction and details injection.
Regardless of details injection, as far as details extraction is concerned, both CS-based and MRA-based methods assume that the details come from the differences between the high spatial resolution PAN and the low spatial resolution PAN. The difference lies only in how the low spatial resolution PAN is acquired. In most CS-based methods, the low spatial resolution PAN is assumed to be a linear combination of low spatial resolution multispectral bands. In contrast, in MRA-based methods, the low spatial resolution PAN is assumed to be the low-frequency version of the high spatial resolution PAN.
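The two ways of obtaining the low spatial resolution PAN described above can be sketched as follows. This is a minimal illustration, not the implementation of any cited method: the function names are hypothetical, the CS branch uses an ordinary least-squares regression of PAN on the upsampled MS bands, and the MRA branch uses a plain box filter as a stand-in for an MTF-matched low-pass filter.

```python
import numpy as np

def lowres_pan_cs(ms_up, pan):
    """CS-style: least-squares linear combination of the upsampled MS bands."""
    bands = ms_up.shape[0]
    design = ms_up.reshape(bands, -1).T                 # (pixels, B)
    weights, *_ = np.linalg.lstsq(design, pan.ravel(), rcond=None)
    return np.tensordot(weights, ms_up, axes=1), weights

def lowres_pan_mra(pan, size=5):
    """MRA-style: low-pass version of PAN (box filter used as an assumption)."""
    pad = size // 2
    padded = np.pad(pan, pad, mode="reflect")
    out = np.zeros(pan.shape, dtype=float)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + pan.shape[0], dx:dx + pan.shape[1]]
    return out / size**2

# Synthetic data: PAN is built as an exact linear combination of four bands.
rng = np.random.default_rng(0)
ms_up = rng.random((4, 32, 32))
true_w = np.array([0.1, 0.3, 0.4, 0.2])
pan = np.tensordot(true_w, ms_up, axes=1)

p_lp_cs, w = lowres_pan_cs(ms_up, pan)
details_cs = pan - p_lp_cs                 # details under the CS assumption
details_mra = pan - lowres_pan_mra(pan)    # details under the MRA assumption
```

In both branches the detail image is the difference between the PAN and its low-resolution counterpart; only the construction of that counterpart differs.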
The DL-based methods directly construct a convolutional neural network (CNN) model to represent the relationship between lower spatial resolution multispectral bands and higher spatial resolution PAN by training on down-sampled data, then directly apply the CNN model to the higher spatial resolution PAN to obtain sharpened multispectral bands. The relationship constructed by the CNN model is non-linear.
In this article, we propose a novel CS-based pansharpening method that employs an adaptive filter model as the non-linear combination mapping between the low spatial resolution PAN and the low spatial resolution multispectral bands. Further, we extract the details in the same way as CS-based method by the difference between the high spatial resolution PAN and the low spatial resolution PAN, then inject details back to the low spatial resolution multispectral bands. The adaptive filter model is a pixel-dependent spatial-detail injection model. Our method combines multispectral bands to obtain a low-resolution PAN through pixel-dependent local band adaptive filter convolution. The adaptive filters of multispectral bands are generated based on a "dynamic filter network" (DFN) [38]. The DFN adopts an encoder-decoder structure to learn the location-dependent kernel and applies a separate subnet to predict the convolution filter weight at each pixel. The network learns in a supervised way and has high flexibility due to its self-adaptability.
The proposed method presents superior visual effects due to the following aspects: (1) Based on the dynamic filter network, the nonlinear mapping between the panchromatic band and the low-resolution multispectral bands is constructed through filter convolution regression. Compared with other CS-based methods, the proposed method is more reasonable. Figure 1 shows the spectral response functions of panchromatic and multispectral imagery for QuickBird sensors. Each spectral response function resembles a single-peak function similar to a normal distribution. It is therefore difficult to describe the relationship between the radiance of PAN and that of the multispectral bands with a linear combination model, so traditional CS-based methods are not completely accurate. (2) Assuming that if an ideal MS with the same resolution as PAN exists, each band of it should have the same spatial details as PAN, spatial details are acquired in the same way as in CS-based methods from the differences between the high spatial resolution PAN and the low spatial resolution PAN, and are then injected back into the low spatial resolution multispectral bands; therefore, the pansharpened images have rich spatial details. Compared with PNN-based methods, the proposed method is more explainable.
Unlike general DL-based fusion methods, the proposed method requires no extra work to produce ground truth at a reduced scale. In most DL-based image fusion methods, training datasets and ground truth need to be made artificially at a reduced scale to learn the mapping between MS, PAN, and fusion images at reduced resolution, which is then applied at the full scale.
We have introduced an overview of traditional fusion methods and methods based on deep learning. The rest of this paper is organized as follows: In Section 2, firstly, we focus on the development of the injection model and propose the pixel-dependent spatial-detail injection model (PDSDNet), then we introduce DFN and describe the adaptations that were made for the application of DFN to remote sensing. Next, we describe datasets and the process of our experiments in detail, and the results are shown in Section 3. The last Section 4 draws our conclusion.

Pixel-Dependent Spatial-Detail Network and Dynamic Filter Network
Let $MS$ represent the original multispectral image (MS) and $MS_b$ denote the b-th band of MS, where $B$ is the total number of MS bands and $b$ runs from 1 to $B$. For example, $MS_1$ represents the first band of the original MS. $\widetilde{MS}_b$ is the upsampled $MS_b$, whose size is the same as that of the original PAN. $P$ is the original panchromatic image (PAN) and $P_{LP}$ is the low-resolution PAN. $F$ is the fused image, and $F_b$ represents the b-th band of the fused image. $L_1$ and $L_2$ are the width and height of the PAN image. The resolution ratio of PAN and MS is $R$. For IKONOS, QuickBird, and WorldView-3 images, the value of the scale ratio $R$ is four.
For the CS-based spatial-detail model, we recall that Andrea Garzelli presented two linear injection models [10]. The first model is the single spatial-detail (SSD) image model, which extracts a spatial-detail image from the PAN band by subtracting a low-pass version of PAN. The low-pass version can be obtained by convolution with an MTF-shaped filter with a cutoff frequency of approximately 1/R [20] or, alternatively, by the linear regression of the overlapped multispectral bands. The second model is the band-dependent spatial-detail (BDSD) model, which adopts different detail images extracted from the PAN band to pansharpen MS depending on the particular MS band.
We present a third model, the pixel-dependent spatial-detail network (PDSDNet) model, which extracts the details from the PAN and a low-pass version of PAN. The low-resolution PAN is generated by filters that depend on the individual pixels of MS.

SSD Model & BDSD Model
In the SSD model, of which the IHS transform is a representative component substitution method, the low-resolution panchromatic image is regarded as a linear combination of the bands of the multispectral image. Fusion with this weighted approximation of PAN is described as Equation (1):

$$F_b = \widetilde{MS}_b + g_b \, (P - P_{LP}), \quad b = 1, \dots, B, \tag{1}$$

where $g_b$ is a gain parameter of the b-th band that controls the injection of the extracted details, $P_{LP}$ is the simulated panchromatic image, $P - P_{LP}$ is the spatial details that MS lacks, and $\widetilde{MS}_b$ is the b-th band of the upsampled version of the low spatial resolution MS. In the component substitution method, $P_{LP}$ is the intensity component of MS, which is the weighted combination of MS bands, as in Equation (2):

$$P_{LP} = \sum_{b=1}^{B} w_b \, \widetilde{MS}_b, \tag{2}$$

where $w_b$ is the weighting coefficient of the b-th MS band and $B$ is the number of bands. In the SSD model, the detail image is the same for all MS bands. Many CS-based pansharpening algorithms rely upon Equation (1) and differ only in how they estimate the injection coefficient $g_b$ and the weight $w_b$. In the IHS transform, $w_b = 1/B$. In the BDSD model, Equation (1) is rewritten as Equation (3):

$$F_b = \widetilde{MS}_b + g_b \Big( P - \sum_{k=1}^{B} w_{b,k} \, \widetilde{MS}_k \Big), \quad b = 1, \dots, B, \tag{3}$$

where the detail image extracted from PAN is calculated for each MS band by evaluating a band-dependent generalized intensity from the $B$ MS bands.
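The SSD and BDSD injection models can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the SSD form is taken to be the fused band equal to the upsampled band plus a gain times (PAN minus a weighted sum of bands), and the BDSD form is taken to use one weight vector per band, following [10]. The gains and weights are passed in rather than estimated.

```python
import numpy as np

def ssd_fuse(ms_up, pan, weights, gains):
    """SSD model: a single detail image P - P_LP is shared by all bands.

    ms_up: (B, H, W) upsampled MS, pan: (H, W), weights: (B,), gains: (B,).
    """
    p_lp = np.tensordot(weights, ms_up, axes=1)              # (H, W)
    return ms_up + gains[:, None, None] * (pan - p_lp)

def bdsd_fuse(ms_up, pan, weights, gains):
    """BDSD model: weights[b, k] defines a separate intensity for each band b."""
    p_lp = np.tensordot(weights, ms_up, axes=1)              # (B, H, W)
    return ms_up + gains[:, None, None] * (pan[None] - p_lp)

# IHS-like setting: equal weights 1/B and unit gains on synthetic data.
rng = np.random.default_rng(0)
bands = 4
ms_up = rng.random((bands, 8, 8))
pan = rng.random((8, 8))
fused_ssd = ssd_fuse(ms_up, pan, np.full(bands, 1 / bands), np.ones(bands))
fused_bdsd = bdsd_fuse(ms_up, pan, np.full((bands, bands), 1 / bands), np.ones(bands))
```

When every row of the BDSD weight matrix equals the SSD weight vector, the two models coincide, which is the sense in which BDSD generalizes SSD.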

PDSDNet Model
We propose a non-linear spatial-detail model in which the detail image is extracted for each pixel from PAN. The detail image is calculated by evaluating a pixel-dependent generalized intensity from the $B$ MS bands, as in Equation (4):

$$P_{LP,(x,y)} = \sum_{b=1}^{B} \big( DF_{b,(x,y)} * \widetilde{MS}_b \big)(x, y), \tag{4}$$

where $(x, y)$ denotes the position of the pixel in the image, $*$ denotes the convolution operator, and $DF_{b,(x,y)}$ denotes the adaptive convolution kernel (filter) depending on MS band $b$ and pixel position, which is obtained by a dynamic filter network (DFN). Only the MS bands that overlap with the PAN band participate in the calculation. For each band, the convolution is carried out through a sliding window whose size is the same as the convolution kernel. According to the model of CS-based methods, the pansharpened image $F_b$ is equal to the sum of the upsampled MS band $\widetilde{MS}_b$ and the injected details $P - P_{LP,(x,y)}$. Here we omit the injection coefficients, considering that the detail image is already pixel-dependent through the adaptive filter network; therefore, PDSDNet can be summarized as Equation (5):

$$F_{b,(x,y)} = \widetilde{MS}_{b,(x,y)} + P_{(x,y)} - P_{LP,(x,y)}. \tag{5}$$

In the low-resolution PAN $P_{LP,(x,y)}$, the parameters of the adaptive filters $DF_{b,(x,y)}$ can be learned from a large-scale training dataset by approximating $P_{LP,(x,y)}$ to $P$. Accordingly, the loss function of the network for simulating PAN is designed to measure the similarity of the ground truth $P$ and the result of the network $P_{LP,(x,y)}$. The loss function is described as Equation (6):

$$Loss = \frac{1}{N} \sum_{k=1}^{N} \big\| P_{LP}^{\{k\}} - P^{\{k\}} \big\|_F^2, \tag{6}$$

where $N$ represents the number of training examples, $\| \cdot \|_F$ is the Frobenius norm, and $P^{\{k\}}$ is the k-th example PAN extracted from the ground truth image. The network for simulating PAN is trained by minimizing this loss, and the pansharpened image is then calculated by Equation (5).
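The pixel-dependent filtering and injection steps can be illustrated with a small NumPy sketch. It is not the authors' code: the function names are hypothetical, the per-pixel kernels are passed in as arrays instead of being predicted by a network, the kernel is applied without flipping (as is usual in deep learning frameworks), reflect padding at the borders is an assumption, and the loss is assumed to be the mean of the squared Frobenius norms over the examples.

```python
import numpy as np

def simulate_pan(ms_up, k_vert, k_horz):
    """Simulated PAN from per-pixel separable kernels.

    ms_up: (B, H, W); k_vert, k_horz: (B, H, W, k) 1-D kernels per band and pixel.
    """
    bands, height, width = ms_up.shape
    k = k_vert.shape[-1]
    pad = k // 2
    padded = np.pad(ms_up, ((0, 0), (pad, pad), (pad, pad)), mode="reflect")
    p_lp = np.zeros((height, width))
    for i in range(k):                       # vertical offset inside the window
        for j in range(k):                   # horizontal offset inside the window
            shifted = padded[:, i:i + height, j:j + width]
            p_lp += np.sum(k_vert[..., i] * k_horz[..., j] * shifted, axis=0)
    return p_lp

def pdsd_fuse(ms_up, pan, p_lp):
    """Add the same detail image P - P_LP to every upsampled MS band."""
    return ms_up + (pan - p_lp)[None]

def simulation_loss(p_lp_batch, pan_batch):
    """Mean over N examples of the squared Frobenius norm of the error."""
    diff = p_lp_batch - pan_batch
    return float(np.mean(np.sum(diff**2, axis=(1, 2))))

# Sanity check with hand-made kernels: a centred delta scaled by 1/B per band
# makes the simulated PAN equal to the band average.
rng = np.random.default_rng(0)
ms_up = rng.random((4, 6, 7))
k_vert = np.zeros((4, 6, 7, 5))
k_horz = np.zeros((4, 6, 7, 5))
k_vert[..., 2] = 1.0
k_horz[..., 2] = 0.25
p_lp = simulate_pan(ms_up, k_vert, k_horz)
```

In the actual method the two kernel arrays would be the outputs of the dynamic filter network for each input image, so they change from pixel to pixel and from sample to sample.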

Dynamic Filter Network (DFN)
The adaptive filters {DF b,(x,y) } are obtained locally and dynamically depending on the input images by the dynamic filter generation network [38]. In DFN, parameters consist of model parameters and dynamically generated parameters. Model parameters, that is, the layer parameters, are initialized in advance and only updated during training. When training is finished, model parameters are fixed and are the same for all test samples. Dynamically generated parameters do not need to be initialized and are generated on the fly for each sample. In our method, the dynamically generated parameters are the dynamic filters. The filter generating network dynamically outputs the generated parameters, while its own parameters are part of the model parameters. Dynamic filters of our method are implemented by generating two convolution kernels per pixel instead of sharing a convolution kernel across the full image, thus enhancing the adaptability of the network.
The filter generation network is shown in Figure 2. The encoder-decoder block is used as the main component, which includes four units: convolution, pooling, upsampling, and subnetwork. The convolution unit is composed of three convolution layers alternating with activation layers. The pooling unit is an average pooling layer. The upsampling unit consists of an upsampling layer, convolution layers, and an activation layer. Upsampling is performed by bilinear interpolation. The convolution unit and the upsampling unit constitute the subnetwork unit. The network takes the B bands of MS as input and outputs B groups of filters for simulating PAN. MS is convolved with these filters to obtain the simulated PAN. Inspired by [39][40][41], our adaptive filters DF b,(x,y) consist of two separable single-dimensional convolution kernels for each band b and each pixel: a vertical kernel and a horizontal kernel. If the size of the vertical kernel and the horizontal kernel is k, DF b,(x,y) acts on the k × k patch centered at (x, y).

Implementation Details
We chose three datasets from different satellites for the experiments, as listed in Table 1. Before the experiment, the data need to be preprocessed. Firstly, the 256 × 256 MS is upsampled to 1024 × 1024, the same size as PAN. Secondly, MS and PAN are normalized with Z-Score normalization (zero-mean normalization), which makes the mean of the images 0 and the standard deviation 1. Thirdly, the data are randomly divided into two parts, 90% for the training stage and 10% for the testing stage. Further, the data for the training stage are cropped into 128 × 128 patches in two ways. One is sequential cropping with a step size of 64, which yields 15 × 15 = 225 patches per image, and the other is random cropping, which randomly selects 100 points on the image as patch centers. Thus, 325 patches of size 128 × 128 are produced from each 1024 × 1024 image. Finally, 20% of the patches are used for validation and 80% for training during the training process.
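The patch counts implied by this cropping scheme can be checked with a short calculation. The sketch below uses hypothetical helper names and takes the dataset sizes (200 IKONOS, 500 QuickBird, and 160 WorldView-3 images) from the dataset description in this paper; it only reproduces the arithmetic and does not perform any cropping.

```python
def sequential_origins(image_size=1024, patch=128, step=64):
    """Top-left offsets along one axis for patches cropped with a fixed step."""
    return list(range(0, image_size - patch + 1, step))

def patches_per_image(image_size=1024, patch=128, step=64, random_centers=100):
    """Sequential grid patches plus the randomly centred patches."""
    per_axis = len(sequential_origins(image_size, patch, step))
    return per_axis * per_axis + random_centers

def split_counts(n_images, test_fraction=0.1, val_fraction=0.2):
    """Return (test images, validation patches, training patches)."""
    n_test = round(n_images * test_fraction)
    n_patches = (n_images - n_test) * patches_per_image()
    n_val = round(n_patches * val_fraction)
    return n_test, n_val, n_patches - n_val

for name, n_images in [("IKONOS", 200), ("QuickBird", 500), ("WorldView-3", 160)]:
    print(name, split_counts(n_images))
```

The resulting counts agree with the numbers of test, validation, and training samples reported for each dataset in the results sections.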
The process is illustrated in Figure 3. The experiments were implemented in PyTorch, a common deep learning framework. The input of the network is the upsampled and normalized MS, which has four bands, and the output is the simulated PAN, the reference of which is the normalized PAN. We set the batch size to 20 and the size of the dynamic filter kernels to 5. The convolution kernel of the convolution layers is 3 × 3 with stride 1 and padding 1. The kernel of the average pooling layer is 2 × 2 with stride 2 in the encoding process. Correspondingly, the scale factor of the upsampling layer is 2 in the decoding process. The optimizer is Adam with an initial learning rate of 0.001. The loss function is the mean-square error (MSE) as in Equation (6).

Experiments and Results
To assess the performance of the proposed method, we have carried out multiple experiments with real-world multi-resolution images, exploring a wide range of situations. We consider the typical case in which training and test data are acquired with the same sensor but come from different scenes. Three state-of-the-art algorithms are employed for comparison: BDSD [10], MTF-GLP-HPM-PP [22], and PanNet [28]. BDSD is one of the usual CS methods [42] and is an accurate linear injection model in the minimum mean-square-error (MMSE) sense. BDSD extracts details by evaluating a band-dependent generalized intensity from the MS bands. CS-based methods are inclined to generate spectral distortion but usually have no obvious spatial distortion. MTF-GLP-HPM-PP is one of the effective MRA methods [42]; it is based on a generalized Laplacian pyramid (GLP) [43] with a modulation transfer function (MTF)-matched filter [20], a multiplicative injection model [44], and post-processing [22]. MRA-based methods are prone to spatial distortion but generally have little spectral distortion. PanNet is a fusion method based on deep learning, which trains network parameters in the high-pass filtering domain rather than the image domain. PanNet learns the details from the high-frequency information of MS and PAN, then adds the upsampled MS to the network output. These methods have a common pre-processing step, applying MTF to process the image. This step is not performed in PDSDNet.

Datasets for Experiments
The data of our experiments were selected from a large-scale publicly available benchmark dataset for pansharpening [42]. HR PAN images are 1024 × 1024 and LR MS images are 256 × 256. In addition, the geometrical registration of the datasets has been performed in [42]. These data contain a variety of features such as urban areas, green vegetation, water scenes, and mixed features. Considering satellites, spatial resolution, and number of bands, we chose datasets of three satellites from [42]. The datasets consist of 200 IKONOS, 500 QuickBird, and 160 WorldView-3 image patches in different spatial resolutions. As Table 1 shows, the multispectral images (MS) of IKONOS and QuickBird have four bands while those of WorldView-3 have eight. Table 2 shows the wavelength of the bands of the different satellite sensors. "Pan" corresponds to the band range of the panchromatic image, and "Coastal" to "NIR2" are the first to the eighth band of the multispectral image of WorldView-3. "Blue", "Green", "Red", and "NIR" are the first to the fourth band of the multispectral image of IKONOS and QuickBird. The four bands of the IKONOS MS overlap with PAN, and they were all put into the network for simulating PAN; the same holds for QuickBird. It is important to note that our method utilizes only six bands of WorldView-3, from Blue to NIR, for simulating PAN because the ranges of Coastal and NIR2 have no overlap with PAN; however, in other cases, the data of WorldView-3 are processed with eight bands. Since we assume that if an ideal MS with the same resolution as PAN exists, each band of it should have the same spatial details as PAN, the details obtained from the simulated PAN and the original PAN are added to all eight bands, not six.
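The band handling for WorldView-3 can be sketched as follows. This is an illustration only: the band names in `WV3_BANDS` follow the commonly used WorldView-3 naming and are an assumption here (the paper's Table 2 is the authoritative list), `inject_details` is a hypothetical helper, and `simulate` stands for any function that maps the overlapping bands to a simulated PAN (a plain band average is used as a placeholder for the trained network).

```python
import numpy as np

# Assumed band order, first to eighth band of the WorldView-3 MS image.
WV3_BANDS = ["Coastal", "Blue", "Green", "Yellow", "Red", "Red Edge", "NIR1", "NIR2"]
PAN_OVERLAP = slice(1, 7)      # second to seventh band: the six bands used for simulation

def inject_details(ms_up, pan, simulate):
    """Simulate PAN from the overlapping bands, then add details to all bands."""
    p_lp = simulate(ms_up[PAN_OVERLAP])        # only six bands enter the simulation
    return ms_up + (pan - p_lp)[None]          # details are added to all eight bands

rng = np.random.default_rng(0)
ms_up = rng.random((8, 16, 16))
pan = rng.random((16, 16))
fused = inject_details(ms_up, pan, lambda bands: bands.mean(axis=0))
```

The output keeps all eight bands, including Coastal and NIR2, which receive the same detail image as the six bands used for the simulation.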

Evaluation Indexes
In addition to visual evaluation, we have chosen five indexes of two types for quality assessment. One type is full-resolution assessment, which infers the quality of the pansharpened image at the scale of the PAN image without resorting to a reference image [45]. The quality with no reference (QNR) index [46], the spectral distortion $D_\lambda$, and the spatial distortion $D_S$ belong to this type. QNR measures spectral and spatial consistencies by calculating mutual similarities between any pair of MS bands and between each MS band and PAN. The consistencies are assumed to be unchanged on average before and after fusion. The other type is reduced-resolution assessment, which measures the similarity of the fused product and an ideal reference, i.e., the original MS. The spectral angle mapper (SAM) [47] and the spatial correlation coefficient (sCC) [48] are chosen for this purpose.
QNR, SAM, and sCC are calculated as in Table 3, where the symbols used have the same meaning as in Section 2. $MS$ is the multispectral image (MS), $P$ is the panchromatic image (PAN), $P_{LP}$ is the low-resolution PAN, $F$ is the fused image, and $R$ is the reference image. $B$ is the total number of MS bands, and $F_{i,b}$ represents the value at position $i$ in the b-th band of the fused image, where $b$ is an integer from 1 to $B$. When the correlation coefficient between PAN and the fusion image or MS is calculated, PAN, which has only one band, is replicated $B$ times to obtain $B$ bands for convenience of calculation, so $P_{i,b}$ is the same for any integer $b$ between 1 and $B$. A hat, as in $\hat{F}$, denotes the image processed through a high-pass filter. $L_1$ and $L_2$ are the width and height of an image. $\sigma_I$ represents the standard deviation of image vector $I$, $\sigma_{IJ}$ is the covariance of $I$ and $J$, and $\bar{I}$ denotes the mean value of $I$.
Note that QNR is a single quality index that combines the spectral distortion $D_\lambda$ and the spatial distortion $D_S$ as in Equation (7):

$$QNR = (1 - D_\lambda)^{\alpha} \, (1 - D_S)^{\beta}, \tag{7}$$

where usually $\alpha = \beta = 1$, and $QNR \in [0, 1]$ with 1 being the best attainable value. The spectral distortion $D_\lambda$ and the spatial distortion $D_S$ are both calculated through the universal image quality index (UIQI) [49], defined as Equation (8):

$$UIQI(I, J) = \frac{\sigma_{IJ}}{\sigma_I \sigma_J} \cdot \frac{2 \bar{I} \bar{J}}{\bar{I}^2 + \bar{J}^2} \cdot \frac{2 \sigma_I \sigma_J}{\sigma_I^2 + \sigma_J^2}. \tag{8}$$

UIQI is a comprehensive index of the similarity between two images, which combines three factors: loss of correlation, luminance distortion, and contrast distortion. A UIQI value of 1 means that the image has the best fidelity to the reference image.
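The full-resolution indexes can be sketched in NumPy as below. This is a simplified illustration rather than the evaluation code used in the experiments: the function names are hypothetical, UIQI is computed once over the whole image (the original index [49] averages it over sliding windows), and the distortion indexes follow the commonly used definitions from [46] with exponents `p` and `q` assumed to be 1.

```python
import numpy as np

def uiqi(a, b):
    """Global universal image quality index of two images with non-zero variance."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    mean_a, mean_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = np.mean((a - mean_a) * (b - mean_b))
    return 4 * cov * mean_a * mean_b / ((var_a + var_b) * (mean_a**2 + mean_b**2))

def d_lambda(fused, ms, p=1):
    """Spectral distortion: change of inter-band UIQI between MS and fused image."""
    bands = fused.shape[0]
    total = 0.0
    for b in range(bands):
        for c in range(bands):
            if b != c:
                total += abs(uiqi(fused[b], fused[c]) - uiqi(ms[b], ms[c])) ** p
    return (total / (bands * (bands - 1))) ** (1 / p)

def d_s(fused, ms, pan, pan_lp, q=1):
    """Spatial distortion: change of band-to-PAN UIQI across the two scales."""
    bands = fused.shape[0]
    total = sum(abs(uiqi(fused[b], pan) - uiqi(ms[b], pan_lp)) ** q for b in range(bands))
    return (total / bands) ** (1 / q)

def qnr(d_lambda_value, d_s_value, alpha=1.0, beta=1.0):
    """QNR as the product of the two complements, as in Equation (7)."""
    return (1 - d_lambda_value) ** alpha * (1 - d_s_value) ** beta
```

With these definitions, a fused image that preserves all inter-band and band-to-PAN similarities gives both distortions equal to 0 and therefore a QNR of 1.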
Each index in Table 3 is described as follows:
• $D_\lambda$ represents the spectral distortion of the image. $D_\lambda$ compares the inter-band UIQI values of the fused image with those of the reference image. Smaller $D_\lambda$ means smaller spectral distortion of the fused image, as is the case for SAM. If $D_\lambda$ is 0, the fused image has no spectral distortion.
• $D_S$ represents the spatial distortion of the image. $D_S$ compares the UIQI values between the bands of the fused image and PAN with those of the original images. Smaller $D_S$ represents smaller spatial distortion. If $D_S$ is 0, the fusion image has no spatial distortion.
• QNR stands for the quality with no reference index and measures the quality of full-resolution fused images. QNR is the combination of $D_\lambda$ and $D_S$. Bigger QNR denotes better image quality. When both $D_\lambda$ and $D_S$ are 0, QNR is 1, which is the best attainable quality of the fusion image.
• SAM measures the spectral angle between the fused image and the reference image. Smaller spectral distortion corresponds to smaller SAM, and a SAM of 0 indicates no spectral distortion. SAM is expressed in radians in our indexes.
• sCC calculates the spatial correlation coefficient between the fusion image and PAN. The spatial details of PAN and the fused image are obtained by high-pass filtering, for example with the Sobel operator that detects horizontal edges. A stronger correlation leads to bigger sCC. The optimal value of sCC is 1.

Table 3. Quality indexes.
Index | Equation | Meaning
$D_\lambda$ | see [46] | the smaller the better
$D_S$ | see [46] | the smaller the better
QNR | Equation (7) | the bigger the better
SAM | see [47] | the smaller the better
sCC | see [48] | the bigger the better
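The two reduced-resolution indexes can be sketched as follows. This is a hedged illustration with hypothetical function names: SAM is taken as the mean per-pixel angle between the spectral vectors in radians, and sCC is taken as the correlation coefficient between the high-pass fused bands and the high-pass PAN replicated over the bands, with a 3 × 3 Sobel kernel for horizontal edges assumed as the high-pass filter.

```python
import numpy as np

def sam(fused, reference, eps=1e-12):
    """Mean spectral angle in radians; inputs have shape (B, H, W)."""
    dot = np.sum(fused * reference, axis=0)
    norms = np.linalg.norm(fused, axis=0) * np.linalg.norm(reference, axis=0)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)
    return float(np.mean(np.arccos(cos)))

def sobel_horizontal(img):
    """3 x 3 Sobel response for horizontal edges, computed on the valid region."""
    top = img[:-2, :-2] + 2 * img[:-2, 1:-1] + img[:-2, 2:]
    bottom = img[2:, :-2] + 2 * img[2:, 1:-1] + img[2:, 2:]
    return bottom - top

def scc(fused, pan):
    """Correlation between high-pass fused bands and high-pass PAN."""
    hp_pan = sobel_horizontal(pan).ravel()
    hp_fused = np.stack([sobel_horizontal(band).ravel() for band in fused])
    hp_pan_rep = np.tile(hp_pan, (fused.shape[0], 1))   # PAN replicated B times
    return float(np.corrcoef(hp_fused.ravel(), hp_pan_rep.ravel())[0, 1])
```

Because SAM depends only on the direction of the spectral vectors, scaling every band of the fused image by the same factor leaves it unchanged, which is why it is read as a measure of spectral rather than radiometric distortion.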

Results of IKONOS Dataset
The original IKONOS dataset contains 200 images. After pre-processing, 20 pairs of 1024 × 1024 test images, 11,700 pairs of 128 × 128 validation images, and 46,800 pairs of 128 × 128 training images are produced. The details of pre-processing are described in Section Implementation Details. Figure 4 visualizes the results of BDSD, MTF-GLP-HPM-PP, PanNet, and our method, and Figure 5 shows subsets of Figure 4 to reveal local details. Each image in Figure 4 is a 400 × 400 region clipped from one of the 1024 × 1024 test images, and the 50 × 50 patches in Figure 5 show more details. Figure 4 shows that our method achieves promising visual results in both the spectral and spatial dimensions, comparable to those of the MTF-GLP-HPM-PP method and PanNet. The results of MTF-GLP-HPM-PP, PanNet, and our method have no evident spectral distortion, but some blurred regions exist in the results of BDSD and PanNet, especially in the subsets of the second and fourth rows of (c) and (e) in Figure 5. In comparison, our method and MTF-GLP-HPM-PP show distinct spatial details, particularly at the edges of bright objects in the subsets. Table 4 and Figure 6 give the values of the quality indexes of the results of the different methods. The average value of the quality indexes of these images is shown in Table 4; the dispersion of the values of the quality indexes is shown by the boxplot in Figure 6. The boxplot represents the dispersion of the data through the maximum, minimum, median, and upper and lower quartiles of the quality indexes of the test images. The length of the box shows the interquartile range (IQR), and a smaller range means more concentrated data. The line in the box is the median, which represents the general state of the data. Moreover, the '×' marks the position of the mean value, which corresponds to the value in Table 4. The upper and lower edge lines mark the maximum and minimum.
For the IKONOS dataset, the visual effects of MTF-GLP-HPM-PP and our method are the best; however, the conclusion differs when it comes to quantitative indicators. On the no-reference indexes, the results of BDSD have the best performance, as shown in the first row of Figure 6. As Table 4 and Figure 6 show, the results of PanNet and our method are similar and better than those of MTF-GLP-HPM-PP. As for the reference-based indexes, PanNet shows better performance than the other methods, while BDSD, MTF-GLP-HPM-PP, and our method perform similarly. The results demonstrate that there is a gap between visual evaluation and quantitative evaluation.

Results of QuickBird Dataset
The original QuickBird dataset contains 500 images. After pre-processing, 50 pairs of 1024 × 1024 test images, 29,250 pairs of 128 × 128 validation images, and 117,000 pairs of 128 × 128 training images are generated. Figures 7 and 8 visualize the results of the different methods on the QuickBird test dataset. The third, the second, and the first band of the four bands of the QuickBird test results were chosen as RGB channels for visualization. The ranges of the bands are shown in Table 2. In Figures 7 and 8, (a,b) are the original PAN and MS, and (c-f) are the fusion results of BDSD, MTF-GLP-HPM-PP, PanNet, and our method. Every image in Figure 7 is a 400 × 400 patch clipped from one of the 1024 × 1024 QuickBird test images or results, and the 50 × 50 patches in Figure 8 show more details. Figure 7 shows that our method also achieves superior visual results on the QuickBird test images. Our method's results still have no apparent spectral or spatial distortion. Figure 8 shows that our method's results have sharper outlines than those of the other methods, specifically in the samples in the first and third rows. The patches of PanNet are somewhat blurred, and the zebra crossing in the samples in the fourth row cannot be distinguished easily. Table 5 and Figure 9 give the values of the quality indexes of the results of the different methods. The average of the quantitative indicators over all QuickBird test images is shown in Table 5. Figure 9 gives the boxplot of the values of the quality indexes. Compared with the performance on IKONOS, the results of our method on QuickBird are closer to those of the MTF-GLP-HPM-PP method. BDSD and PanNet show similar performance on the no-reference indexes. Although PanNet achieves the best results on almost all indicators in Figure 9, its visual effects do not match those of MTF-GLP-HPM-PP and our method.

Results of WorldView-3 Dataset
In pre-processing, 16 pairs of 1024 × 1024 test images, 9360 pairs of 128 × 128 validation images, and 37,440 pairs of 128 × 128 training images are produced from the 160 original WorldView-3 images. Figures 10 and 11 visualize the results of the different methods on the WorldView-3 test dataset. To facilitate the comparison of visual effects, the fifth, third, and second band of the eight bands of WorldView-3 were combined as RGB channels for display. Figure 10 shows a 400 × 400 area of one of the 1024 × 1024 WorldView-3 test images, and Figure 11 provides the local patches. As shown in Figure 10, our method also achieves outstanding visual results on the WorldView-3 test images. In line with the visual effects on the IKONOS and QuickBird datasets above, our method's results have no evident spectral or spatial distortion; however, the spectral preservation ability of the MTF-GLP-HPM-PP method decreases. The results of BDSD suffer from severe spectral distortion, such as on the buildings in the BDSD result in (c) of Figure 10. Many speckles and over-saturated points arise over the whole image (c). As shown in Figure 11, the results of the BDSD method have some slightly mixed colors, for example, yellow or gray, especially in the first row and the third row of (c). Table 6 and Figure 12 give the values of the quality indexes of the results of the different methods. The values of the quantitative indicators of the 16 WorldView-3 test images can be observed in Figure 12. The average over the 16 images is shown in Table 6. As on IKONOS and QuickBird, the results of our method on WorldView-3 are close to those of MTF-GLP-HPM-PP, but our method performs better on WorldView-3 than on the other two datasets. Compared with IKONOS and QuickBird, the WorldView-3 dataset has eight bands rather than four.
When the number of bands increases, the quantitative indicators of PanNet's results are still ahead overall; however, the gap between the performance of PDSDNet and PanNet decreases, while the gap between MTF-GLP-HPM-PP and PanNet increases. The difference between the performance of MTF-GLP-HPM-PP and PDSDNet also increases. This suggests that whether all bands are used for simulating the low-resolution PAN influences the fusion results when some multispectral bands have no overlap with the panchromatic band in wavelength. Since only six bands participate in learning the simulated PAN in PDSDNet while the details are added to all bands, we select the eighth, third, and first band for a false color composite (FCC) to observe the effect in Figure 13. The eighth band and the first band were not involved in simulating PAN. The details are shown in Figure 14: the first row, from left to right, is PAN, the FCC of MS, and the FCC of the PDSDNet result, and the second and third rows are four groups of local details of the first-row images. No obvious spectral or spatial distortion appears in the visualization, although the two bands have not participated in the network that produces the filters.

Visualization of Filters
To validate whether the adaptive filter network needs to generate a set of convolution kernels for each pixel, instead of sharing convolution kernels across the full image, we select a test image and output the filters produced by the simulating network. For ease of visualization, we multiply the two one-dimensional filters together so that each pixel corresponds to one 5 × 5 kernel, and then tile the kernels according to pixel location. A heat map of the resulting matrix was drawn, and a 250 × 250 area of the map, which corresponds to a 50 × 50 local test patch, is shown in Figure 15. Deeper red represents larger values, and deeper yellow denotes smaller values; values close to the middle of the distribution of the 250 × 250 data are shown in white. Figure 15 makes it clear that the filters of different pixels differ significantly and that these differences are related to the edges of objects.
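The tiling step can be sketched as follows. This is a minimal illustration that assumes the two one-dimensional filters of every pixel are available as arrays of shape (H, W, 5); the names `vertical`, `horizontal`, and `tile_pixel_kernels` are hypothetical, and random values stand in for the network output.

```python
import numpy as np

def tile_pixel_kernels(vertical, horizontal):
    """Tile per-pixel separable filters into one 2-D map.

    vertical, horizontal: (H, W, k) arrays, one 1-D filter pair per pixel.
    Returns an (H*k, W*k) array whose block [i*k:(i+1)*k, j*k:(j+1)*k]
    is the k x k kernel of pixel (i, j).
    """
    h, w, k = vertical.shape
    # Outer product per pixel: kernels[i, j, a, b] = vertical[i, j, a] * horizontal[i, j, b]
    kernels = np.einsum("ija,ijb->ijab", vertical, horizontal)
    # Reorder axes to (i, a, j, b) so the reshape puts each kernel at its pixel location.
    return kernels.transpose(0, 2, 1, 3).reshape(h * k, w * k)

rng = np.random.default_rng(0)
v = rng.random((50, 50, 5))    # placeholder vertical filters for a 50 x 50 patch
hz = rng.random((50, 50, 5))   # placeholder horizontal filters
kernel_map = tile_pixel_kernels(v, hz)
print(kernel_map.shape)        # (250, 250)
```

A 50 × 50 patch with 5 × 5 kernels yields the 250 × 250 matrix; a heat map of it could then be drawn with any diverging color map.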

Comparative Analysis of Methods
We have conducted experiments on the IKONOS, QuickBird, and WorldView-3 datasets with BDSD, MTF-GLP-HPM-PP, PanNet, and PDSDNet. The results of the four methods on the three datasets have been visualized and evaluated quantitatively with five indexes. Figures 4, 7, and 10 give an overall view of 400 × 400 areas, in which spectral preservation can be observed. Figures 5, 8, and 11 focus on the local details of 50 × 50 areas. Figures 6, 9, and 12 show the distribution of the quantitative indicators of the results, and Tables 4-6 show their average values. Figures 13 and 14 show the performance of the proposed method on the WorldView-3 dataset in the Coastal and NIR2 bands, which are not involved in simulating the panchromatic image. Figure 15 visualizes the filters that the proposed method generates for part of the pixels of one band.
Across the three datasets, the visual effect of our method is promising, with no conspicuous spectral distortion. This indicates that the simulated PAN obtained by combining the MS bands is spectrally close to the real PAN. The proposed method assumes that, if the panchromatic and multispectral images had the same resolution, the panchromatic band would have the same details as each multispectral band; therefore, the spatial detail obtained from the real PAN and the simulated PAN and injected into the upsampled MS is identical for every band. The visualization of the two bands of the eight-band dataset that were not used in simulating PAN presents as much detail as the other bands, without spatial distortion, which is consistent with this assumption. The local patches also show no evident spatial distortion. Compared with the other methods, the proposed method has a visual advantage.
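The same-detail assumption can be made concrete with a minimal sketch. It assumes, for illustration only, that the detail is the difference between the real PAN and the simulated PAN and that it is added to every upsampled MS band with unit gain; the exact injection formula is not restated in this section, and the names `inject_details`, `ms_up`, `pan`, and `pan_sim` are hypothetical.

```python
import numpy as np

def inject_details(ms_up, pan, pan_sim):
    """Add one shared detail image to every band of the upsampled MS.

    ms_up: (H, W, B) upsampled multispectral image.
    pan, pan_sim: (H, W) real and simulated panchromatic bands.
    Assumption: detail = pan - pan_sim, injected with unit gain.
    """
    details = pan - pan_sim                   # extracted once, shared by all bands
    return ms_up + details[..., np.newaxis]   # broadcast over the band axis

rng = np.random.default_rng(0)
ms_up = rng.random((32, 32, 8))   # synthetic stand-in for an eight-band upsampled MS
pan = rng.random((32, 32))
pan_sim = rng.random((32, 32))
fused = inject_details(ms_up, pan, pan_sim)
```

Under this assumption, the difference between the fused image and the upsampled MS is the same in every band, including bands that took no part in simulating PAN.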
In terms of quantitative indicators, the results of the proposed method are not the best among the four methods on the three datasets, but most of the time there is no significant difference between our method and the best or second-best method. The quantitative performance may depend on what each index measures. Although the results of BDSD are blurred in blocks, BDSD performs well on the no-reference quantitative indexes on the IKONOS and QuickBird datasets. The quantitative indicators of PanNet's results are good, but its results lack detail and have visibly blurred edges. This indicates that quantitative indexes can be inconsistent with the visual effect. Whether a method is locally or globally optimal, and whether an index is computed by averaging values over local areas or measured per pixel, may affect how the method is evaluated by that index.

Conclusions
In this paper, a pixel-dependent spatial-detail injection network (PDSDNet) is proposed. Based on the dynamic filter network, PDSDNet constructs the nonlinear mapping of the simulated panchromatic band from the low-resolution multispectral band through filter convolution regression. On the one hand, PDSDNet reduces the possibility of spectral distortion and spatial distortion by improving the similarity between the simulated panchromatic band and the real panchromatic band. On the other hand, the pansharpened images have rich spatial details, assuming that the panchromatic band has the same spatial details as each multispectral band.
The experimental results show that PDSDNet can generate high-quality fused images from multispectral and panchromatic images. The comparison between the proposed method and the widely used pansharpening methods BDSD, MTF-GLP-HPM-PP, and PanNet on the IKONOS, QuickBird, and WorldView-3 datasets demonstrates that the proposed network presents superior visual effects without noticeable spectral or spatial distortion. The quality indexes show that PDSDNet only reaches a level close to that of MTF-GLP-HPM-PP and is not the best of the compared methods.
Our experimental results demonstrate that the quantitative indicators used in the evaluation do not match the expected visual evaluation outcomes. This is not a novel finding: devising new quantitative indicators that match human perceptual assessment is still an open research problem in image fusion.
Although the proposed method has superior visual fusion performance, its quantitative results are not outstanding. In this paper, the quantitative indicators are applied to the fused images, whereas the optimization target of PDSDNet is the simulated panchromatic band. In future work, we will explore establishing a pixel-dependent spatial-detail injection model on dynamic filter networks that obtains both promising visual and quantitative results.
Author Contributions: Conceptualization, X.L. and P.T.; methodology, X.L. and P.T.; software, X.L. and X.J.; data curation, X.L. and Z.Z.; writing-original draft preparation, X.L.; writing-review and editing, P.T. and Z.Z.; visualization, X.L.; supervision, P.T. and Z.Z. All authors have read and agreed to the published version of the manuscript.