DPAFNet: A Multistage Dense-Parallel Attention Fusion Network for Pansharpening

Abstract: Pansharpening is the technology of fusing a low spatial resolution multispectral (MS) image with its associated full spatial resolution panchromatic (PAN) image. However, previous methods suffer from insufficient feature expression and do not explore both the intrinsic features of the images and the correlation between images, which may lead to limited integration of valuable information in the pansharpening results. To this end, we propose a novel multistage Dense-Parallel attention fusion network (DPAFNet). The proposed parallel attention residual dense block (PARDB) module can focus on the intrinsic features of MS images and PAN images while exploring the correlation between the source images. To fuse as much complementary information as possible, the features extracted from each PARDB are fused at multiple stages, which allows the network to better focus on and exploit different information. Additionally, we propose a new loss, which calculates the L2-norm between the pansharpening results and PAN images to constrain the spatial structures. Experiments were conducted on simulated and real datasets, and the evaluation results verified the superiority of DPAFNet.


Introduction
With the recent launch of many high-resolution (HR) optical satellites, a wealth of spatial and temporal information retrieved by satellites can be used in various research applications such as object detection, ecological protection, and environmental monitoring [1][2][3][4]. Many of these applications require the highest possible spatial and spectral resolution to achieve better results. However, due to the physical limitations of sensor technology, it is challenging to capture both high-resolution spectral and spatial information with a single sensor [5]. Usually, the sensor captures data in two forms: one is the HR panchromatic (PAN) image and the other is the low-resolution multispectral (LRMS) image. Pansharpening compensates the spatial details of the high-resolution panchromatic (PAN) image into the low-resolution multispectral (LRMS) image, thereby obtaining a multispectral image with high spatial resolution.
In recent years, deep learning-based methods have attracted a lot of attention in computer vision and image processing [6,7], such as image fusion [8], super-resolution [9], image denoising [10], image rain and fog removal, and image restoration [11]. For remote sensing images, several researchers have employed convolutional neural networks (CNNs) for pansharpening, since they can effectively extract multi-level features automatically. Most existing methods employ supervised learning (SL) to achieve pansharpening since the HRMS can be constrained by the Ground Truth (GT), such as PNN [12], BAM [13], and MC-JAFN [14]. Nevertheless, on the one hand, the HRMS will exhibit artifacts because the GT is synthesized manually. On the other hand, the source images will suffer from spatial degradation due to the reduction of resolution. Consequently, various unsupervised learning-based (UL) methods have been developed for the pansharpening task, such as [15][16][17][18][19]. Typically, these methods are designed as an encoder-decoder, where the former extracts multi-level features, whereas the latter reconstructs the HRMS. Moreover, some attention mechanisms have been proposed to further focus on the primary spatial and spectral features in the encoder. However, these methods have the following problems: (1) Since the low-resolution spectral range of LRMS and the spatial details of PAN images are significantly different, it is difficult to adaptively fuse the spatial details of PAN into all bands of LRMS based on the spectral features.
Therefore, it remains a great challenge for the network's characterization capability to fully extract spectral and spatial information and fuse them; (2) In the encoder, existing networks do not jointly attend to spatial structure and spectral information, and a single attention can easily cause a mismatch between spatial and spectral information; (3) Most networks simply perform single-level decoding in the image reconstruction phase and pay little attention to the loss of information in the feature extraction phase. Such decoding easily leads to spatial distortion and information distortion in the sharpening results.
To solve the above problems, we propose a novel pansharpening network, called DPAFNet, where the PAFB module is used for joint spatial and spectral attention. Moreover, a hybrid loss is developed to effectively train our framework. Specifically, the reconstruction loss pixel-wisely reconstructs the HRMS image, whereas the spatial loss and the spectral loss are designed from the perspective of the spatial and spectral attention mechanisms, respectively, to further improve the structure of the HRMS image. The contributions of this paper are as follows:
• An end-to-end pansharpening framework. We perform primary and deep feature extraction for PAN and LRMS images. In the deep feature extraction stage, we use the parallel attention residual dense block (PARDB) for multi-level extraction, which can extract spatial details and spectral correlations over a wide spectral range. The multi-level feature extraction promotes the representation capability of the network, addressing the first challenge;
• A parallel attention residual dense block. We propose a parallel attention residual dense block (PARDB) in the encoder, which consists of a Dense Block and a parallel attention fusion block (PAFB). The PAFB can effectively focus on spectral and spatial information while reducing redundancy. Note that the PAFB effectively distinguishes important from redundant information in both the feature extraction phase and the fusion phase, addressing the second challenge;
• A multi-stage reconstruction network. In the image reconstruction stage, we propose multi-stage reconstruction of residuals (MSRR) for multi-level decoding, which meanwhile supplements the information for image reconstruction. We join the encoded information with the decoded information to act as an information supplement. This effectively addresses the third challenge.

Traditional Methods
In the past decades, many methods have been proposed for pansharpening, which can be divided into three categories: component substitution (CS), multi-resolution analysis (MRA), and the super-resolution (SR) paradigm. The main idea of CS methods is to perform a specific transformation of the LRMS image to separate spatial and spectral information, and then replace the separated spatial information with that of the PAN image. Representative CS-based methods are principal component analysis (PCA) [20], intensity-hue-saturation (IHS) [21], and Gram-Schmidt (GS) [22]. These methods generally produce accurate spatial details, whereas they suffer from significant spectral distortion owing to the mismatch of the spectral ranges between the PAN and LRMS images.
MRA-based methods, such as high-pass filtering (HPF) [20], smoothing filter-based intensity modulation (SFIM) [23,24], the "à trous" wavelet transform (ATWT) [25], the decimated wavelet transform using an additive injection model (Indusion) [26], and the MTF generalized low-pass (MTF-GLP) method [27], employ PAN images to infer the missing spatial details of LRMS images: they extract the high-frequency features of PAN images through multi-scale decomposition and adaptively inject them into the up-sampled LRMS images. However, insufficient inference and decomposition algorithms can result in spatial distortion, such that appropriate injection strategies must be introduced.
In SR methods, PAN and LRMS images are usually considered to be the result of the degradation of HRMS images in spatial structure and spectral information. Therefore, pansharpening is treated as a recovery problem that restores the HRMS from the degraded images. According to the SR formulation, the PAN and LRMS images can be considered as a linear combination and a blurred version of the HRMS image, respectively. Hence, these methods, including sparse representation [28], variational methods [29], filter estimation with high-pass modulation (FE-HPM) [30], and model-based fusion using PCA and wavelets (PWMBF) [31], recover images mainly by constructing constraint terms. Nevertheless, sparse representation often results in spatial distortion since the spatial structure is destroyed during degradation.

Network Backbone for Pansharpening
Resorting to CNNs and Generative Adversarial Networks (GANs), many deep learning-based methods have been introduced for the pansharpening task. Among SL-based algorithms, Masi et al. [12] proposed PNN, which interpolates the LRMS image and stacks it with the PAN image. Jin et al. [13] proposed a simple and effective bilateral activation mechanism (BAM) to avoid simply performing a negative truncation. However, the HRMS images will suffer from unreasonable artifacts in the training phase. Additionally, how to avoid the information loss of the LRMS image during the downsampling operation remains an open problem.
To alleviate the defects of SL-based schemes, UL-based methods have been developed to directly reconstruct the HRMS image without the Wald protocol [32] in the encoder. Representatively, ref. [15] proposed an iterative network and a guided strategy to further extract features from the source images. Ref. [16] employed registration learning in pansharpening (UPSNet) to avoid dedicated registration of the source images. Ref. [18] pre-trained their framework by SL, whereas the whole architecture is fine-tuned by UL. The other two methods, refs. [17,19], generate the HRMS images by a unified CNN-based backbone and a GAN, respectively. Nevertheless, the reconstruction results can exhibit spectral distortion and spatial degradation due to the uncertainty of UL. Moreover, the loss functions in these methods do not further refine the constraints on the spatial structure and spectral information, resulting in distortion of the fused results.

Attention Mechanism for Pansharpening
A good attention mechanism is a core factor in extracting spectral and spatial features, such that it is widely introduced in pansharpening. Specifically, Zhang et al. [33] designed a bidirectional pyramid network (BDPN) to ensure the network gives more attention to local information. Lei et al. [34] proposed a multi-branch attention network to adequately extract spatial and spectral information. Guan et al. [35] employed a dual-attention-based network with a three-stream structure to fully combine the correlation and relevance of the source images. Differently, ref. [36] first obtains the spatial features of the source images by a high-pass filter; then a dual-branch attentional fusion network is proposed to enhance the spectral resolution of the HRMS image. Recently, the vision transformer (ViT) [37] has been widely used in computer vision tasks because the self-attention mechanism can focus on the global features of the source images. Based on this, Meng et al. [38] designed a self-attention-based encoder to extract both local and global information and finally reconstruct the results by stitching and upsampling operations. Although the attention networks in their methods perform effectively, they still suffer from spatial and spectral distortion since the attention mechanisms are not constrained by the loss function.
Compared with other methods, our DPAFNet proposes, in its encoder, a multistage Dense-Parallel attention network to adequately extract spatial and spectral features. Moreover, we develop a hybrid attention loss according to the parallel attention mechanism to effectively train our framework.

Problem Statement
The PAN images have rich spatial information, whereas rich spectral information is exhibited by LRMS images. The goal of this paper is to fuse the complementary information of PAN and LRMS images to generate the HRMS image. To accomplish this task, we propose a new method that adaptively fuses spectral and spatial information in multiple stages. Let M ∈ R h×w×B represent the LRMS image, where B is the number of bands and h × w denotes the spatial size of each band. Let P ∈ R H×W represent the single-band PAN image; generally H = r × h and W = r × w, where r represents the ratio of the spatial resolutions of the LRMS and PAN images. Most traditional approaches follow the fusion framework [39]:

X = M̂ + R, R = ϕ(M, P), (1)

where X ∈ R H×W×B is the pansharpened HRMS image. Additionally, M̂ ∈ R H×W×B is the upsampled version of the LRMS image [27], which constitutes a high-resolution multispectral image. R ∈ R H×W×B can be considered as the residual, in which the detail information of the LRMS and PAN images is extracted by the function ϕ to compose the HRMS image. Therefore, we can rely on Equation (1) to design our network.
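This residual fusion framework can be sketched in a few lines of PyTorch; the bicubic interpolation mode and the placeholder detail extractor ϕ are illustrative assumptions, not the paper's exact choices:

```python
import torch
import torch.nn.functional as F

def pansharpen(ms, pan, phi, r=4):
    """Generic residual fusion framework: X = upsample(M) + phi(M_hat, P).

    ms:  (B, C, h, w) low-resolution multispectral image
    pan: (B, 1, H, W) panchromatic image, with H = r*h and W = r*w
    phi: a network (or callable) predicting the residual detail image R
    """
    ms_up = F.interpolate(ms, scale_factor=r, mode='bicubic', align_corners=False)
    residual = phi(ms_up, pan)   # R in Equation (1)
    return ms_up + residual      # X = M_hat + R
```

With a zero residual the output reduces to the upsampled LRMS image, which is the EXP baseline used later in the experiments.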

Network Framework
Our proposed network is shown in Figure 1. In the data preparation phase, we first up-sample the LRMS image to the same size as P. Then the up-sampled image (marked as M̂ ∈ R H×W×B ) is concatenated with P in the spectral dimension as the input I ∈ R H×W×(B+1) to DPAFNet. In Figure 1, DPAFNet consists of four main components: primary feature extraction (PFE), deep-level feature extraction (DLFE), multi-level feature fusion (MLFF), and multi-stage reconstruction of residuals (MSRR). First, we use the PFE module for primary feature extraction, which consists of a basic stack of convolutional layers and activation functions:

F S = H PFE (I),

where H PFE (·) denotes the PFE module. The feature F S is then fed into the DLFE module, which consists of a stack of i PARDBs. The output F i of the i-th PARDB can be calculated by:

F i = H PARDB,i (F i−1 ),

where H PARDB,i (·) denotes the i-th PARDB block. The output of each PARDB is fed into the MLFF module, which is represented as follows:

F MLF = H MLFF (F 1 , F 2 , . . . , F i ),

where H MLFF (·) represents the MLFF module and F MLF is the fused multi-level feature. Finally, we feed F MLF into the MSRR module, which is calculated as follows:

R = H MSRR (F MLF ),

where R is the output of the MSRR module.
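The four-stage pipeline above can be sketched as follows. Each stage is a plain-conv stand-in for the real module; channel widths, depths, and the MSRR head are illustrative assumptions rather than the paper's configuration:

```python
import torch
import torch.nn as nn

class DPAFNetSkeleton(nn.Module):
    """Structural sketch of PFE -> DLFE (stacked PARDBs) -> MLFF -> MSRR."""
    def __init__(self, bands=4, ch=32, n_pardb=3):
        super().__init__()
        # PFE: basic conv + activation stack
        self.pfe = nn.Sequential(nn.Conv2d(bands + 1, ch, 3, padding=1),
                                 nn.ReLU(inplace=True))
        # DLFE: stand-ins for the PARDB blocks
        self.pardbs = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))
             for _ in range(n_pardb)]
        )
        # MLFF: fuse all multi-level features with a 1x1 conv
        self.mlff = nn.Conv2d(n_pardb * ch, ch, 1)
        # MSRR stand-in: map fused features to the residual image R
        self.msrr = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                  nn.ReLU(inplace=True),
                                  nn.Conv2d(ch, bands, 3, padding=1))

    def forward(self, ms_up, pan):
        i = torch.cat([ms_up, pan], dim=1)   # I = [M_hat, P]
        f = self.pfe(i)                      # F_S
        feats = []
        for pardb in self.pardbs:
            f = pardb(f)                     # F_i
            feats.append(f)
        f_mlf = self.mlff(torch.cat(feats, dim=1))  # F_MLF
        r = self.msrr(f_mlf)                 # residual R
        return ms_up + r                     # HRMS = M_hat + R
```

The final addition of the residual to the upsampled LRMS follows Equation (1).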

DLFE
We design a new module, PARDB, to extract deep features. The spectral attention and spatial attention modules are first introduced, and then the composition of the PAFB module is described. Finally, we introduce how the PAFB is embedded into the dense block to form the PARDB module.

Spectral Attention Module
The primary features extracted by PFE contain different cross-channel information and contribute differently to the fusion process. Therefore, the network should focus more on the feature maps that are highly correlated with the spectral information of the input, while suppressing the remaining redundant information.
The structure of the spectral attention module is shown in Figure 2. To focus more on the spectral information, we use a bottleneck strategy to suppress redundant information, setting the sizes of the convolution kernels to 1 × 1 × C × (C/r) and 1 × 1 × (C/r) × C. Then, the spectral information is gradually aggregated through a convolution block. Finally, the features are compressed into a vector (M spe ) by average pooling, and each value of the vector is mapped to [0, 1] by the sigmoid function. The size of the attention vector equals the number of channels of U ori . The output of the spectral attention module is as follows:

U spe = U ori ⊗ M spe ,

where U ori represents the input of the spectral attention module, ⊗ is the multiplication, and U spe denotes the output of the spectral attention module.
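A minimal PyTorch sketch of this branch is given below; the bottleneck ratio r and the placement of pooling after the convolutions follow the description above, while the exact layer count is an assumption:

```python
import torch
import torch.nn as nn

class SpectralAttention(nn.Module):
    """Channel attention with a 1x1 bottleneck (C -> C/r -> C), average
    pooling to a per-channel vector M_spe, and a sigmoid gate."""
    def __init__(self, channels, r=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels // r, 1),  # 1x1xC x (C/r)
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1),  # 1x1x(C/r) x C
        )
        self.pool = nn.AdaptiveAvgPool2d(1)         # compress to vector M_spe

    def forward(self, u_ori):
        m_spe = torch.sigmoid(self.pool(self.body(u_ori)))  # values in [0, 1]
        return u_ori * m_spe                                # U_spe = U_ori ⊗ M_spe
```

The attention vector has one entry per channel of U_ori and rescales each feature map multiplicatively.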
With the spectral attention module, the network can better suppress redundant information, improve the correlation between channels, and reduce the spectral distortion of the fusion process.

Spatial Attention Module
Unlike the spectral attention module, which compresses spatial information into a single channel, the spatial attention module aims to highlight the most spatially informative regions of the input, while each feature map contains different spatial information. Usually the high-frequency information in textured areas is difficult to sharpen, while the low-frequency information in smooth areas tends to be ignored. Therefore, we exploit the spatial relationships by adding a spatial attention module.
The spatial attention module, shown in Figure 2, is composed of a convolutional layer, two bottleneck blocks, and two following convolutional layers. The bottleneck blocks are similar to residual structures, mainly to reduce the number of parameters and to suppress redundant information. Finally, each value of the feature map (M spa ) is compressed to [0, 1] by the sigmoid function. The output of the spatial attention module is as follows:

U spa = U ori ⊗ M spa ,

where U ori represents the input of the spatial attention module, ⊗ is the multiplication, and U spa is the output of the spatial attention module. By enhancing the location information relevant to sharpening with the spatial attention module, the feature representation of our network is improved. In Figure 2, ⊗ and ⊕ denote the element-wise multiplication and the element-wise sum operation, respectively.
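This branch can be sketched as follows; the residual-style bottleneck widths and the single-channel output map are assumptions consistent with the description, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Residual bottleneck: squeeze channels, process, expand, skip-add."""
    def __init__(self, ch, r=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // r, ch // r, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // r, ch, 1),
        )

    def forward(self, x):
        return x + self.body(x)

class SpatialAttention(nn.Module):
    """One conv, two bottleneck blocks, two trailing convs producing a
    one-channel map M_spa in [0, 1] that gates U_ori spatially."""
    def __init__(self, channels, r=4):
        super().__init__()
        self.head = nn.Conv2d(channels, channels, 3, padding=1)
        self.bottlenecks = nn.Sequential(Bottleneck(channels, r),
                                         Bottleneck(channels, r))
        self.tail = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1),  # one-channel map M_spa
        )

    def forward(self, u_ori):
        m_spa = torch.sigmoid(self.tail(self.bottlenecks(self.head(u_ori))))
        return u_ori * m_spa                       # U_spa = U_ori ⊗ M_spa
```

The one-channel map broadcasts over all bands, so every channel is reweighted by the same spatial mask.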

PAFB
To better fuse the spectral and spatial information, we connect the spectral attention and spatial attention modules in parallel, while concatenating the results of both with the input. Finally, the stacked features are further encoded by two convolutional layers. The output of the PAFB module is as follows:

U PA = f 2 (δ( f 1 ([[U ori , U spe , U spa ]]) + b 1 )) + b 2 ,

where [[·]] denotes the concatenation operation, f i denotes a convolutional layer with a kernel size of 3 × 3, δ(·) represents the ReLU activation function, and b 1 and b 2 are the biases of the two convolutional layers, respectively.
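The parallel wiring can be sketched as below; the two attention branches are simplified stand-ins (a channel gate and a spatial gate) so the block stays self-contained, and all widths are assumptions:

```python
import torch
import torch.nn as nn

class PAFB(nn.Module):
    """Parallel attention fusion block sketch: run a spectral gate and a
    spatial gate in parallel, concatenate both outputs with the input, and
    encode the stack with two 3x3 convolutions."""
    def __init__(self, ch):
        super().__init__()
        self.spectral = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                      nn.Conv2d(ch, ch, 1), nn.Sigmoid())
        self.spatial = nn.Sequential(nn.Conv2d(ch, 1, 3, padding=1), nn.Sigmoid())
        self.fuse = nn.Sequential(
            nn.Conv2d(3 * ch, ch, 3, padding=1),  # f_1 over [[U_ori, U_spe, U_spa]]
            nn.ReLU(inplace=True),                # δ
            nn.Conv2d(ch, ch, 3, padding=1),      # f_2
        )

    def forward(self, x):
        u_spe = x * self.spectral(x)   # spectral-attention branch output
        u_spa = x * self.spatial(x)    # spatial-attention branch output
        return self.fuse(torch.cat([x, u_spe, u_spa], dim=1))
```

Concatenating the unmodified input alongside both attended versions lets the fusion convolutions decide how much of each view to keep.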

PARDB
We embed the PAFB into the dense block, so that the PARDB has a better fusion effect, as shown in Figure 3.
Let F i−1 be the input of the i-th PARDB. First, the local features are extracted by several dense layers with ReLU. Then the different local features are fused and downscaled to form the local feature U ori . We add a PAFB after generating the local features to adaptively refine and fuse them (the result is denoted U PA ). Finally, a skip connection is used to add U PA and F i−1 to achieve feature complementarity. This connection can effectively fuse low-level features with high-level features. The output F i of the i-th PARDB is as follows:

F i = U PA ⊕ F i−1 ,

where ⊕ denotes the element-wise sum operation. PARDB combines the advantages of both the dense block and the PAFB, which can fully extract the features of different layers and effectively fuse the spectral and spatial information.
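A sketch of this block is shown below. The growth rate, layer count, and the single-conv stand-in for the PAFB refinement are assumptions; the dense concatenation, 1x1 local fusion, and outer skip connection follow the description:

```python
import torch
import torch.nn as nn

class PARDB(nn.Module):
    """PARDB sketch: dense layers -> 1x1 local feature fusion (U_ori) ->
    PAFB-style refinement (stand-in) -> skip add with the block input."""
    def __init__(self, ch, growth=32, n_layers=4):
        super().__init__()
        self.dense = nn.ModuleList()
        c = ch
        for _ in range(n_layers):
            self.dense.append(nn.Sequential(
                nn.Conv2d(c, growth, 3, padding=1), nn.ReLU(inplace=True)))
            c += growth
        self.lff = nn.Conv2d(c, ch, 1)               # fuse/downscale to U_ori
        self.pafb = nn.Conv2d(ch, ch, 3, padding=1)  # stand-in for the PAFB

    def forward(self, f_prev):
        feats = [f_prev]
        for layer in self.dense:
            feats.append(layer(torch.cat(feats, dim=1)))  # dense connectivity
        u_ori = self.lff(torch.cat(feats, dim=1))
        u_pa = self.pafb(u_ori)
        return u_pa + f_prev                         # F_i = U_PA ⊕ F_{i-1}
```

Because the output has the same channel width as the input, PARDBs can be stacked directly, as in the DLFE module.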

MLFF Module
To obtain useful spectral and spatial information, we use MLFF to adaptively combine the multi-level features. The output of the MLFF module is as follows:

F MLF = f ([[F 1 , F 2 , . . . , F i ]]),

where [[·]] denotes the concatenation of the outputs of all PARDBs and f is a convolutional layer that fuses the concatenated features.

Figure 3. Diagram of PARDB, which is composed of PAFB and Dense Block.

MSRR Module
In the image reconstruction part, in order to obtain a better reconstructed image, we stack multiple RR i blocks while using skip connections to sum the outputs of the different PARDB blocks with them; the final output is the reconstructed residual. The output of the MSRR module is as follows:

R i = H RR i (R i−1 ⊕ F i ), with R 0 = F MLF ,

where H RR i (·) denotes the RR i block, which is a stack of two convolution and ReLU functions.
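A possible realization of this multi-stage decoding is sketched below; the pairing order between RR blocks and PARDB outputs is an assumption, since only the skip-summation is specified in the text:

```python
import torch
import torch.nn as nn

class RRBlock(nn.Module):
    """One RR_i block: two convolution + ReLU stages, per the description."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)

def msrr(f_mlf, pardb_feats, rr_blocks):
    """Multi-stage residual reconstruction sketch: each RR_i refines the
    running feature, with a skip from one PARDB output (deepest first is
    assumed here) supplementing the decoder with encoder information."""
    x = f_mlf
    for rr, skip in zip(rr_blocks, reversed(pardb_feats)):
        x = rr(x + skip)   # R_i = H_RR_i(R_{i-1} ⊕ F_i)
    return x
```

The returned feature would still be projected to B bands and added to the upsampled LRMS to form the final HRMS image.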

Loss Function
Note that the loss function is an important factor in training deep networks. In this paper, we propose a hybrid loss to optimize our network.

Reconstruction Loss
Most previous sharpening methods utilize the L2-norm as the loss function for parameter optimization of the network, such as [40]. However, the L2-norm suffers from blurring and over-sharpening, such that the L1-norm has been widely employed in pansharpening [35,36]. Inspired by these methods, we use the L1-norm as the reconstruction loss, as follows:

L rec = (1/N) ∑ i=1..N ||X i − GT i || 1 ,

where GT i denotes the i-th reference image, X i represents the i-th image predicted by the network, and N denotes the number of training image pairs.
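In code, the batch-averaged L1 reconstruction term is a one-liner:

```python
import torch

def reconstruction_loss(x, gt):
    """L1 reconstruction loss averaged over all pixels and the batch of
    N image pairs (x and gt have identical shapes)."""
    return torch.mean(torch.abs(x - gt))
```

Using the mean rather than a sum keeps the loss scale independent of image size and batch size.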

Spatial Loss
To preserve the spatial structure in pansharpening, we transform the sharpening result into a single band by applying a band transformation. Then, we use an L2-norm constraint on the difference between the single-band result and the PAN image at the pixel level. Our spatial structural loss is defined as:

L spatial = (1/N) ∑ i=1..N ||X̄ i − P i || 2 ,

where P i denotes the i-th source PAN image, X̄ i represents the band-transformed single-band version of the i-th multispectral image predicted by the network, and N denotes the number of training image pairs.
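A minimal sketch of this term is below; a simple band average is assumed as the band transformation, which the paper does not specify in detail here:

```python
import torch

def spatial_loss(x, pan):
    """Spatial loss sketch: reduce the sharpened result X to a single band
    (band average assumed as the band transformation), then penalize the
    pixel-wise squared L2 distance to the PAN image.

    x:   (B, C, H, W) predicted multispectral image
    pan: (B, 1, H, W) panchromatic image
    """
    x_single = x.mean(dim=1, keepdim=True)  # band transformation (assumption)
    return torch.mean((x_single - pan) ** 2)
```

If a sensor-specific spectral response is known, a weighted band combination could replace the plain average.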

Spectral Loss
SAM is employed to quantify the spectral distortion; here, we use the spectral vector at each position as a spectral feature. Then, the SAM loss [41] is introduced to constrain spectral distortions, which can be defined as:

SAM(X i , GT i ) = arccos( ⟨X i , GT i ⟩ / (||X i || 2 ||GT i || 2 ) ), L spectral = (1/N) ∑ i=1..N SAM(X i , GT i ),

where X i and GT i are the i-th spectral vectors of the two images, ⟨·, ·⟩ is the inner product, || · || 2 denotes the L2-norm of a vector, and SAM(X i , GT i ) is averaged over all pixel locations. The total loss function used for training is as follows:

L = α L rec + β L spatial + γ L spectral ,

where L represents the total loss function, L rec denotes the reconstruction loss, L spatial denotes the spatial loss, L spectral denotes the spectral loss, and α, β, γ are regularization constants.
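The SAM term and the full hybrid loss can be sketched as follows; the default weights follow the parameter analysis later in the paper (β = 0.07, γ = 0.03), while α = 1 and the band-average inside the spatial term are assumptions:

```python
import torch

def sam_loss(x, gt, eps=1e-8):
    """Average spectral angle (radians) between predicted and reference
    spectral vectors at every pixel; x, gt: (B, C, H, W)."""
    inner = (x * gt).sum(dim=1)
    norms = x.norm(dim=1) * gt.norm(dim=1)
    cos = (inner / (norms + eps)).clamp(-1 + 1e-7, 1 - 1e-7)  # keep acos stable
    return torch.acos(cos).mean()

def total_loss(x, gt, pan, alpha=1.0, beta=0.07, gamma=0.03):
    """Hybrid loss L = alpha*L_rec + beta*L_spatial + gamma*L_spectral."""
    l_rec = torch.mean(torch.abs(x - gt))                          # L1 term
    l_spatial = torch.mean((x.mean(dim=1, keepdim=True) - pan) ** 2)
    l_spectral = sam_loss(x, gt)
    return alpha * l_rec + beta * l_spatial + gamma * l_spectral
```

Clamping the cosine strictly inside (-1, 1) avoids the infinite gradient of arccos at the endpoints during training.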

Datasets and Setup
In our experiments, the two real datasets are from the IKONOS and WorldView-2 (WV-2) sensors. The details are shown in Table 1. Based on the number of bands of the different satellites, we trained networks supporting 4 bands and 8 bands, respectively. The IKONOS dataset includes 200 PAN and LRMS pairs with spatial sizes of 1024 × 1024 and 256 × 256, while the WV-2 dataset contains 500 PAN/LRMS pairs with spatial sizes of 1024 × 1024 and 256 × 256. Due to the lack of ground truth, we followed Wald's protocol [32] to generate the simulated datasets; the spatial resolution of a simulated dataset is 1/4 of the real dataset. Note that, in the training phase, Wald's protocol is always applied to the dataset to ensure resolution reduction. Meanwhile, the original LRMS images are used as ground truth (GT) for training the network. For each simulated dataset, we selected 80% and 20% of the data for training and testing, respectively.
Our model is trained using the PyTorch package on a computer with an Nvidia GeForce RTX 2080 GPU. We use the AdamW optimizer to minimize the loss, with the related parameters β 1 = 0.5, β 2 = 0.999, and ϵ = 1 × 10 −8 . Moreover, all convolutional layers have biases and the learning rate is set to 1 × 10 −5 , whereas the loss weights are set to β = 0.07 and γ = 0.03, respectively.

Evaluation Metrics
In pansharpening, the lack of reference images limits the evaluation of the results. To address this, two evaluation protocols are used. The first downsamples the MS and PAN images to reduced resolution based on Wald's protocol [32] and uses the original MS image as the reference. The other performs quality evaluation directly on the real dataset without a reference image. Owing to the presence of reference images, several metrics have been proposed to evaluate the sharpening quality of reduced-resolution images. To evaluate our pansharpening results comprehensively, we chose the spectral angle mapper (SAM) [42] and Q4 [43] or Q8 [44] to measure the spectral distortion, and the spatial correlation coefficient (SCC) [45] to measure the spatial distortion. The universal image quality index averaged over the bands (Q) [46] and the Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS) [47] are used as global indexes. Note that Q4, Q8, Q, and SCC range from 0 to 1, and larger values of these metrics indicate a better HRMS image. SAM and ERGAS range from 0 to any positive number, and the smaller they are, the better the fused result. When evaluating at the original resolution, we use the quality with no reference (QNR) [48] index, which contains two components, D λ and D s , that quantify the spectral and spatial distortion, respectively. All three of these metrics range from 0 to 1, where lower values of D λ and D s and a higher value of QNR denote a preferable fused image.
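The two reference-based metrics most used below can be sketched compactly in NumPy; these follow the standard definitions and are illustrative, not the paper's exact evaluation code:

```python
import numpy as np

def sam_degrees(x, ref, eps=1e-12):
    """Spectral Angle Mapper in degrees, averaged over pixels.
    x, ref: (H, W, C) arrays; each pixel's C values form a spectral vector."""
    inner = (x * ref).sum(-1)
    norms = np.linalg.norm(x, axis=-1) * np.linalg.norm(ref, axis=-1)
    cos = np.clip(inner / (norms + eps), -1.0, 1.0)
    return np.degrees(np.arccos(cos)).mean()

def ergas(x, ref, ratio=4, eps=1e-12):
    """ERGAS (lower is better): 100/ratio * sqrt(mean_b(RMSE_b^2 / mu_b^2)).
    x, ref: (H, W, C); ratio is the PAN/MS spatial resolution ratio."""
    rmse2 = ((x - ref) ** 2).mean(axis=(0, 1))   # per-band squared RMSE
    mean2 = ref.mean(axis=(0, 1)) ** 2           # per-band squared mean
    return 100.0 / ratio * np.sqrt((rmse2 / (mean2 + eps)).mean())
```

Both metrics are zero (up to numerical precision) when the fused image equals the reference.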

Visual and Quantitative Assessments
We show the visual results and quantitative analysis of pansharpening for DPAFNet and 15 other state-of-the-art methods at reduced resolution and full resolution.
Reduced Resolution Analysis: We start the analysis at reduced resolution, where the original MS is considered the ground truth (GT). The spatial resolution of the input MS/PAN images is reduced to 1/4 of the original. The detailed index evaluation results are shown in Tables 2 and 3.
From Tables 2 and 3, it can be seen that the deep learning approaches improve significantly over the traditional approaches in terms of quantitative metrics. Among the five DL methods mentioned in this paper, our method has the best performance, especially on the WV-2 dataset.
The visualization results on the IKONOS and WV-2 datasets are shown in Figures 4 and 5, respectively. In the first row, we show the GT image and the sharpening results of the different methods. Since a reference image is available, to better compare the detail loss, the second row shows the error images corresponding to each method. From Figures 4 and 5, it can be seen that the sharpening results of the traditional methods always have severe spectral distortion and their color saturation is lower than that of the reference image. Additionally, the spatial resolution of the sharpening results produced by conventional methods is severely distorted, especially for PCA [20] and IHS [21], which can be seen more clearly in the error images. By contrast, all the deep learning methods better preserve the spectral information and spatial structure, producing sharpening results with rich edge information and saturated colors. According to the error images, our sharpening result is closest to the reference image, which proves that our method preserves spectral information and spatial structure well. Full Resolution Analysis: In the reduced resolution experiments, we demonstrated that our method is optimal through qualitative and quantitative analysis. However, in real applications, GT is not available and both MS and PAN images are at the resolution of satellite capture. To evaluate the performance at full resolution, we show the quantitative analysis on the IKONOS and WV-2 datasets in Table 4. In the full-resolution evaluation, deep learning methods always give better results, especially on the IKONOS dataset. Deep learning methods also obtain better results on the WV-2 dataset, but are not always better than traditional methods, such as Indusion [26]. Although our proposed method does not obtain the optimal D λ and D s , we obtain the best QNR on the two different datasets.
Figures 6 and 7 show the visualization results of the different sharpening methods at full resolution. For the IKONOS dataset shown in Figure 6, the pansharpened images of PCA, IHS, GS, HPF, Indusion, FE-HPM, and PWMBF lose some image details. In particular, GS not only loses some information but also produces severe spectral distortions. By contrast, the resulting images of the five deep learning methods show no severe spectral distortions. Additionally, to better compare the superiority of each method, we enlarged the sharpened image of each method (marked by the red box) and displayed it in the appropriate position. From the enlarged area of EXP in Figure 6, we can see that our network recovers some color features, indicating that the network has good spectral preservation ability. From the zoomed-in region in Figure 7, it can be seen that the conventional methods cannot fully recover the details of the PAN images, and the sharpening results of PNN, DRPNN, PercepPan, PGMAN, BAM, and MC-JAFN look unnatural. In contrast, our proposed network preserves more details and looks more natural. By visual and quantitative comparison at full resolution, our proposed network outperforms both traditional and state-of-the-art deep learning methods in terms of spectral preservation.

Parameter Analysis
A proper loss function is very important for the training of neural networks. To achieve good sharpening results, we analyzed the two parameters β and γ of the loss function in Table 5. Specifically, we set β to 0.05, 0.07, and 0.09, and the other parameter γ to 0.01, 0.03, and 0.05. We used the different parameter combinations for quantitative analysis. Table 5 shows the quantitative indexes with different parameters. SAM, ERGAS, and SCC achieve the best results when β = 0.07 and γ = 0.03. In summary, we chose β = 0.07 and γ = 0.03 to achieve the best sharpening results.

Ablation Study
To verify the impact of the different modules on our network, we conducted ablation experiments on the network and on the loss. The results are shown in Table 6.

Ablation to Network
To verify the usefulness of the different modules, we divided the network ablation into five parts. The first is the network with "no skip connections and no attention", i.e., no feature compensation in the reconstructed images and no dual-attention mechanism. The second is the network with "no skip connection", whereas the third denotes the network with only "skip connection", i.e., the PAFB module removed from the encoder. As shown in Table 6, the quantitative results show that the skip connection and the parallel attention block both contribute to our pansharpening network. Moreover, we also evaluate a single attention block in the encoder, such that the network uses only the spatial or only the spectral attention mechanism, respectively. We can see that each metric is inferior to the parallel-attention network, whereas they are all superior to the network without the PAFB module. That is to say, each attention part and the skip connection are core factors in generating a better HRMS image.

Ablation to Loss Function
We performed three ablations of the loss function to validate its effectiveness, with only the reconstruction loss (L rec ), no spectral loss (L rec + L spatial ), and no spatial loss (L rec + L spectral ). As illustrated in Table 6, our full loss produces the best results on every metric; in particular, SAM and ERGAS improve by 2.69% and 1.53%, respectively. These results indicate that L spatial and L spectral further constrain the fidelity of spatial and spectral information on the basis of the reconstruction loss.

Discussion of Spatial Loss
In this subsection, we discuss the effectiveness of the spatial loss. Specifically, we replace the spatial loss (L spatial ) with the structural similarity loss (L ssim ) [49]. Table 7 illustrates the quantitative evaluation of these two losses, in which L spatial performs better than L ssim in terms of SAM, ERGAS, and SCC, while L ssim provides better results for Q4 and Q. However, the difference between the two losses in Q4 and Q is not obvious, because these two metrics only focus on a single band. Although L ssim can enhance the spatial structure to some extent, it suffers from mismatching with the corresponding bands, such that the overall spectral quality degrades significantly, as reflected by ERGAS and SAM. These quantitative results imply that the proposed spatial loss can further improve the fusion quality.

Conclusions
In this paper, we proposed a novel panchromatic image sharpening framework, called DPAFNet. By utilizing PARDB, our network can simultaneously learn the unique information of the image and the correlation between MS and PAN images. Instead of single-scale pansharpening, we performed the PAFB module for multi-level feature fusion in the training phase. Furthermore, a novel spatial loss was introduced; the results were able to preserve more spatial structure features from the PAN images. Experiments were carried out on simulated data sets and real datasets, where the visual and quantitative results verified the advantages of our DPAFNet.