Remote Sensing Pansharpening by Full-Depth Feature Fusion

Abstract: Pansharpening is an important yet challenging remote sensing image processing task, which aims to reconstruct a high-resolution (HR) multispectral (MS) image by fusing an HR panchromatic (PAN) image and a low-resolution (LR) MS image. Although deep learning (DL)-based pansharpening methods have achieved encouraging performance, they fail to fully utilize the deep semantic features and shallow contextual features when fusing an HR-PAN image and an LR-MS image. In this paper, we propose an efficient full-depth feature fusion network (FDFNet) for remote sensing pansharpening. Specifically, we design three distinctive branches called the PAN branch, MS branch, and fusion branch. The features extracted from the PAN and MS branches are progressively injected into the fusion branch at every depth, making the information fusion broader and more comprehensive. With this structure, the low-level contextual features and high-level semantic features can be characterized and integrated adequately. Extensive experiments on reduced- and full-resolution datasets acquired by the WorldView-3, QuickBird, and GaoFen-2 sensors demonstrate that the proposed FDFNet, with fewer than 100,000 parameters, outperforms other detail injection-based proposals and several state-of-the-art approaches, both visually and quantitatively.


Introduction
Due to the limited light energy and the sensitivity of remote sensing imaging sensors such as WorldView-3, QuickBird, and GaoFen-2, only multispectral (MS) images with low spatial resolution (LR-MS) and panchromatic (PAN) gray-scale images with high spatial resolution (HR-PAN) can be obtained directly from optical devices. However, what is highly desirable in a wide range of applications, including change detection, classification, and object recognition, are images with both rich spectral information and fine spatial details. The task of pansharpening is precisely to obtain such high-resolution multispectral (HR-MS) images by fusing the known HR-PAN and LR-MS images, improving the spatial resolution of MS images while maintaining their high resolution in the spectral domain. Recently, pansharpening has been an active field of research, receiving more and more attention in remote sensing image processing. The competition [1] initiated by the Data Fusion Committee of the IEEE Geoscience and Remote Sensing Society in 2006 and many recently published review papers attest to the rapid development of pansharpening. Beyond scientific research, pansharpening has also received extensive attention from industry, e.g., Google Earth, DigitalGlobe, etc.
HR-PAN and LR-MS images differ in both the spatial and spectral dimensions, making the fusion procedure more difficult. In [41], Zhang et al. designed two independent branches for the HR-PAN and LR-MS images to explore their features separately, and finally performed feature fusion on their respective deep features. In this case, BDPN fails to fuse the feature maps at the shallow and medium depths of the network, attenuating its fusion ability.
To address the above problem, we propose an efficient full-depth feature fusion network (FDFNet) for remote sensing pansharpening. The main contributions of this paper can be summarized as follows.

1. A novel FDFNet is designed to learn continuous features for the PAN image, MS image, and fusion image through three separate branches arranged in parallel. The transfer of feature maps and the information interaction among the three branches are carried out at different depths of the network, enabling the network to generate specific representations for images of various properties and for the relationships between them.

2. The features extracted from the MS branch and PAN branch are injected into the fusion branch at every depth, helping the network to better characterize and integrate the detailed low-level features and high-level semantic features.

3. Extensive experiments on reduced- and full-resolution datasets captured by the WorldView-3, QuickBird, and GaoFen-2 satellites prove that the proposed FDFNet, with fewer than 100,000 parameters, exceeds the other competitive methods. Comparisons with the LR-MS image and the high-performance DMDNet [34] are shown in Figure 1 for a WorldView-3 dataset.

Related Works
In this section, a brief review of several DL-based methods [34][35][36][37] for pansharpening will be presented.
The successful use of deep CNNs in a wide range of computer vision tasks has led researchers to exploit their nonlinear fitting capabilities for image fusion problems such as pansharpening. In 2016, Masi et al. [37] first attempted to apply CNNs to the pansharpening problem by stacking three convolution layers. The resulting pansharpening neural network (PNN) was trained on a large dataset. Despite such a simple network structure, its performance surpassed almost all traditional methods, indicating the great potential of CNNs for pansharpening and motivating many researchers to carry out further research based on deep learning. In 2017, Yang et al. [35] proposed a neural network with residual learning modules called PanNet, which is easier to train and converges more quickly than PNN. Another important innovation of their work is that the known HR-PAN and LR-MS images are high-pass filtered before being input into PanNet, so that the network can focus on extracting the edge details of the images. Thanks to this high-frequency operation and its simple network structure, PanNet has good generalization ability, making it competent on different datasets.
In 2019, a lightweight network named detail injection-based convolutional neural network (DiCNN1) was designed by He et al. [36], which discards the residual structure used in PanNet. It injects the LR-MS image into the HR-PAN image and then feeds the result to a network that contains only three convolution layers and two ReLU activation layers. Although the number of parameters of DiCNN is small, its performance is superior to PanNet, and it also surpasses PanNet in processing speed, making it more efficient in real application scenarios.
Most recently, Hu et al. [71] proposed multi-scale dynamic convolutions that extract detailed features of PAN images at different scales to obtain effective detail features. In [70], a simple multibranched feature extraction architecture was introduced by Lei et al. They used a gradient calculator to extract the spatial structure information of panchromatic images and designed structural and spectral compensation to fully extract and preserve the spatial structural and spectral information of the images. Jin et al. [42] proposed a Laplacian pyramid pansharpening network architecture that is designed according to the sensors' modulation transfer functions.
In addition, Fu et al. [34] proposed a deep multi-scale detail network (DMDNet), which adopts grouped multi-scale dilated convolutions to sharpen MS images. By grouping a common convolution kernel, the computational burden can be reduced with almost no loss of feature extraction and characterization capability. Moreover, the use of multi-scale dilated convolutions not only expands the receptive field of the network but also perceives spatial features at different scales. This innovative use of dilated and multi-scale convolutions in place of general convolutions for feature characterization makes DMDNet achieve state-of-the-art performance.
The above four DL-based methods can be uniformly expressed as

SR = N_θ([PAN; LRMS]),

where N_θ(·) represents the DL-based network with parameters θ, and [ ; ] represents the concatenation of PAN and LRMS along the channel dimension.
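The uniform formulation above can be sketched in PyTorch as follows. The three-layer body is a PNN-style stand-in with illustrative layer widths, not the exact configuration of any of the four cited methods.

```python
import torch
import torch.nn as nn


class SimplePansharpenNet(nn.Module):
    """Generic SR = N_theta([PAN; MS]) network, PNN-style.

    Layer widths and kernel sizes are illustrative only."""

    def __init__(self, ms_bands=8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ms_bands + 1, 64, 9, padding=4),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, 5, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, ms_bands, 5, padding=2),
        )

    def forward(self, pan, ms_up):
        # [PAN; LRMS]: concatenation along the channel axis
        x = torch.cat([pan, ms_up], dim=1)
        return self.body(x)
```

For an 8-band sensor, a (B, 1, 64, 64) PAN tensor and a (B, 8, 64, 64) upsampled MS tensor yield a (B, 8, 64, 64) sharpened output.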

Proposed Methods
In this section, we will state the motivation of our work, and then introduce the details of the proposed FDFNet. Figure 2 shows the architecture of our proposed network.

Figure 2. The overall architecture of the proposed full-depth feature fusion network (FDFNet), with its PAN branch, MS branch, and fusion branch. Please note that the kernel size of the convolutions in FDFNet is 3 × 3, the PAN branch and MS branch have 16 channels each, and the fusion branch has 32 channels. For more details, please refer to Section 3.2.

Motivation
Although the methods mentioned above provide various empirical approaches to depict the relationships between images, three main limitations have not been addressed. First, they neglect the differences between the LR-MS and HR-PAN images in terms of spectral and spatial information: the two images are simply fused through concatenation or summation and then directly input into the network, so the features contained separately in the HR-PAN and LR-MS images cannot be effectively extracted. Second, the existing methods only perform feature fusion at the first or last layers of the network, so the resulting fusion features may be inadequate for discriminative representation and reasonable integration. Third, separate feature extraction and fusion operations make the network structure complex and computationally expensive, resulting in a cumbersome model.
In response to the above concerns, we extract the rich textures and details contained in the spatial domain of the PAN image and the spectral features contained in the MS image through two independent branches, maintaining the integrity of the spectral information of the MS image and reducing the distortion of the spatial information. To reduce the computational burden of feature fusion, the features obtained from the PAN branch and MS branch at the same depth are injected into a fusion branch that runs parallel to the other two branches. While performing feature extraction, the network thus realizes full-depth feature fusion. In this way, the network can make maximal use of features at different depths and branches, that is, low-level detailed texture features and high-level semantic features, to restore distortion-free fusion images.

Parallel Full-Depth Feature Fusion Network
Consider a PAN image PAN ∈ R^{H×W×1} and an MS image MS ∈ R^{(H/4)×(W/4)×b}, where b represents the number of bands of the MS image. Firstly, MS is upsampled to the same size as PAN by a polynomial kernel with 23 coefficients [76]; let MS↑ ∈ R^{H×W×b} represent the upsampled image. Next, PAN and MS↑ are concatenated together as an original fusion product M ∈ R^{H×W×(b+1)}. Then, PAN, MS↑, and M are sent to three parallel convolutional layers, respectively, to increase the number of channels for later feature extraction.
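The upsampling step can be sketched as below. Note that bicubic interpolation here is only a rough stand-in for the 23-coefficient polynomial kernel of [76], whose exact taps are not reproduced in this paper.

```python
import torch
import torch.nn.functional as F


def upsample_ms(ms, ratio=4):
    """Upsample an LR-MS tensor (B x b x H/4 x W/4) to PAN size.

    Stand-in for the 23-coefficient polynomial interpolation of [76]:
    plain bicubic interpolation, for illustration only.
    """
    return F.interpolate(ms, scale_factor=ratio,
                         mode='bicubic', align_corners=False)
```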
The three feature maps, I_pan, I_ms, and I_fuse, obtained by the above operation are fed into the heads of three parallel branches, called the PAN branch, MS branch, and fusion branch, respectively. Moreover, the features extracted from the PAN branch and MS branch are injected into the fusion branch through the constructed parallel feature fusion block (PFFB); the details of the PFFB are given in Section 3.3 and Figure 3. In particular, there are 4 PFFBs in the proposed FDFNet. The shallow detailed information and deep semantic information are fused through PFFBs distributed at each depth of the network. We believe that such a full-depth fusion is beneficial to improving the network's feature representation ability. After the 4 PFFBs, the feature map from the fusion branch is sent to a convolutional layer with a preceding ReLU activation to reduce its channels to the same number as that of MS. The output feature is denoted as S ∈ R^{H×W×b}. Finally, we add S and the upsampled MS↑, transferred by a long skip connection, to yield the final super-resolution image SR ∈ R^{H×W×b}. The whole process can be expressed as

SR = F_θ(PAN, MS↑) = S + MS↑,

where F_θ(·) represents the FDFNet with parameters θ. For more details about FDFNet, refer to Figure 2.

Parallel Feature Fusion Block
In order to realize the transfer and fusion of feature maps among the three branches, i.e., the PAN branch, MS branch, and fusion branch, we designed a parallel feature fusion block (PFFB). To facilitate the description, let I_pan ∈ R^{H×W×C_pan}, I_ms ∈ R^{H×W×C_ms}, and I_fuse ∈ R^{H×W×C_fuse} represent the input features of the PAN branch, MS branch, and fusion branch, respectively, and let O_pan ∈ R^{H×W×C_pan}, O_ms ∈ R^{H×W×C_ms}, and O_fuse ∈ R^{H×W×C_fuse} represent the corresponding output features, where H and W are the spatial dimensions, and C_pan, C_ms, and C_fuse denote the channels of the feature maps.
Firstly, I_pan and I_ms are subjected to feature extraction to obtain the output features O_pan and O_ms:

O_pan = C_s(A(I_pan)), O_ms = C_s(A(I_ms)),

where C_s(·) represents a convolutional layer whose input channels equal its output channels, and A(·) represents ReLU activation. After that, O_pan, O_ms, and I_fuse are concatenated together as Ĩ_fuse ∈ R^{H×W×(C_pan+C_ms+C_fuse)}. Finally, Ĩ_fuse is subjected to feature extraction and added to I_fuse by a short connection to form the output feature O_fuse:

O_fuse = C_r(A(Ĩ_fuse)) + I_fuse,

where C_r(·) represents a convolutional layer that reduces the channels of the extracted feature to the same number as I_fuse. For more details about the PFFB, refer to Figure 3. We show the outputs of the 4 PFFBs, i.e., O_fuse^i, i = 1, . . . , 4, in Figure 4. It can be seen that as the depth increases, more and more details are perceived. The contours of buildings and streets are also portrayed more clearly, ultimately in the form of high-frequency information. Compared with the previous three feature maps, the last one, O_fuse^4, shows a greater difference between its maximum and minimum values. More sporadic red portions imply more chromatic aberration, sharper edge outlines, and richer high-frequency information, all of which are in line with our expectations.
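The pipeline of Section 3.2 and the PFFB operations above can be sketched together in PyTorch as follows. The wiring (3 × 3 convolutions, C_pan = C_ms = 16, C_fuse = 32, four PFFBs, a long skip from the upsampled MS) is a plausible reading of the paper's description, not the authors' released code; with these settings the sketch has roughly 98,700 parameters, consistent with the stated sub-100,000 budget.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PFFB(nn.Module):
    """Parallel feature fusion block.

    O_pan  = C_s(A(I_pan)),  O_ms = C_s(A(I_ms)),
    O_fuse = C_r(A([O_pan; O_ms; I_fuse])) + I_fuse.
    """

    def __init__(self, c_pan=16, c_ms=16, c_fuse=32):
        super().__init__()
        self.conv_pan = nn.Conv2d(c_pan, c_pan, 3, padding=1)
        self.conv_ms = nn.Conv2d(c_ms, c_ms, 3, padding=1)
        self.conv_reduce = nn.Conv2d(c_pan + c_ms + c_fuse, c_fuse, 3, padding=1)

    def forward(self, i_pan, i_ms, i_fuse):
        o_pan = self.conv_pan(F.relu(i_pan))             # O_pan = C_s(A(I_pan))
        o_ms = self.conv_ms(F.relu(i_ms))                # O_ms  = C_s(A(I_ms))
        cat = torch.cat([o_pan, o_ms, i_fuse], dim=1)    # [O_pan; O_ms; I_fuse]
        o_fuse = self.conv_reduce(F.relu(cat)) + i_fuse  # short connection
        return o_pan, o_ms, o_fuse


class FDFNet(nn.Module):
    """Sketch of the three-branch FDFNet with full-depth fusion."""

    def __init__(self, bands=8, c_pan=16, c_ms=16, c_fuse=32, n_blocks=4):
        super().__init__()
        self.head_pan = nn.Conv2d(1, c_pan, 3, padding=1)
        self.head_ms = nn.Conv2d(bands, c_ms, 3, padding=1)
        self.head_fuse = nn.Conv2d(bands + 1, c_fuse, 3, padding=1)
        self.blocks = nn.ModuleList(
            PFFB(c_pan, c_ms, c_fuse) for _ in range(n_blocks))
        self.tail = nn.Conv2d(c_fuse, bands, 3, padding=1)

    def forward(self, pan, ms_up):
        m = torch.cat([pan, ms_up], dim=1)  # original fusion product M
        f_pan = self.head_pan(pan)
        f_ms = self.head_ms(ms_up)
        f_fuse = self.head_fuse(m)
        for blk in self.blocks:             # fusion at every depth
            f_pan, f_ms, f_fuse = blk(f_pan, f_ms, f_fuse)
        s = self.tail(F.relu(f_fuse))       # S, reduced to b channels
        return s + ms_up                    # long skip: SR = S + MS_up
```

Feeding a (B, 1, H, W) PAN tensor and the (B, b, H, W) upsampled MS tensor returns the (B, b, H, W) sharpened product SR.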
The closer to the shallow layers, the more blurred the information and the smaller the targets that can be detected, as O_fuse^1 in Figure 4 shows; this shallow information is nevertheless indispensable. To obtain in-depth high-frequency information while retaining the shallow information, we add skip connections between neighboring PFFB modules. By leveraging these skip connections, the product of the previous block can be transferred to the deeper layers, which retains the shallow features and enriches the deep semantics.

Loss Function
To depict the difference between SR and the ground-truth (GT) image, we adopt the mean square error (MSE) to optimize the proposed framework during training. The loss function can be expressed as

L(θ) = (1/N) Σ_{i=1}^{N} ||SR_i − GT_i||_F^2,

where N represents the number of training samples and ||·||_F is the Frobenius norm.
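A direct batch implementation of this loss is sketched below, summing the squared Frobenius norm per sample and averaging over the batch.

```python
import torch


def mse_loss(sr, gt):
    """L(theta) = (1/N) * sum_i ||SR_i - GT_i||_F^2.

    sr, gt: (N, C, H, W) tensors; the per-sample squared Frobenius
    norm is taken over all channels and pixels."""
    return ((sr - gt) ** 2).flatten(1).sum(dim=1).mean()
```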

Experiments
This section is devoted to experimental evaluation. The proposed FDFNet is compared with some recent competitive approaches on various datasets obtained by the WorldView-3 (WV3), QuickBird (QB), and GaoFen-2 (GF2) satellites. First, the preprocessing of the datasets and the training details are described. Then, the quantitative metrics and visual results at reduced and full resolution are presented to illustrate the effectiveness of the full-depth feature fusion scheme. Finally, extensive ablation studies analyze how the proposed full-depth fusion scheme benefits the fusion process. Moreover, once our paper is accepted, the source code for training and testing will be open-sourced.

Dataset
To benchmark the effectiveness of FDFNet for pansharpening, we adopted a wide range of datasets, including 4-band datasets captured by the QuickBird (QB) and GaoFen-2 (GF2) satellites and 8-band datasets captured by WorldView-3 (WV3). The 4-band datasets contain four standard colors (red, green, blue, and near-infrared). The 8-band datasets add four further bands (coastal, yellow, red edge, and near-infrared 2). The spatial resolution ratio between PAN and MS is equal to 4. As ground-truth (GT) images are not available, Wald's protocol [77] is performed for baseline image generation, degrading the original PAN and MS images by the resolution ratio so that the original MS image can serve as the reference. It is worth mentioning that our patches all come from a single acquisition, so there is no network generalization problem within each dataset.

Training Details and Parameters
Due to the different numbers of bands, we build separate training and test datasets for WV3, QB, and GF2, as described in Section 4.1, and train and test the network separately on each dataset. All DL-based methods are fairly trained on the same data on an NVIDIA GeForce RTX 2080 Ti with 11 GB of memory. We train FDFNet for 1000 epochs under the PyTorch framework, with the learning rate set to 3 × 10−4 for the first 500 epochs and 1 × 10−4 for the last 500 epochs, adjusted empirically according to the loss curve during training. C_pan and C_ms are set to 16, C_fuse is set to 32, and four PFFBs are included in the network. We employ Adam [78] as the optimizer with a batch size of 32, while β1 and β2 are set to 0.9 and 0.999, respectively. The batch size has little effect on the final result; β1 and β2 are the default settings of Adam, and we achieved satisfactory results without adjusting them. For the compared approaches, we use the source codes provided by the authors or re-implement them with the default parameters given in the corresponding papers.
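The two-stage learning rate schedule described above can be sketched with a standard PyTorch optimizer and scheduler; the helper name `make_optimizer` is ours, and the gamma value simply converts 3 × 10−4 into 1 × 10−4 at epoch 500.

```python
import torch


def make_optimizer(model):
    """Adam with the paper's settings: lr 3e-4 for epochs 1-500,
    then 1e-4 for epochs 501-1000; betas are Adam's defaults."""
    opt = torch.optim.Adam(model.parameters(), lr=3e-4, betas=(0.9, 0.999))
    # Multiply the lr by 1/3 once, after 500 scheduler steps (epochs).
    sched = torch.optim.lr_scheduler.MultiStepLR(
        opt, milestones=[500], gamma=1e-4 / 3e-4)
    return opt, sched
```

In the training loop, `sched.step()` is called once per epoch after `opt.step()`.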
Quality evaluation is carried out at reduced and full resolution. For the reduced-resolution tests, the relative dimensionless global error in synthesis (ERGAS) [80], the spectral angle mapper (SAM) [81], the spatial correlation coefficient (SCC) [82], and the quality index for 4-band images (Q4) or 8-band images (Q8) [83] are used to assess the quality of the results. In addition, to evaluate the performance of all the involved methods at full resolution, the QNR, D_λ, and D_s [84,85] indexes are applied.

Performance Comparison with Reduced-Resolution WV3 Data
We compare the performance of all the introduced benchmarks on the 1258 testing samples. For each testing example, the sizes of the PAN, MS, and GT images are the same as those of the training examples, i.e., 64 × 64 for the PAN image, 16 × 16 × 8 for the original low spatial resolution MS image, and 64 × 64 × 8 for the GT image. Table 1 reports the average and standard deviation of the metrics for all compared methods. It is clear that the proposed FDFNet outperforms the other advanced methods on all the assessment metrics. Specifically, the result obtained by our network exceeds the average value of DMDNet on SAM and ERGAS by almost 0.3, which is a noticeable improvement. Since SAM and ERGAS measure spectral and spatial fidelity, respectively, FDFNet evidently strikes a satisfying balance between spectral and spatial information. In addition, it can be seen that DL-based methods outperform traditional CS/MRA methods, but this superiority rests on large-scale training data. Therefore, we also introduce a new WorldView-3 dataset that captures scenes of Rio, which were never fed into the networks during their training phase. The Rio dataset has 30-cm resolution, and the sizes of the GT, HR-PAN, and LR-MS images are 256 × 256 × 8, 256 × 256, and 64 × 64 × 8, respectively. We then test all the methods on the Rio dataset; the results are shown in Table 2. Consistent with the previous results, our method performs best on all indicators.
Moreover, we compare the testing time of all methods on the Rio dataset to prove the efficiency of our approach. The recorded times are reported in the last column of Table 2. FDFNet takes the shortest time among the DL-based methods, reflecting the high efficiency of full-depth integration and parallel processing. We also display a visual comparison of our FDFNet with other state-of-the-art methods in Figure 5. To facilitate the distinction between the quality of the results, we also show the corresponding residual maps in Figure 6, which take the GT image as a reference. FDFNet yields more details with less blurring, especially in areas with dense buildings. These results verify that FDFNet indeed exploits the rich textures of the source images. Compared with other methods, FDFNet performs feature fusion at all depths, covering the detailed features of the shallow layers and the semantic features of the deep layers. It is worth noting that in this case, LPPN and DMDNet are not far from our proposal.

Performance Comparison with Full-Resolution WV3 Data
In this section, we assess the proposed framework on full-resolution data to test its performance on real data, since the various pansharpening methods are ultimately applied in actual scenes without reference images. Similar to the experiments at reduced resolution, both quantitative and visual comparisons are carried out.
The quantitative results are reported in Table 5. The proposed FDFNet achieves optimal or suboptimal results on several indicators. It is worth noting that although some DL-based methods perform well at reduced resolution, some of their indicators are inferior even to traditional techniques at full resolution, which underlines the importance of network generalization. Furthermore, through the visual experiments of Figures 11 and 12, the pros and cons of the various methods can be reflected more intuitively. Clearly, the texture of the super-resolution MS image obtained by our method is sharper, and there are no artifacts such as those yielded by DMDNet and PanNet. This also demonstrates that FDFNet has good generalization capability and can handle pansharpening in actual application scenarios more effectively. Table 5. The average quantification compared to the relative standard deviation (std) of 50 full-resolution WV3 samples. The best performance is shown in bold and second place is underlined.

Performance Comparison with Full-Resolution 4-Band Data
We also compare the proposed method on 4-band full-resolution datasets, including QB and GF2 data. The quantitative results for all indicators are reported in Tables 6 and 7. Furthermore, through the visual experiments of Figures 13 and 14, the advantages and disadvantages of the alternative strategies can be represented more naturally. It can be seen that our proposed FDFNet achieves better results at the full resolution of different sensors, which also shows the effectiveness of our proposed method. However, it should also be noted that traditional methods such as BDSD generalize better on some indicators. Table 6. The average quantification compared to the relative standard deviation (std) of 100 full-resolution QB samples. The best performance is shown in bold and second place is underlined.

Ablation Study
Ablation experiments were done to further verify the efficiency of PFFB. In this subsection, the importance of each branch, the number of PFFB modules, and the number of channels will be discussed.

Functions of Each Branch
In particular, we utilize the full FDFNet as the baseline. Three ablation variants, FDFNet-v1, FDFNet-v2, and FDFNet-v3, are uniformly trained on the WV3 dataset introduced in Section 4.1, with training details consistent with FDFNet. Then, we perform the test on the Rio dataset. The results of the ablation experiments are shown in Table 8. We can see that the performance of FDFNet surpasses all three ablation variants on all indicators. Besides, both FDFNet-v2 and FDFNet-v3 perform better than FDFNet-v1, which demonstrates that the MS branch and PAN branch can promote the fidelity of spectral and spatial features and boost the fusion outcomes for pansharpening. This also shows that designing separate branches for feature extraction and distinct representation of the MS and PAN images is a good choice.

Number of PFFB Modules
Given that the main contribution of this paper is the introduction of the PFFB, we first examine the effect of network depth by testing the baseline framework with different numbers of PFFBs. In this case, we set the number of PFFBs to 2 to 6 and 10, respectively. As the number of modules increases and the network deepens, the number of training parameters also increases correspondingly. Figures 15 and 16 present quantitative and parameter comparisons among the different numbers of PFFBs on the Rio dataset. Within the range of 2 to 6, more PFFBs can better achieve the full-depth feature fusion of the network, but the memory burden grows with the corresponding increase in the number of parameters, and the test results decrease slightly at the upper end of this range. When the number of PFFBs reaches 10, the fusion advantages of the PFFB dominate again and performance rises. Therefore, in order to balance performance and efficiency, we chose 4 PFFBs as the default setting.

Number of Channels
We also test the effect of the number of channels in the MS and PAN branches. Based on the previous discussion on the number of PFFBs, we set the number of blocks to 4 and the number of channels to 8, 16, 32, and 64, respectively, and carried out experiments on the Rio dataset. We plot the performance on each indicator against the number of parameters in Figure 17. With the increase in the number of channels, the number of parameters gradually increases and more spectral information can be explored; the best results are obtained with 64 channels. However, considering the pressure that a large number of parameters puts on the memory load, in order to balance network performance and memory load and maximize the advantages of both, we chose 16 as the default number of channels.

Parameter Numbers
The numbers of parameters (NoPs) of all the compared DL-based methods are presented in Table 9. It can be seen that the number of parameters of FDFNet is not much larger than those of the other DL-based methods, yet it achieves the best results. This is because our network performs more efficient fusion at full depth, which leads to promising results that achieve a reasonable balance between spectral and spatial information, and also proves the efficiency of extracting features in parallel branches.
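The NoPs figures in Table 9 can be reproduced for any PyTorch model with a one-line helper; `count_params` is our name for it.

```python
import torch.nn as nn


def count_params(model):
    """Number of trainable parameters (NoPs) of a PyTorch module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```

For example, a single 3 × 3 convolution with 16 input and 16 output channels contributes 16 × 16 × 9 + 16 = 2320 parameters.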

Conclusions
In this work, we introduced an effective full-depth feature fusion network (FDFNet) for remote sensing pansharpening that contains three distinctive branches called the PAN branch, MS branch, and fusion branch. The fusion of these three branches is operated at every depth to make the information fusion more comprehensive. Furthermore, the parallel feature fusion block (PFFB) that composes FDFNet can also be treated as a basic module, applicable to other CNN-based structures for solving remote sensing image fusion problems. Extensive experiments validate the superiority of our FDFNet on reduced- and full-resolution images in comparison to state-of-the-art pansharpening methods, with relatively few parameters.