Article

DiTBN: Detail Injection-Based Two-Branch Network for Pansharpening of Remote Sensing Images

1 School of Automation and Information Engineering, Xi’an University of Technology, Xi’an 710048, China
2 Shaanxi Key Laboratory of Complex System Control and Intelligent Information Processing, Xi’an University of Technology, Xi’an 710048, China
* Author to whom correspondence should be addressed.
Remote Sens. 2022, 14(23), 6120; https://doi.org/10.3390/rs14236120
Submission received: 20 October 2022 / Revised: 28 November 2022 / Accepted: 29 November 2022 / Published: 2 December 2022
(This article belongs to the Special Issue Deep Reinforcement Learning in Remote Sensing Image Processing)

Abstract

Pansharpening is one of the main research topics in the field of remote sensing image processing. In pansharpening, the spectral information from a low spatial resolution multispectral (LRMS) image and the spatial information from a high spatial resolution panchromatic (PAN) image are integrated to obtain a high spatial resolution multispectral (HRMS) image. As a prerequisite for the application of LRMS and PAN images, pansharpening has received extensive attention from researchers, and many pansharpening methods based on convolutional neural networks (CNNs) have been proposed. However, most CNN-based methods regard pansharpening as a super-resolution reconstruction problem, which may not make full use of the feature information in the two types of source images. Inspired by the PanNet model, this paper proposes a detail injection-based two-branch network (DiTBN) for pansharpening. In order to obtain the most abundant spatial detail features, a two-branch network is designed to extract features from the high-frequency component of the PAN image and the multispectral image. Moreover, the feature information provided by the source images is reused in the network to further improve information utilization. In order to avoid the difficulty of training on real datasets, a new loss function is introduced to enhance the spectral and spatial consistency between the fused HRMS image and the input images. Experiments on different datasets show that the proposed method achieves excellent performance in both qualitative and quantitative evaluations as compared with several advanced pansharpening methods.


1. Introduction

Some well-known remote sensing satellites, such as Ikonos, GeoEye-1, and WorldView-3, usually acquire two types of images: a low spatial resolution multispectral (LRMS) image with high spectral resolution and a panchromatic (PAN) image with high spatial resolution but low spectral resolution. Each type of image has one-sided characteristics and cannot meet the needs of most practical applications. Thus, it is necessary to fuse the LRMS image and the PAN image to obtain a fused image that combines the complementary characteristics of the two types of images. This fusion process is called pansharpening, which aims to obtain a fused image with rich spatial and spectral information by fusing an LRMS image and a PAN image of the same scene. Pansharpening is used in various fields related to remote sensing images, such as target detection, environmental monitoring, and object classification [1,2,3,4,5].
In recent years, pansharpening technology has made considerable progress. According to their underlying models, existing pansharpening techniques can be mainly divided into three categories: component substitution (CS) methods, multi-resolution analysis (MRA) methods, and model-based methods. The CS methods first separate the spatial information from the upsampled LRMS image through a certain transformation, then replace the separated spatial component with the PAN image, and finally apply the inverse transformation to obtain a pansharpened image. This type of method is simple and fast, but the generated image usually exhibits obvious spectral distortion. Representative CS methods include the intensity–hue–saturation (IHS) transform [6], principal component analysis (PCA) [7], and the Gram–Schmidt (GS) transform [8]. The MRA methods are based on the assumption that the spatial information missing from the LRMS image can be inferred from the high-frequency information of the PAN image. These methods first obtain high-frequency components from the PAN image through multi-resolution decomposition tools such as the wavelet transform [9], high-pass filtering [10], and the Laplacian pyramid [11], and then inject these components into the upsampled LRMS image. The fused image generated by such methods generally has higher spectral fidelity, but is prone to blurred details and artifacts.
Model-based methods have received more attention in recent years. A representative model is the convolutional neural network (CNN) in deep learning (DL). These methods construct the relationship between the source images and the ideal HRMS image and then make predictions through the trained model. The successful practice of CNNs in many fields has inspired more and more researchers to use CNNs for pansharpening. For example, Masi et al. [12] pioneered the use of CNNs for pansharpening. In their PNN, they first concatenated the upsampled LRMS and PAN images, and then utilized a simple three-layer network to fuse the concatenated data. To improve the network performance, Scarpa [13] introduced residual learning and an L1 loss function based on PNN. Yang et al. [14] proposed a network called PanNet that performs the fusion process in the high-pass domain of the source images to enhance the structural consistency between the fused image and the source images. In addition, they introduced skip connections to deepen the network structure and maintain the spectral consistency between the fused image and the upsampled LRMS image. Liu et al. [15] proposed a two-stream fusion network (TFNet) that uses two sub-networks to extract features from the LRMS and PAN images, respectively, and then fuses the feature maps in the feature domain. He et al. [16] proposed a detail injection-based convolutional neural network (DiCNN) method for pansharpening the MS image, in which the details are formulated in an end-to-end manner; DiCNN1 is one variant of DiCNN, which extracts the details from both the PAN image and the MS image. Deng et al. [17] proposed a new detail injection-based method named FusionNet, which learns the detail information from the difference between the PAN image and the upsampled LRMS image. Yang et al. [18] proposed a progressive cascaded deep residual network (PCDRN). PCDRN first performs a preliminary fusion of the upsampled LRMS image and the downsampled PAN image, and then further fuses the initial result with the original PAN image. Other CNN models have also made significant contributions to the pansharpening problem and its related applications, including MSDRN [19], SSE-Net [20], MCANet [21], and MMDN [22]. Another famous deep learning model is the Generative Adversarial Network (GAN), which is widely applied in pansharpening. Liu et al. [23] proposed a GAN-based remote sensing image pansharpening method named PSGAN. Ma et al. [24] proposed an unsupervised GAN-based framework for pansharpening, which aims to overcome the lack of ground truth and the insufficient use of the panchromatic information. Benzenati et al. [25] proposed a network called DI-GAN, which uses a GAN to acquire the optimal high-frequency details to improve the fusion performance. Compared with traditional pansharpening algorithms, the DL-based methods can effectively preserve the spectral and spatial information in the source images and obtain a higher quality fused image.
Drawing on the detail injection strategy of the CS and MRA algorithms, this paper proposes a detail injection-based two-branch network (DiTBN; code is available at https://github.com/Another1947/DiTBN, accessed on 28 November 2022) for pansharpening. First, two parallel network branches are used to extract the features of the two source images, and then the obtained feature information is continuously applied to the subsequent fusion process so as to make full use of the unique information contained in the two source images. This strategy compensates for the loss of information in the fusion process and promotes the flow of information. Additionally, to alleviate the high-frequency detail loss and spectral distortion of the fused images observed when training the network with the common MSE loss function, this paper proposes a new loss function, which not only keeps the pixel values consistent with the reference image, but also imposes gradient constraints and spectral constraints on the fused image. The main contributions of this paper are summarized as follows:
(1) The proposed method constructs a detail injection-based two-branch network within the traditional CS/MRA fusion framework. The proposed network aims to adaptively acquire the optimal detail information, which can further enhance the spatial consistency between the fused image and the input PAN image. Compared with end-to-end CNN-based pansharpening methods, the proposed method takes advantage of both the traditional fusion framework and the CNN-based fusion framework, which produces robust and superior fusion performance.
(2) Many 1 × 1 convolution kernels are used to maintain the spatial information of the input images or feature maps, to reduce the number of model parameters, and to increase the diversity of features across different convolution kernels. This strategy makes the fusion network lighter and reduces the complexity of the network model.
(3) A new loss function is proposed. It not only includes the pixel value similarity between the fused image and the reference image, but also considers the spatial and spectral similarity between the fused image and the two input images. The spatial and spectral loss terms help the fused images achieve better spectral and spatial quality.
The rest of this paper is organized as follows: Section 2 introduces the related work and background. Section 3 introduces the network structure and the loss function in detail. Experiments and performance evaluations are conducted in Section 4. Section 5 concludes this paper.

2. Background and Related Work

Let $M \in \mathbb{R}^{H/r \times W/r \times B}$ and $P \in \mathbb{R}^{H \times W}$ be the observed original LRMS and PAN images, respectively, where $H$ and $W$ respectively denote the height and width of the PAN image, $r$ is the ratio of the spatial resolution between the MS and PAN images, and $B$ is the number of bands of the LRMS image. Let $\tilde{M} \in \mathbb{R}^{H \times W \times B}$ and $\hat{M} \in \mathbb{R}^{H \times W \times B}$ be the interpolated LRMS image with the same size as the PAN image and the estimated pansharpened image, respectively. The task of pansharpening is to integrate the spectral information of the original MS image and the spatial information of the PAN image to obtain a fused image with the same spatial resolution as the PAN image and the same spectral resolution as the MS image.

2.1. The CS and MRA Methods

The CS methods are based on the assumption that the spatial information of the upsampled LRMS images can be separated by some transformation. After the spatial information is separated from the upsampled LRMS image, the spatial component is replaced with the PAN image, and then the pansharpened image is obtained through inverse transformation [26]. Mathematically, forward and inverse transforms can be simplified. The general CS model is expressed as follows:
$$\hat{M}_b = \tilde{M}_b + g_b \left( P - I \right), \quad b = 1, 2, \ldots, B \tag{1}$$
where $\hat{M}_b$ and $\tilde{M}_b$ represent the $b$th band of $\hat{M}$ and $\tilde{M}$, respectively; $g_b$ is the gain coefficient that regulates the amount of detail information injected into each band of the upsampled LRMS image; and $I$ is the intensity component of the upsampled LRMS image, which is usually defined as a linear combination of all the upsampled LRMS image bands, i.e., $I = \sum_{i=1}^{B} w_i \tilde{M}_i$, where $w_i$ are the weight coefficients. Most CS methods follow the above formula, so their key lies in the design of the weight coefficients $w_i$ and gain coefficients $g_b$, both of which significantly affect the quality of the fused images. The CS methods usually suffer from severe spectral distortion due to changes in the low spatial frequencies of the MS image or incomplete separation of spatial information during the transformation.
The MRA methods assume that the missing spatial information in the upsampled LRMS image is available from the PAN image. These methods utilize multi-resolution analysis tools to separate high-frequency spatial information from the PAN image and inject it into the upsampled LRMS image. The process can be summarized in the following form:
$$\hat{M}_b = \tilde{M}_b + g_b \left( P - P_L \right), \quad b = 1, 2, \ldots, B \tag{2}$$
where $P_L$ is the low-pass filtered component of the PAN image. The key to the MRA methods is the acquisition of $P_L$ and the design of $g_b$. Due to the change of spatial information in the process of obtaining $P_L$, the MRA methods are prone to produce spatial distortion.
The acquisition of the intensity component in the CS methods, the acquisition of the low-frequency component of the PAN image in the MRA methods, and the gain coefficients all need to be manually designed, so these two types of methods have limited fusion performance.
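Both injection rules in Equations (1) and (2) share the same structure, differing only in the component subtracted from the PAN image. The following minimal NumPy sketch illustrates this generic detail-injection scheme; the uniform band weights, the injection gains, and the Gaussian low-pass filter standing in for $P_L$ are illustrative assumptions rather than the choices of any specific CS or MRA algorithm.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def detail_injection(ms_up, pan, gains, mode="cs", weights=None, sigma=2.0):
    """Generic detail-injection fusion following Eqs. (1) and (2).

    ms_up  : upsampled LRMS image, shape (H, W, B)
    pan    : PAN image, shape (H, W)
    gains  : per-band injection gains g_b, shape (B,)
    mode   : "cs" subtracts the intensity component I,
             "mra" subtracts a low-pass version P_L of the PAN image
    weights: band weights w_i used to build I (CS only); uniform if None
    sigma  : std of the Gaussian low-pass filter standing in for P_L (MRA only)
    """
    B = ms_up.shape[-1]
    if mode == "cs":
        w = np.full(B, 1.0 / B) if weights is None else np.asarray(weights)
        detail = pan - np.tensordot(ms_up, w, axes=([-1], [0]))  # P - I
    else:
        detail = pan - gaussian_filter(pan, sigma)               # P - P_L
    # inject the same detail image into every band, scaled by g_b
    return ms_up + detail[..., None] * np.asarray(gains)[None, None, :]
```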

2.2. The CNN Methods for Pansharpening

In recent years, deep learning technology has become increasingly popular, and many researchers have applied it to various fields and achieved considerable success. CNNs have been widely used in image-related fields such as image classification [27], image super-resolution [28], image deblurring [29], and object detection [30]. In recent years, CNNs have also been used for pansharpening. The idea of CNN-based pansharpening methods is to use a CNN model to construct the mapping relationship between the source images and the ideal HRMS image. The trained model can be obtained by minimizing the loss (such as the L2 norm) between the pansharpened image and the corresponding ideal HRMS image. Finally, the trained model is used to predict the pansharpened image. If the L2 norm is used as the loss function, the above process can be described as
$$\hat{\theta} = \arg\min_{\theta} \left\| \hat{M} - F \right\|_2^2 = \arg\min_{\theta} \left\| \varphi(\tilde{M}, P, \theta) - F \right\|_2^2, \qquad \hat{M} = \varphi(\tilde{M}, P, \hat{\theta}) \tag{3}$$
where $F$ is the ideal HRMS image, and $\varphi(\cdot)$ and $\theta$ represent the CNN model and its parameters, respectively.
Pansharpening can be regarded as super-resolution reconstruction of the LRMS image; the difference is that pansharpening uses the PAN image as auxiliary information. Inspired by this, Masi et al. [12] pioneered the use of CNNs for pansharpening (PNN). In their work, the upsampled LRMS and PAN images are concatenated and fed into a shallow network with a structure similar to the Super-Resolution CNN (SRCNN) [28]. The network learns under the constraint of the reference labels and then outputs the HRMS images through the trained model. In addition, the authors introduced nonlinear radiometric indices to guide the learning process and improve network performance. Zhong et al. [31] proposed a strategy combining the SRCNN method and the GS method for pansharpening. They first used SRCNN to improve the spatial resolution of the LRMS image, and then used the GS method [8] to fuse the improved LRMS image and the PAN image.
The nonlinearity and fitting ability of a shallow network are limited. To address this problem, He et al. [32] proposed the idea of residual learning. They adopted a simple and effective structure called a “skip connection” that identically maps the input of a residual unit to its output, so that the residual unit only needs to learn the residual between the output and the input. Since the residual is generally sparse and the mapping within the residual unit is more sensitive to changes in the residual than to changes in the output, the learning process becomes easier. In addition, the residual network can effectively alleviate the vanishing gradient problem caused by deepening the network, making it possible to build deeper networks. The proposal of residual learning has promoted the development of CNNs in related application fields. In the field of remote sensing, CNNs that incorporate residual learning have achieved higher performance. For example, Wei et al. [33] proposed a deep residual network (DRPNN) in which residual learning is used and the network structure is deepened. Yang et al. [14] proposed PanNet for fusion in the high-frequency domain. In PanNet, the high-pass filtered versions of the source images are taken as inputs to preserve more details and edges. Then, a deep network composed of multiple residual units is employed to perform feature extraction and feature fusion on the input images. Finally, the upsampled LRMS image is added to the network output through a skip connection to obtain the final fused image, which effectively preserves the spectral information. Shao et al. [34] proposed a two-branch network that employs a deeper sub-network to fully extract spatial features from the PAN image and a shallower sub-network to extract features from the upsampled LRMS image. The feature maps of the two images are then concatenated and fused through a convolutional layer to obtain a detail image. Finally, the upsampled LRMS image is added to the detail image through a skip connection to obtain the fused image. Yang et al. [18] proposed a progressive cascaded deep residual network (PCDRN). In PCDRN, the LRMS image upsampled by a ratio of 2 and the PAN image downsampled by a ratio of 2 are fed into a residual network to obtain a preliminary fusion result. The fusion result is then upsampled by a ratio of 2 and fused with the PAN image to obtain the final HRMS image. Compared with the traditional CS and MRA methods, these CNN-based pansharpening methods achieve a better balance between spectral quality and spatial quality and obtain higher performance.
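As a concrete illustration of the skip-connection idea described above, the following Keras sketch shows a generic residual unit. It is not the exact unit used in any of the cited networks; the filter count is an arbitrary assumption, and the input is assumed to already have `filters` channels so that the identity shortcut can be added directly.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_unit(x, filters=32):
    """Minimal residual unit: two 3x3 conv layers plus an identity skip connection,
    so the stacked layers only have to learn the residual between output and input."""
    shortcut = x                                                   # identity mapping
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    return layers.ReLU()(layers.Add()([y, shortcut]))              # output = F(x) + x
```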

3. Detail Injection-Based Two-Branch Network

3.1. Motivation

According to the introduction in Section 2.1, the CS and MRA methods have one thing in common: the details are first obtained from the PAN image and then injected into the upsampled LRMS image to obtain the HRMS image, which can be mathematically described as
$$\hat{M}_b = \tilde{M}_b + g_b d = \tilde{M}_b + D_b \tag{4}$$
where $d$ represents the details obtained from the PAN image, and $D_b$ represents the details that need to be injected into the $b$th band of the upsampled LRMS image. Equations (1) and (2) indicate that the details from the PAN image are obtained by linear models (linear combination or filtering). However, this assumption has not been confirmed. Meanwhile, the spectral responses of the bands in the MS image overlap with each other, so it may not be reasonable to inject details according to Equation (4). This motivates learning the details with a CNN. On the one hand, a CNN can effectively capture the features of the input images and build a model with a high degree of nonlinearity. On the other hand, this approach eliminates the need to design the injection gains, since the network automatically adjusts the injected details to match the expected output through learning.
Most existing CNN-based pansharpening methods regard pansharpening as an image super-resolution problem, i.e., the upsampled LRMS and PAN images are taken as inputs to directly learn the mapping between the inputs and the desired HRMS image. However, these methods ignore the differences between the two source images. As an image fusion problem, pansharpening should make full use of the spatial information in the PAN image and inject the required parts into the upsampled LRMS image. To better mine and utilize the information in the source images, this paper extracts features from the MS and PAN images separately, and then fuses the features of both to reconstruct the pansharpened image. In addition, to preserve edges and details and enhance the structural consistency between the PAN image and the HRMS image, this paper uses the high-frequency content of the PAN image as one of the inputs [14].

3.2. The Network Architecture

The framework of the proposed method for pansharpening is shown in Figure 1. The implementation of the proposed framework includes three stages: feature extraction, feature fusion, and HRMS image reconstruction. First, the LRMS image is upsampled to the same size as the PAN image, and the high-frequency content of the PAN image, denoted as $P_H$, is obtained by subtracting its low-pass version from the PAN image. The upsampled LRMS image and the high-frequency content of the PAN image are used as inputs.
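The preparation of the two inputs can be sketched as follows. The cubic-spline upsampling and the Gaussian low-pass filter are stand-ins for illustration only; the paper itself uses a 23-coefficient polynomial interpolator for the LRMS image (see Section 4.1) and does not specify the Gaussian parameters used here.

```python
import numpy as np
from scipy.ndimage import zoom, gaussian_filter

def prepare_inputs(lrms, pan, ratio=4, sigma=2.0):
    """Build the two network inputs: the upsampled LRMS image and the
    high-frequency content P_H = PAN - lowpass(PAN)."""
    # interpolate the LRMS image (H/r, W/r, B) to the PAN size (cubic spline as a stand-in)
    ms_up = zoom(lrms, (ratio, ratio, 1), order=3)
    # high-frequency content of the PAN image (Gaussian low-pass as a stand-in filter)
    p_h = pan - gaussian_filter(pan, sigma)
    return ms_up, p_h
```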
In the feature extraction stage, two sub-networks, each composed of three consecutive convolutional layers, are employed to extract features from the high-frequency component of the PAN image and the upsampled LRMS image, respectively. Specifically, the first layer adopts a 1 × 1 kernel to preserve the spatial information of the inputs and reduce the number of model parameters. Let $F_P$ and $F_M$ denote the feature maps extracted from the two inputs, respectively. The output feature maps of the $l$-th layer can be expressed as
$$F_P^{(l)} = f_P^{(l)}(P_H), \quad F_M^{(l)} = f_M^{(l)}(\tilde{M}), \quad l = 1, 2, 3 \tag{5}$$
where $f_P^{(l)}$ and $f_M^{(l)}$ represent the networks composed of the first $l$ layers of the feature extraction sub-networks of $P_H$ and $\tilde{M}$, respectively.
Next, the extracted feature maps $F_P^{(3)}$ and $F_M^{(3)}$ are concatenated along the channel direction and fed into the feature fusion sub-network. The feature fusion sub-network integrates the feature information from the high-frequency component of the PAN image and the upsampled LRMS image to obtain the fused feature maps, and performs further feature extraction on them. The spatial information of the reconstructed HRMS image mainly comes from the PAN image and the spectral information mainly comes from the upsampled LRMS image; however, it is difficult to recover texture details from the high-level features alone, since they encode the semantic and abstract information of the image. Thus, this paper continuously reuses the feature maps from the high-frequency component of the PAN image and the upsampled LRMS image during feature fusion through skip connections. This strategy compensates for the loss of spectral and spatial information in the feature fusion process and continuously injects spectral and spatial information into the fused feature maps, which is conducive to the reconstruction of the HRMS image. The feature fusion sub-network consists of three consecutive residual units, each of which consists of a 1 × 1 convolutional layer, two 3 × 3 convolutional layers, and a skip connection. Its structure is shown in the dashed box in Figure 1. The first residual unit receives the output feature maps of the two feature extraction sub-networks as input. In addition to receiving the output of the previous residual unit, the second and third residual units also receive the feature maps output by the first convolutional layer of each feature extraction sub-network. The output $F_i$ of the $i$-th residual unit can be expressed as
$$F_i = \begin{cases} R_i\big(\big[F_P^{(3)}, F_M^{(3)}\big]\big) & i = 1 \\ R_i\big(\big[F_{i-1}, F_P^{(1)}, F_M^{(1)}\big]\big) & i = 2, 3 \end{cases} \tag{6}$$
where $[\,\cdot\,]$ represents the concatenation operation along the channel dimension, and $R_i$ represents the mapping function of the residual unit.
The feature fusion sub-network outputs a high-level fused feature map $F_3$, which is fed into the image reconstruction sub-network to obtain a detail image. First, a convolutional layer with a 1 × 1 kernel is used to reduce the channel dimension of $F_3$. Then, a convolutional layer with a 3 × 3 kernel maps the dimensionality-reduced feature maps to the detail image that is to be compensated to the upsampled LRMS image. The detail image can be represented as
$$D = H(F_3) \tag{7}$$
where $H$ represents the image reconstruction sub-network composed of the above two convolutional layers. Finally, the upsampled LRMS image $\tilde{M}$ is passed to the output of the image reconstruction sub-network through a skip connection. Thus, the obtained HRMS image is
$$\hat{M} = \tilde{M} + D \tag{8}$$
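The following Keras sketch assembles the pipeline described by Equations (5)–(8). The filter counts, activation functions, and the exact position of the skip connection inside each residual unit are assumptions made for illustration; only the overall structure (two branches with 1 × 1 first layers, three residual units that reuse $F_P^{(1)}$ and $F_M^{(1)}$, a 1 × 1 + 3 × 3 reconstruction stage, and the final skip connection with $\tilde{M}$) follows the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_ditbn(bands=4, filters=32):
    """Sketch of the two-branch architecture in Eqs. (5)-(8)."""
    ms_up = layers.Input(shape=(None, None, bands), name="ms_upsampled")   # M~
    pan_h = layers.Input(shape=(None, None, 1), name="pan_highpass")       # P_H

    def branch(x):
        f1 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)  # 1x1 first layer
        f2 = layers.Conv2D(filters, 3, padding="same", activation="relu")(f1)
        f3 = layers.Conv2D(filters, 3, padding="same", activation="relu")(f2)
        return f1, f3

    fp1, fp3 = branch(pan_h)   # F_P^(1), F_P^(3)
    fm1, fm3 = branch(ms_up)   # F_M^(1), F_M^(3)

    def residual_unit(x):
        y = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
        s = y                                            # skip taken after the 1x1 layer (assumption)
        y = layers.Conv2D(filters, 3, padding="same", activation="relu")(y)
        y = layers.Conv2D(filters, 3, padding="same")(y)
        return layers.ReLU()(layers.Add()([y, s]))

    f = residual_unit(layers.Concatenate()([fp3, fm3]))                  # F_1
    f = residual_unit(layers.Concatenate()([f, fp1, fm1]))               # F_2, reuses shallow features
    f = residual_unit(layers.Concatenate()([f, fp1, fm1]))               # F_3

    d = layers.Conv2D(filters // 2, 1, padding="same", activation="relu")(f)  # channel reduction
    d = layers.Conv2D(bands, 3, padding="same")(d)                           # detail image D
    hrms = layers.Add()([ms_up, d])                                          # M^ = M~ + D
    return Model([ms_up, pan_h], hrms)
```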

3.3. The Loss Function

In fact, the ideal HRMS image does not exist. For this reason, Wald et al. [35] provided a feasible training scheme, i.e., the reduced-resolution PAN and LRMS images are used as network inputs, and the original LRMS image is used as the reference. Denote the PAN and LRMS images at reduced resolution as $p$ and $m$, respectively, and denote the corresponding network output as $\hat{m}$. A dataset with $N$ pairs of training samples can be represented as $\{p^{(i)}, m^{(i)}, M^{(i)}\}_{i=1}^{N}$.
Many CNN-based pansharpening methods aim to minimize the difference between the network output $\hat{m}$ and the corresponding reference image $M$, i.e.,
$$\min \, L(\hat{m}, M) \tag{9}$$
where $L$ represents the loss function. With a reference image, pansharpening can be regarded as a regression problem. The commonly used loss function in regression problems is the mean square error (MSE). The loss function can be expressed as
$$L_M = \frac{1}{N} \sum_{i=1}^{N} \left\| \hat{m}^{(i)} - M^{(i)} \right\|_2^2 \tag{10}$$
Although the MSE loss function has achieved good performance in most CNN-based pansharpening methods, some problems still exist. There is usually a loss of high-frequency details in the fused image, such as the artifact phenomenon [36], especially for small objects in the image. Meanwhile, there is a spectral performance gap between the fused image and the reference image, which is more obvious for small objects. Furthermore, the MSE loss function only considers the pixel value loss between the pansharpened image output by the network and the reference image; it ignores the spatial and spectral losses and does not consider the consistency between the pansharpened image and the source images. Therefore, the loss function can be further improved. To address these problems, this paper proposes a new loss function. In addition to the MSE loss between the sharpened HRMS image and the reference image, the proposed loss function includes two other loss terms: a spatial constraint imposed by the PAN image on the pansharpened image, and a spectral constraint imposed by the LRMS image on the pansharpened image.
The variational methods applied to pansharpening usually model an energy function of the spatial correlation between the PAN and HRMS images and an energy function of the spectral correlation between the LRMS and HRMS images. These functions are optimized to obtain the sharpened HRMS image. Based on the idea of the variational methods, this paper adds spatial and spectral constraints on the sharpened HRMS image during network training. Some variational methods [37] assume that the gradient of each spectral band of the sharpened HRMS image should be consistent with that of the PAN image, i.e.,
$$\nabla \hat{M}_b = \nabla P, \quad b = 1, 2, \ldots, B \tag{11}$$
where $\nabla$ is the gradient operator, defined for a single-band image $x$ as $\nabla x = \sqrt{x_x^2 + x_y^2}$, where $x_x$ and $x_y$ represent the horizontal and vertical gradients, respectively. Accordingly, this paper imposes a spatial information constraint on the sharpened HRMS image during the network training phase, which is defined as
$$L_S = \frac{1}{N} \sum_{i=1}^{N} \left\| \nabla \hat{m}^{(i)} - \mathrm{repmat}\big(\nabla p^{(i)}\big) \right\|_2^2 \tag{12}$$
where $\mathrm{repmat}$ is the operation of extending the gradient image of $p$ along the band direction to the same number of bands as $\hat{m}$. Equation (12) forces the gradient of the sharpened HRMS image to be consistent with that of the PAN image, thereby enhancing the spatial consistency between the HRMS image and the PAN image.
Typically, Wald’s protocol refers to the synthesis property [35,38]. Meanwhile, Wald also proposed a consistency property, i.e., a spatially degraded pansharpened image should be as consistent as possible with the observed LRMS image. For better results, the spatial degradation should use a filter matched to the modulation transfer function (MTF) of the sensor from which the LRMS image is acquired. To improve the consistency between the sharpened HRMS image and the LRMS image, the input LRMS image is utilized to impose a spectral constraint on the output HRMS image during training according to the consistency property, which is defined as
$$L_\lambda = \frac{1}{N} \sum_{i=1}^{N} \left\| f_{\mathrm{degrade}}\big(\hat{m}^{(i)}\big) - m^{(i)} \right\|_2^2 \tag{13}$$
where $f_{\mathrm{degrade}}$ represents the application of a spatial degradation operation with MTF filtering [38] to the image. Through the constraint of Equation (13), the output image after spatial degradation is forced to be consistent with the input LRMS image, and the spectral consistency of the two images is enhanced.
In summary, in addition to the MSE loss between the sharpened HRMS image and the reference image, the above spatial and spectral constraints are imposed as auxiliary terms. In addition, to avoid overfitting, L2 regularization is applied for weight decay, with the weight decay coefficient denoted as $\gamma$. Combining Equations (10), (12), and (13), the total loss function is
$$L = L_M + \alpha \cdot L_S + \beta \cdot L_\lambda + \gamma \left\| \theta \right\|_2^2 \tag{14}$$
where $\alpha$ and $\beta$ are the weight coefficients of the spatial loss term $L_S$ and the spectral loss term $L_\lambda$, respectively, and $\theta$ denotes the model parameters. Each loss term is normalized by the number of elements in a single training sample (i.e., height × width × channels).
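A TensorFlow sketch of the total loss in Equation (14) is given below. It assumes reduced-resolution training batches and an externally supplied `degrade_fn` that performs the MTF-matched blur and downsampling; the per-element normalization is obtained with `reduce_mean`, and the weight-decay term $\gamma\|\theta\|_2^2$ is assumed to be added through kernel regularizers rather than inside this function.

```python
import tensorflow as tf

def spatial_gradient_mag(x):
    """Gradient magnitude per band, matching the definition below Eq. (11)."""
    dy, dx = tf.image.image_gradients(x)
    return tf.sqrt(tf.square(dx) + tf.square(dy) + 1e-12)

def ditbn_loss(hrms, ref, pan, lrms, degrade_fn, alpha=0.1, beta=1.0):
    """Sketch of Eq. (14) without the weight-decay term.

    hrms       : network output m^, shape (N, H, W, B)
    ref        : reference image M, shape (N, H, W, B)
    pan        : reduced-resolution PAN image p, shape (N, H, W, 1)
    lrms       : reduced-resolution LRMS image m, shape (N, H/r, W/r, B)
    degrade_fn : MTF-matched spatial degradation (blur + downsample by r)
    """
    bands = hrms.shape[-1]
    l_m = tf.reduce_mean(tf.square(hrms - ref))                            # Eq. (10)
    pan_grad = tf.tile(spatial_gradient_mag(pan), [1, 1, 1, bands])        # repmat along bands
    l_s = tf.reduce_mean(tf.square(spatial_gradient_mag(hrms) - pan_grad)) # Eq. (12)
    l_lambda = tf.reduce_mean(tf.square(degrade_fn(hrms) - lrms))          # Eq. (13)
    return l_m + alpha * l_s + beta * l_lambda                             # Eq. (14)
```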

4. Experiments

4.1. Experimental Datasets

To evaluate the performance of DiTBN, this paper selects datasets from three satellites, i.e., Ikonos, GeoEye-1, and WorldView-3, to test the model. The Ikonos satellite is the world’s first civilian satellite with a spatial resolution better than 1 m. It provides PAN images with a spatial resolution of 1 m and LRMS images with a spatial resolution of 4 m. The LRMS images include spectral information in four bands (red, green, blue, and near-infrared), and the PAN images have only a single band. The GeoEye-1 satellite carries two sensors: the panchromatic sensor acquires PAN images with a spatial resolution of 0.5 m, and the multispectral sensor acquires LRMS images with a spatial resolution of 2 m. Its LRMS images also cover the red, green, blue, and near-infrared bands. The WorldView-3 satellite provides PAN images with a spatial resolution of 0.31 m and LRMS images with a spatial resolution of 1.24 m. In addition to the four standard spectral bands, its LRMS images include four further bands: coastal, yellow, red edge, and a second near-infrared band. The radiometric resolution of all three types of satellite images is 11 bits. The main characteristics of the three satellites are listed in Table 1.
For each type of satellite source data, this paper first segments it into 200 × 200 LRMS image patches and corresponding 800 × 800 PAN image patches. The image patches do not overlap each other. Then, these image patches are randomly divided into training data, validation data, and test data at a ratio of 8:1:1. During the training phase, this paper follows Wald’s protocol to generate datasets, i.e., the spatially degraded source images are used as inputs and the original LRMS image is used as the reference image. The subsampling factor for all datasets is four. For the training and validation data, spatial degradation with MTF filtering is performed on each group of source image patches, and then the degraded images (inputs) and the source LRMS images (reference images) are cropped into small patches with a certain stride to obtain a large number of training and validation samples. The division of each type of dataset is shown in Table 2. This paper interpolates the LRMS image using a polynomial interpolation method [39] with 23 coefficients.
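The degrade-then-crop pipeline described above can be sketched as follows. The Gaussian blur stands in for the MTF-matched filter, and the patch size and stride are placeholders for the values summarized in Table 2, which are not given numerically in the text.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_reduced_resolution_pair(lrms, pan, ratio=4, sigma=2.0):
    """Wald's protocol: spatially degrade both source images by `ratio`
    so that the original LRMS image can serve as the reference."""
    lrms_lr = gaussian_filter(lrms, (sigma, sigma, 0))[::ratio, ::ratio, :]
    pan_lr = gaussian_filter(pan, sigma)[::ratio, ::ratio]
    return lrms_lr, pan_lr, lrms   # degraded inputs and original LRMS reference

def extract_patches(img, size, stride):
    """Crop overlapping patches with a fixed stride for training/validation."""
    h, w = img.shape[:2]
    return [img[i:i + size, j:j + size]
            for i in range(0, h - size + 1, stride)
            for j in range(0, w - size + 1, stride)]
```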

4.2. Implementation Details

In this article, the TensorFlow framework is used to build, train, and test the network model in a Python 3.6 environment. Training is performed on a computer equipped with an NVIDIA Quadro P4000. The loss function is optimized with the Adam [40] method. The maximum number of iterations is set to 100,100. The initial learning rate is set to 0.001 and decays by 50% every 20,000 iterations. The training batch size is set to 16, and the weight decay coefficient is set to $10^{-5}$. The weight coefficients of the spatial loss term $L_S$ and the spectral loss term $L_\lambda$ are set to 0.1 and 1, respectively.
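The stated schedule can be reproduced with standard TensorFlow utilities, assuming the 50% decay is applied as a staircase exponential decay and the weight decay is realized through L2 kernel regularization:

```python
import tensorflow as tf

# Learning-rate schedule matching the description: start at 1e-3 and halve every 20,000 steps.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=20_000, decay_rate=0.5, staircase=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

# Weight decay realized as L2 kernel regularization (gamma = 1e-5),
# which adds gamma * ||theta||^2 to the training loss as in Eq. (14).
regularizer = tf.keras.regularizers.l2(1e-5)
```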

4.3. The Evaluation Indicators and Comparison Algorithms

In addition to the visual evaluation of the fused images, some quantitative evaluation indicators are used. Performance evaluation is performed at two scales: the reduced-resolution scale (also known as simulated experiments) and the full-resolution scale (also known as real experiments). For the former, this paper selects six widely used indicators, including the spectral angle mapper (SAM) [41], erreur relative globale adimensionnelle de synthèse (ERGAS) [42], relative average spectral error (RASE) [43], spatial correlation coefficient (SCC) [44], universal image quality index (Q) [45], and structural similarity (SSIM) index [46]. For the latter, four indicators are used: the spectral distortion index $D_\lambda$, the spatial distortion index $D_S$, the quality with no reference (QNR) index [47], and SAM.
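For reference, the two most frequently discussed reduced-resolution indicators can be computed as in the NumPy sketch below, which follows the standard definitions of SAM (average spectral angle in degrees) and ERGAS; the remaining indices (RASE, SCC, Q, SSIM, $D_\lambda$, $D_S$, QNR) follow their cited definitions and are not reproduced here.

```python
import numpy as np

def sam(ref, fused, eps=1e-12):
    """Spectral angle mapper (degrees), averaged over all pixels of (H, W, B) images."""
    dot = np.sum(ref * fused, axis=-1)
    norms = np.linalg.norm(ref, axis=-1) * np.linalg.norm(fused, axis=-1) + eps
    angles = np.arccos(np.clip(dot / norms, -1.0, 1.0))
    return np.degrees(np.mean(angles))

def ergas(ref, fused, ratio=4):
    """ERGAS: 100/ratio * sqrt( mean over bands of (RMSE_b / mean_b)^2 )."""
    rmse2 = np.mean((ref - fused) ** 2, axis=(0, 1))   # per-band squared RMSE
    mean2 = np.mean(ref, axis=(0, 1)) ** 2             # per-band squared mean
    return 100.0 / ratio * np.sqrt(np.mean(rmse2 / mean2))
```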
In this paper, eight popular algorithms are used for comparison. The CS comparison algorithms include the Gram–Schmidt mode 2 algorithm with a generalized Laplacian pyramid (GS2-GLP) [11] and the robust band-dependent spatial detail method (BDSD-PC) [48]. The MRA comparison algorithms include the generalized Laplacian pyramid with an MTF-matched filter (MTF-GLP) [11] and the additive wavelet luminance proportional method with haze correction (AWLP-H) [49]. The CNN-based comparison methods include a deep network for pansharpening (PanNet) [14], the first detail injection CNN (DiCNN1) [16], FusionNet [17], and TFNet [15]. The TFNet method is tested on the Ikonos and GeoEye-1 datasets. The first four traditional algorithms are implemented with the toolbox provided by Vivone et al. [50]. The implementation codes of the PanNet, FusionNet, and TFNet methods can be found on open-source websites (code links: https://xueyangfu.github.io/; https://github.com/liangjiandeng/FusionNet; https://github.com/liouxy/tfnet_pytorch, accessed on 1 May 2022). For the DiCNN1 method, the reproduced version provided by Deng et al. [17] is used. For a fair comparison, all CNN-based methods are run under the same hardware and software environment. Meanwhile, since most CNN-based methods do not provide trained models, this paper retrains the CNN-based comparison algorithms with the same datasets as used for the proposed network.

4.4. The Ablation Study of Different Network Structures

High-pass filtering (HPF) is applied to the input PAN image, and feature map reuse is implemented in the proposed network structure. To explore the impact of different network settings on model performance, this paper performs an ablation study on the simulated Ikonos dataset. With reference to Figure 1, HPF2 means that the HPF is applied to both input images at the same time, and HPF1 means that the HPF is applied only to the input PAN image. Reuse2 denotes that the feature maps from both branches are reused, Reuse1 denotes that only the feature maps of the PAN branch are reused, and “no reuse” means that the feature maps are not reused. The model performance is compared under four different combinations of these settings, where “HPF1, reuse2” indicates the setting used in the proposed method. Table 3 lists the average values of all the indicators, where the best value is marked in bold.
It can be seen from Table 3 that the combined setting adopted by the proposed method achieves the best results in spectral performance and competitive results in spatial performance. In the case of reusing the feature maps of both branches, applying the HPF only to the input PAN image performs better than applying the HPF to both input images. This is perhaps because applying the HPF to the input MS image filters out its low-frequency information, and retaining only a small amount of high-frequency information that is not essential to the reconstructed detail image may introduce redundant information, which in turn causes distortion of the fused image. With the HPF applied only to the input PAN image, reusing the feature maps of both branches obtains the best spectral performance. This indicates that reusing only the feature maps of the input PAN image may inject too much spatial information, which results in spectral distortion of the fused image. Furthermore, simultaneously reusing the feature maps of the input MS image can effectively transfer the low-frequency information contained in it to the feature fusion process, which improves the spectral quality of the fused image. Therefore, with the HPF applied to the input PAN image, reusing the feature maps of the MS branch in addition to those of the PAN branch improves the performance, which illustrates the effectiveness of reusing the feature maps of both branches simultaneously.

4.5. The Ablation Studies of the Proposed Loss Function and Kernel Size

This paper introduces a loss function that constrains the spectral and spatial information of the fused image according to the information in the input images. In addition to the commonly used MSE loss, it also includes the spatial constraint of the PAN image on the sharpened HRMS image and the spectral constraint of the LRMS image on the sharpened HRMS image. To verify the effectiveness of the proposed loss function, ablation studies are conducted on the simulated Ikonos dataset. Under the same model structure, experiments are conducted with the MSE loss function, the loss function without $L_S$, the loss function without $L_\lambda$, and the proposed loss function, respectively. The test results are listed in Table 4, where the best value is marked in bold.
It can be seen from Table 4 that training the network with the proposed loss function improves the model performance in both spatial and spectral quality as compared to training the network with the MSE loss function. This is because the MSE loss between the fused image and the reference image only considers the similarity of pixel values. Although it can constrain the spatial and spectral information of the fused image to a certain extent, it does not explicitly point out the spatial and spectral consistency between the fused image and the input images. In addition, the loss function without L s and the loss function without L λ are compared in this experiment. The quantitative assessment results demonstrate that the introduction of spatial and spectral constraints in the proposed loss function makes the model obtain more spatial and spectral clues from the input PAN and LRMS images in the optimization process and strengthens the spatial and spectral consistency between the fused image and the input images.
In this subsection, we also present an ablation study of different kernel sizes, i.e., 1 × 1 and 3 × 3. Table 5 lists the quantitative assessment results on the Ikonos dataset, where the best value is marked in bold. It can be observed that the proposed method with the 1 × 1 kernel size has better fusion performance. In addition, the 1 × 1 kernel size gives the fusion network fewer parameters.

4.6. Performance Evaluation at Reduced-Resolution Scale

In this subsection, the proposed method and the comparison algorithms are evaluated at the reduced-resolution scale. The experiments at the reduced scale generate datasets based on Wald’s protocol. Figure 2 shows the fusion results of different algorithms on a group of Ikonos reduced-resolution test images. To better illustrate the difference between the fused images and the reference image, the residual images are shown in Figure 3. It can be observed from Figure 2 and Figure 3 that the residual image of the GS2-GLP method exhibits serious spectral distortion, especially for the red buildings. In addition, it shows some edge and texture information, indicating that the fused image has spatial distortion. The residual image of the BDSD-PC algorithm also shows serious spectral distortion and spatial distortion. The fused images of the AWLP-H and MTF-GLP algorithms show a certain blurring that is reflected in the corresponding residual images. In addition, they are accompanied by some spectral distortion. Compared with the residual images of traditional algorithms, those obtained by CNN-based methods exhibit less residual information, indicating that CNN-based methods produce fused images with higher spectral and spatial quality. Specifically, the PanNet, DiCNN1, and TFNet methods produce slight spectral distortion and spatial distortion, while the FusionNet method and our proposed method produce the least distortion.
To give an objective evaluation of all algorithms, Table 6 lists the indicator values of the fusion results in Figure 2 and the average indicator values of 24 groups of test images on the Ikonos dataset, where the best value is marked in bold and the second best value is underlined. As shown in Table 6, for the image shown in Figure 2, our analysis of the fused images is basically consistent with the indicator values. For all test images, the CNN-based methods have better performance than the traditional algorithms. Among the traditional algorithms, AWLP-H achieves the leading performance. The proposed algorithm obtains the best results, indicating that it can effectively reduce the spectral and spatial distortions in the fusion process.
Figure 4 and Figure 5 respectively show the fusion results and the corresponding residual images of a group of reduced-resolution samples from the GeoEye-1 dataset. It can be seen from the figures that the fused image of the BDSD-PC method exhibits serious spectral distortion, especially in the water and vegetation areas, and the vegetation area is over-saturated. GS2-GLP also produces some spectral distortion, especially in the vegetation area at the bottom of the image. The fused images of the AWLP-H and MTF-GLP methods have better spectral quality, but show some blurring, especially for the MTF-GLP method. Compared with the fused images of the traditional algorithms, those of the CNN-based methods are closer to the ground truth image in both spectral and spatial terms. However, some spectral and spatial distortions can still be observed in the residual images. The PanNet, DiCNN1, and TFNet methods show relatively more residual information, while FusionNet and the proposed method show less.
To give an objective evaluation of the fusion results in Figure 4 and the performance of each algorithm on the GeoEye-1 dataset, Table 7 lists the indicator values of the fusion results in Figure 4 and the average indicator values of the 25 groups of test images on the GeoEye-1 dataset, where the best value is marked in bold and the second best value is underlined. All subsequent tables follow this convention. According to Table 7, among the traditional algorithms, the AWLP-H algorithm obtains competitive results on the image shown in Figure 4 and leading results in terms of the average indicator values. The CNN-based methods generally achieve better results than the traditional algorithms. Specifically, DiCNN1 and FusionNet compete closely, while our method achieves the best results.
Figure 6 and Figure 7 show an example of the fusion results on the reduced-resolution WorldView-3 dataset and the corresponding residual images, respectively. It can be seen that the BDSD-PC algorithm produces severe spectral distortion, especially on land areas and red buildings. The GS2-GLP algorithm also produces some spectral distortion, mostly for red and blue buildings. The fused image of the AWLP-H algorithm has more natural color but loses some texture information. The MTF-GLP algorithm suffers from spectral distortion on red and blue buildings, as well as some detail blurring. The CNN-based methods all suffer from slight spectral distortion on red buildings. The proposed method produces the least spectral and spatial loss.
Table 8 lists the indicator values of the fusion results in Figure 6 and the average indicator values of 25 groups of test images on the WorldView-3 dataset. For the images shown in Figure 6, it can be observed that the performance of the four traditional algorithms is very close. The proposed method and the FusionNet method achieve the best and second-best results, respectively. For all the test images, the performance of the four traditional algorithms remains relatively close, and the four CNN-based methods generally perform better than the traditional algorithms. In particular, the proposed method obtains the best results in terms of SAM, ERGAS, RASE, SCC, Q, and SSIM.

4.7. Performance Evaluation at Full-Resolution Scale

For practical applications, it is necessary to conduct experiments on remote sensing images at the full-resolution scale. In this subsection, the original datasets are used as the test data to be fused. The performance of each algorithm is evaluated through visual evaluation and objective assessment metrics. Since there is no reference image, the values of the SAM indicator in the quantitative evaluation are calculated between the spatially degraded fused images and the original LRMS images.
Figure 8 shows the fusion results of each algorithm on a group of samples from the full-resolution Ikonos dataset. The area enclosed by a green box is enlarged and placed at the bottom right of each image. Compared with the upsampled LRMS image, the fused images produced by all the methods show a clear improvement in spatial quality, which can easily be observed from the enlarged local areas at the bottom of the images. The GS2-GLP algorithm produces some spectral distortion, mainly in vegetated areas. The fused image of the BDSD-PC algorithm exhibits over-saturation, especially for red tones. The MTF-GLP algorithm also produces some spectral distortion. Compared with the previous algorithms, the fused image of AWLP-H has more natural colors. The FusionNet method exhibits some artifacts at the edges of buildings in the enlarged local area. The other CNN-based methods obtain relatively similar spectral and spatial features, which are close to those of the upsampled LRMS image and the PAN image.
Table 9 lists the indicator values of the fusion results in Figure 8 and the average indicator values of 24 groups of full-resolution test images on the Ikonos dataset. It can be seen that the CNN-based methods generally achieve higher scores than the traditional CS and MRA methods. For the images shown in Figure 8, the proposed method provides the best SAM, QNR, and $D_\lambda$ values, and is surpassed only by the TFNet method in terms of the $D_S$ index. For all the test images, the proposed method obtains the best values in terms of SAM, QNR, and $D_\lambda$, and is surpassed only by the PanNet algorithm in terms of the $D_S$ index.
Figure 9 shows the fusion results of all the pansharpening algorithms on the full-resolution GeoEye-1 dataset. As shown in Figure 9, the GS2-GLP, BDSD-PC, MTF-GLP, and FusionNet methods produce fused images with certain spectral distortions compared with the upsampled LRMS image, especially in the vegetation area. Among them, the fused images of GS2-GLP and MTF-GLP have low saturation, while those of the BDSD-PC and FusionNet methods are over-saturated. Compared with the aforementioned algorithms, the AWLP-H algorithm achieves relatively higher spectral quality. The PanNet, DiCNN1, TFNet, and proposed methods have spectral quality closer to that of the upsampled LRMS image.
Table 10 lists the indicator values of the fusion results shown in Figure 9 and the average indicator values of 25 groups of full-resolution test images on the GeoEye-1 dataset. For the fused images shown in Figure 9, it can be seen that the BDSD-PC algorithm obtains the best values in terms of QNR and $D_S$, while the proposed method obtains the best value in terms of the SAM index and the second-best values in terms of the QNR, $D_\lambda$, and $D_S$ indexes. For all the test images, the BDSD-PC algorithm achieves leading results among the traditional algorithms. Among the CNN-based methods, the FusionNet method achieves excellent performance. The proposed method obtains the best values for all indexes except $D_\lambda$.
Figure 10 shows the fusion results of a group of samples from the full-resolution WorldView-3 dataset. The fused images of the different algorithms are not very different in terms of the buildings. However, it can be seen from the color of the building materials in the upper left part of the image that the GS2-GLP, BDSD-PC, and AWLP-H algorithms produce some spectral distortions. In addition, in the green vegetation area, the GS2-GLP, BDSD-PC, and MTF-GLP algorithms also produce different degrees of spectral distortion. The AWLP-H algorithm presents better colors. Compared with the traditional algorithms, the CNN-based algorithms achieve better spectral quality in the vegetation area.
Table 11 presents the indicator values of the fusion results in Figure 10 and the average indicator values of 25 groups of full-resolution test images on the WorldView-3 dataset. For the images shown in Figure 10, the indicator values of the CNN-based algorithms are mostly better than those of the traditional algorithms, and the competition between the FusionNet method and the proposed method is close. For all test images, the FusionNet method obtains the best values in terms of QNR, $D_\lambda$, and $D_S$, while the proposed method provides the best SAM values. In general, the proposed method has superior performance in preserving the spectral and spatial information.

4.8. Parameters and Complexity

In this subsection, we compare the proposed method with the other CNN-based methods in terms of the number of parameters and floating-point operations (FLOPs) [51]. We calculate the number of parameters and FLOPs for each fusion network model; the results are listed in Table 12. It can be seen that DiCNN1 has the minimum number of FLOPs because its network architecture is shallow. The network architectures of PanNet, FusionNet, and TFNet are more complex than that of DiCNN1, and the number of parameters and FLOPs increases significantly as the number of layers increases. The proposed method has relatively more parameters and FLOPs than the other methods. Although the proposed method has relatively higher complexity, it outperforms the other methods in terms of fusion performance.

5. Conclusions

In this paper, a new CNN-based pansharpening method is proposed based on the detail injection strategy of the traditional CS and MRA methods. Specifically, the model directly learns the detail information missing from the upsampled MS image, in which two independent network branches are used to extract features from the high-frequency component of the PAN image and the upsampled LRMS image. The obtained feature information is repeatedly applied to the fusion process through skip connections. Meanwhile, a new loss function is proposed, which includes pixel value similarity, spatial information similarity, and spectral information similarity. The loss function can better maintain the spectral and spatial consistency between the fused image and the input images. The performance of the proposed method is evaluated on the Ikonos, GeoEye-1, and WorldView-3 datasets. Compared with several traditional algorithms and current advanced CNN-based methods, the proposed method achieves leading performance in the simulated data experiments and competitive performance in the real data experiments. The experimental results indicate the effectiveness and superiority of the proposed method in improving the spectral and spatial quality of fused images. In future work, we will explore unsupervised training strategies and possible compression and optimization of the proposed network structure.

Author Contributions

Conceptualization, W.W., Z.Z. and H.L.; methodology, W.W., Z.Z. and X.Z.; software, Z.Z., T.L. and X.Z.; validation, W.W., X.Z. and T.L.; investigation, Z.Z.; resources, W.W.; writing—original draft preparation, Z.Z. and X.Z.; writing—review and editing, W.W., Z.Z., T.L. and L.L.; visualization, T.L.; supervision, H.L. and L.L.; funding acquisition, W.W., H.L. and L.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially supported by the National Natural Science Foundation of China under Grants 61703334, 61973248, 61873201, and U2034209, the China Postdoctoral Science Foundation under Grant No. 2016M602942XB, and the Key Project of the Shaanxi Key Research and Development Program under Grant No. 2018ZDXM-GY-089. (Corresponding author: Han Liu.)

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sharing is not applicable to this article.

Acknowledgments

We sincerely thank the reviewers who participated in the review of the paper for their valuable comments and constructive suggestions on this paper. In addition, we also express our appreciation for the editorial services provided by MJEditor.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Wu, X.; Feng, J.; Shang, R.; Zhang, X.; Jiao, L. CMNet: Classification-oriented multi-task network for hyperspectral pansharpening. Knowl.-Based Syst. 2022, 256, 109878.
2. Wu, X.; Feng, J.; Shang, R.; Zhang, X.; Jiao, L. Multiobjective Guided Divide-and-Conquer Network for Hyperspectral Pansharpening. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5525317.
3. Ding, Y.; Zhang, Z.; Zhao, X.; Cai, W.; Yang, N.; Hu, H.; Huang, X.; Cao, Y.; Cai, W. Unsupervised self-correlated learning smoothy enhanced locality preserving graph convolution embedding clustering for hyperspectral images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5536716.
4. Ding, Y.; Zhang, Z.; Zhao, X.; Cai, Y.; Li, S.; Deng, B.; Cai, W. Self-supervised locality preserving low-pass graph convolutional embedding for large-scale hyperspectral image clustering. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5536016.
5. Ding, Y.; Zhang, Z.; Zhao, X.; Hong, D.; Li, W.; Cai, W.; Zhan, Y. AF2GNN: Graph convolution with adaptive filters and aggregator fusion for hyperspectral image classification. Inf. Sci. 2022, 602, 201–219.
6. Tu, T.M.; Su, S.C.; Shyu, H.C.; Huang, P.S. A new look at IHS-like image fusion methods. Inf. Fusion 2001, 2, 177–186.
7. Kwarteng, P.; Chavez, A. Extracting spectral contrast in Landsat Thematic Mapper image data using selective principal component analysis. Photogramm. Eng. Remote Sens. 1989, 55, 339–348.
8. Laben, C.A.; Brower, B.V. Process for Enhancing the Spatial Resolution of Multispectral Imagery Using Pan-Sharpening. U.S. Patent 6,011,875, 4 January 2000.
9. Mallat, S.G. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. Pattern Anal. Mach. Intell. 1989, 11, 674–693.
10. Chavez, P.; Sides, S.C.; Anderson, J.A. Comparison of three different methods to merge multiresolution and multispectral data: Landsat TM and SPOT panchromatic. Photogramm. Eng. Remote Sens. 1991, 57, 295–303.
11. Aiazzi, B.; Alparone, L.; Baronti, S.; Garzelli, A.; Selva, M. MTF-tailored multiscale fusion of high-resolution MS and Pan imagery. Photogramm. Eng. Remote Sens. 2006, 72, 591–596.
12. Masi, G.; Cozzolino, D.; Verdoliva, L.; Scarpa, G. Pansharpening by convolutional neural networks. Remote Sens. 2016, 8, 594.
13. Scarpa, G.; Vitale, S.; Cozzolino, D. Target-adaptive CNN-based pansharpening. IEEE Trans. Geosci. Remote Sens. 2018, 56, 5443–5457.
14. Yang, J.; Fu, X.; Hu, Y.; Huang, Y.; Ding, X.; Paisley, J. PanNet: A deep network architecture for pan-sharpening. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5449–5457.
15. Liu, X.; Liu, Q.; Wang, Y. Remote sensing image fusion based on two-stream fusion network. Inf. Fusion 2020, 55, 1–15.
16. He, L.; Rao, Y.; Li, J.; Chanussot, J.; Plaza, A.; Zhu, J.; Li, B. Pansharpening via detail injection based convolutional neural networks. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 1188–1204.
17. Deng, L.J.; Vivone, G.; Jin, C.; Chanussot, J. Detail injection-based deep convolutional neural networks for pansharpening. IEEE Trans. Geosci. Remote Sens. 2020, 59, 6995–7010.
18. Yang, Y.; Tu, W.; Huang, S.; Lu, H. PCDRN: Progressive cascade deep residual network for pansharpening. Remote Sens. 2020, 12, 676.
19. Wang, W.; Zhou, Z.; Liu, H.; Xie, G. MSDRN: Pansharpening of multispectral images via multi-scale deep residual network. Remote Sens. 2021, 13, 1200.
20. Zhang, K.; Wang, A.; Zhang, F.; Diao, W.; Sun, J.; Bruzzone, L. Spatial and spectral extraction network with adaptive feature fusion for pansharpening. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5410814.
21. Lei, D.; Chen, P.; Zhang, L.; Li, W. MCANet: A Multidimensional Channel Attention Residual Neural Network for Pansharpening. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5411916.
22. Tu, W.; Yang, Y.; Huang, S.; Wan, W.; Gan, L.; Lu, H. MMDN: Multi-Scale and Multi-Distillation Dilated Network for Pansharpening. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5410514.
23. Liu, Q.; Zhou, H.; Xu, Q.; Liu, X.; Wang, Y. PSGAN: A generative adversarial network for remote sensing image pan-sharpening. IEEE Trans. Geosci. Remote Sens. 2020, 59, 10227–10242.
24. Ma, J.; Yu, W.; Chen, C.; Liang, P.; Guo, X.; Jiang, J. Pan-GAN: An unsupervised pan-sharpening method for remote sensing image fusion. Inf. Fusion 2020, 62, 110–120.
25. Benzenati, T.; Kessentini, Y.; Kallel, A. Pansharpening approach via two-stream detail injection based on relativistic generative adversarial networks. Expert Syst. Appl. 2022, 188, 115996.
26. Wang, W.; Liu, H. An Efficient Detail Extraction Algorithm for Improving Haze-Corrected CS Pansharpening. IEEE Geosci. Remote Sens. Lett. 2022, 19, 5000505.
27. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
28. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307.
29. Nah, S.; Hyun Kim, T.; Mu Lee, K. Deep multi-scale convolutional neural network for dynamic scene deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3883–3891.
30. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
31. Zhong, J.; Yang, B.; Huang, G.; Zhong, F.; Chen, Z. Remote sensing image fusion with convolutional neural network. Sens. Imaging 2016, 17, 10.
32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
33. Wei, Y.; Yuan, Q.; Shen, H.; Zhang, L. Boosting the accuracy of multispectral image pansharpening by learning a deep residual network. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1795–1799.
34. Shao, Z.; Cai, J. Remote sensing image fusion with deep convolutional neural network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 1656–1669.
35. Wald, L.; Ranchin, T.; Mangolini, M. Fusion of satellite images of different spatial resolutions: Assessing the quality of resulting images. Photogramm. Eng. Remote Sens. 1997, 63, 691–699.
36. Choi, J.S.; Kim, Y.; Kim, M. S3: A spectral-spatial structure loss for pan-sharpening networks. IEEE Geosci. Remote Sens. Lett. 2019, 17, 829–833.
37. Chen, C.; Li, Y.; Liu, W.; Huang, J. Image fusion with local spectral consistency and dynamic gradient sparsity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2760–2765.
38. Palsson, F.; Sveinsson, J.R.; Ulfarsson, M.O.; Benediktsson, J.A. Quantitative quality evaluation of pansharpened imagery: Consistency versus synthesis. IEEE Trans. Geosci. Remote Sens. 2015, 54, 1247–1259.
39. Aiazzi, B.; Alparone, L.; Baronti, S.; Garzelli, A. Context-driven fusion of high spatial and spectral resolution images based on oversampled multiresolution analysis. IEEE Trans. Geosci. Remote Sens. 2002, 40, 2300–2312.
40. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
41. Yuhas, R.H.; Goetz, A.F.; Boardman, J.W. Discrimination among semi-arid landscape endmembers using the spectral angle mapper (SAM) algorithm. In Summaries of the Third Annual JPL Airborne Geoscience Workshop, Volume 1: AVIRIS Workshop; JPL and NAS; Colorado University: Boulder, CO, USA, 1992. Available online: https://ntrs.nasa.gov/citations/19940012238 (accessed on 1 April 2022).
42. Wald, L. Data Fusion: Definitions and Architectures: Fusion of Images of Different Spatial Resolutions; Presses des MINES, 2002. Available online: https://hal-mines-paristech.archives-ouvertes.fr/hal-00464703 (accessed on 1 April 2022).
43. Choi, M. A new intensity–hue–saturation fusion approach to image fusion with a trade-off parameter. IEEE Trans. Geosci. Remote Sens. 2006, 44, 1672–1682.
44. Zhou, J.; Civco, D.L.; Silander, J. A wavelet transform method to merge Landsat TM and SPOT panchromatic data. Int. J. Remote Sens. 1998, 19, 743–757.
45. Wang, Z.; Bovik, A.C. A universal image quality index. IEEE Signal Process. Lett. 2002, 9, 81–84.
46. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
47. Alparone, L.; Aiazzi, B.; Baronti, S.; Garzelli, A.; Nencini, F.; Selva, M. Multispectral and panchromatic data fusion assessment without reference. Photogramm. Eng. Remote Sens. 2008, 74, 193–200.
48. Vivone, G. Robust band-dependent spatial-detail approaches for panchromatic sharpening. IEEE Trans. Geosci. Remote Sens. 2019, 57, 6421–6433.
49. Vivone, G.; Alparone, L.; Garzelli, A.; Lolli, S. Fast reproducible pansharpening based on instrument and acquisition modeling: AWLP revisited. Remote Sens. 2019, 11, 2315.
50. Vivone, G.; Alparone, L.; Chanussot, J.; Dalla Mura, M.; Garzelli, A.; Licciardi, G.A.; Restaino, R.; Wald, L. A critical comparison among pansharpening algorithms. IEEE Trans. Geosci. Remote Sens. 2014, 53, 2565–2586.
51. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500.
Figure 1. The framework of the proposed DiTBN.
Figure 2. Fusion results of a group of reduced-resolution samples on the Ikonos dataset.
Figure 3. Residual images of the fusion results shown in Figure 2.
Figure 4. Fusion results of a group of reduced-resolution samples on the GeoEye-1 dataset.
Figure 5. Residual images of the fusion results shown in Figure 4.
Figure 6. Fusion results of a group of reduced-resolution samples on the WorldView-3 dataset.
Figure 7. Residual images of the fusion results shown in Figure 6.
Figure 8. Fusion results of a group of full-resolution samples on the Ikonos dataset.
Figure 9. Fusion results of a group of full-resolution samples on the GeoEye-1 dataset.
Figure 10. Fusion results of a group of full-resolution samples on the WorldView-3 dataset.
Table 1. The main characteristics of the three satellites (spectral range in nm; spatial resolution in m).

| Satellite | Coastal | Blue | Green | Yellow | Red | Red Edge | NIR | NIR2 | PAN | PAN Res. | MS Res. |
| Ikonos | - | 450–530 | 520–610 | - | 640–720 | - | 760–860 | - | 450–900 | 1 | 4 |
| GeoEye-1 | - | 450–510 | 510–580 | - | 655–690 | - | 780–920 | - | 450–900 | 0.5 | 2 |
| WorldView-3 | 400–450 | 450–510 | 510–580 | 585–625 | 630–690 | 705–745 | 770–895 | 860–1040 | 450–800 | 0.31 | 1.24 |
Table 2. The division of the datasets (sizes are given as LRMS size, PAN size).

| Satellite | Source Image Groups (LRMS, PAN Sizes) | Data Type | Number of Groups | Number of Patches (LRMS, PAN Sizes) |
| Ikonos | 240 × (200 × 200, 800 × 800) | Train | 192 | 12,288 × (8 × 8, 32 × 32) |
| | | Valid | 24 | 1536 × (8 × 8, 32 × 32) |
| | | Test | 24 | 24 × (50 × 50, 200 × 200) |
| GeoEye-1 | 250 × (200 × 200, 800 × 800) | Train | 200 | 12,800 × (8 × 8, 32 × 32) |
| | | Valid | 25 | 1600 × (8 × 8, 32 × 32) |
| | | Test | 25 | 25 × (50 × 50, 200 × 200) |
| WorldView-3 | 250 × (200 × 200, 800 × 800) | Train | 200 | 12,800 × (8 × 8, 32 × 32) |
| | | Valid | 25 | 1600 × (8 × 8, 32 × 32) |
| | | Test | 25 | 25 × (50 × 50, 200 × 200) |
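The patch sizes in Table 2 reflect the scale ratio of 4 between the LRMS and PAN images (8 × 8 LRMS patches paired with 32 × 32 PAN patches). The sketch below is a minimal illustration of how such co-registered patch pairs could be cropped on a regular grid; the function name, stride, and grid-based sampling are our assumptions, not the authors' preprocessing code, and the per-image patch counts in Table 2 (64 pairs per source image) suggest a sparser or random sampling in practice.

```python
import numpy as np

def extract_patch_pairs(lrms, pan, ms_patch=8, ratio=4, stride=8):
    """Crop co-registered LRMS/PAN patch pairs on a regular grid.

    lrms: (H, W, C) low-resolution multispectral image
    pan:  (H*ratio, W*ratio) panchromatic image
    Returns lists of (ms_patch, ms_patch, C) and (ms_patch*ratio, ms_patch*ratio) patches.
    """
    ms_patches, pan_patches = [], []
    H, W = lrms.shape[:2]
    for i in range(0, H - ms_patch + 1, stride):
        for j in range(0, W - ms_patch + 1, stride):
            ms_patches.append(lrms[i:i + ms_patch, j:j + ms_patch])
            pi, pj, p = i * ratio, j * ratio, ms_patch * ratio
            pan_patches.append(pan[pi:pi + p, pj:pj + p])
    return ms_patches, pan_patches

# Example with the source-image sizes of Table 2: a 200 x 200 LRMS image and its 800 x 800 PAN image.
lrms = np.random.rand(200, 200, 4)
pan = np.random.rand(800, 800)
ms_p, pan_p = extract_patch_pairs(lrms, pan)
print(len(ms_p), ms_p[0].shape, pan_p[0].shape)  # 625 (8, 8, 4) (32, 32)
```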
Table 3. Ablation study of different network structures on the Ikonos dataset.

| Structure | SAM↓ | ERGAS↓ | RASE↓ | SCC↑ | Q↑ | SSIM↑ |
| HPF2, reuse2 | 2.6461 | 1.8537 | 7.5817 | 0.9472 | 0.8691 | 0.9543 |
| HPF1, reuse2 | 2.5097 | 1.7846 | 7.2815 | 0.9530 | 0.8737 | 0.9578 |
| HPF1, no reuse | 2.5196 | 1.7861 | 7.2836 | 0.9531 | 0.8739 | 0.9577 |
| HPF1, reuse1 | 2.5298 | 1.7929 | 7.3119 | 0.9530 | 0.8747 | 0.9577 |
Table 4. Ablation study of different loss functions on the Ikonos dataset.

| Loss | SAM↓ | ERGAS↓ | RASE↓ | SCC↑ | Q↑ | SSIM↑ |
| The MSE loss function | 2.5445 | 1.8060 | 7.3646 | 0.9518 | 0.8718 | 0.9569 |
| The loss function without L_s | 2.5191 | 1.7962 | 7.3250 | 0.9526 | 0.8733 | 0.9575 |
| The loss function without L_λ | 2.5263 | 1.7993 | 7.3456 | 0.9525 | 0.8738 | 0.9571 |
| The proposed loss function | 2.5097 | 1.7846 | 7.2815 | 0.9530 | 0.8737 | 0.9578 |
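Table 4 contrasts a plain MSE objective with the proposed composite loss containing a spectral term L_λ and a spatial term L_s, whose exact definitions are given earlier in the paper. The PyTorch sketch below only illustrates the general shape of such a composite objective; every operator and weight in it (avg-pool downsampling for the spectral term, band-mean gradients for the spatial term, the coefficients alpha and beta) is an assumption for illustration and not the authors' formulation.

```python
import torch
import torch.nn.functional as F

def composite_loss(fused, ref, lrms, pan, ratio=4, alpha=0.1, beta=0.1):
    """Illustrative composite loss: reconstruction + spectral (L_lambda) + spatial (L_s) terms.

    fused: (B, C, H, W) network output
    ref:   (B, C, H, W) reference HRMS (reduced-resolution training)
    lrms:  (B, C, H/ratio, W/ratio) input multispectral image
    pan:   (B, 1, H, W) input panchromatic image
    """
    # Base reconstruction term against the reference image.
    l_rec = F.mse_loss(fused, ref)

    # Spectral consistency: the downsampled fused image should resemble the LRMS input.
    l_lambda = F.mse_loss(F.avg_pool2d(fused, ratio), lrms)

    # Spatial consistency: gradients of the band-averaged fused image should resemble PAN gradients.
    def grads(x):
        return x[..., :, 1:] - x[..., :, :-1], x[..., 1:, :] - x[..., :-1, :]
    gx_f, gy_f = grads(fused.mean(dim=1, keepdim=True))
    gx_p, gy_p = grads(pan)
    l_s = F.mse_loss(gx_f, gx_p) + F.mse_loss(gy_f, gy_p)

    return l_rec + alpha * l_lambda + beta * l_s
```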
Table 5. Ablation study of different kernel sizes on the Ikonos dataset.

| Kernel Size | SAM↓ | ERGAS↓ | RASE↓ | SCC↑ | Q↑ | SSIM↑ |
| 1 × 1 | 2.5097 | 1.7846 | 7.2815 | 0.9530 | 0.8737 | 0.9578 |
| 3 × 3 | 2.5103 | 1.7870 | 7.2904 | 0.9531 | 0.8743 | 0.9577 |
Table 6. Comparison of indicator evaluation results on the reduced-resolution Ikonos dataset.

Indicator values for the sample shown in Figure 2:
| Methods | SAM↓ | ERGAS↓ | RASE↓ | SCC↑ | Q↑ | SSIM↑ |
| GS2-GLP | 3.3848 | 3.3752 | 13.0279 | 0.9331 | 0.9350 | 0.9191 |
| BDSD-PC | 3.5979 | 3.5113 | 13.5917 | 0.9300 | 0.9308 | 0.9171 |
| AWLP-H | 3.1770 | 3.4612 | 13.3689 | 0.9326 | 0.9337 | 0.9248 |
| MTF-GLP | 3.5667 | 3.8054 | 14.6481 | 0.9301 | 0.9207 | 0.9057 |
| PanNet | 2.6250 | 2.2085 | 8.5422 | 0.9783 | 0.9618 | 0.9581 |
| DiCNN1 | 2.5199 | 2.1116 | 8.1636 | 0.9811 | 0.9621 | 0.9612 |
| FusionNet | 2.4568 | 2.0755 | 8.0386 | 0.9815 | 0.9633 | 0.9628 |
| TFNet | 2.7608 | 2.6473 | 10.2455 | 0.9699 | 0.9509 | 0.9468 |
| Proposed | 2.3822 | 2.0604 | 7.9741 | 0.9817 | 0.9644 | 0.9636 |

The average indicator values:
| Methods | SAM↓ | ERGAS↓ | RASE↓ | SCC↑ | Q↑ | SSIM↑ |
| GS2-GLP | 3.8824 | 2.6749 | 10.7669 | 0.8818 | 0.7867 | 0.9070 |
| BDSD-PC | 3.8636 | 2.6820 | 10.7902 | 0.8885 | 0.7965 | 0.9117 |
| AWLP-H | 3.5207 | 2.6405 | 10.6203 | 0.8931 | 0.8098 | 0.9201 |
| MTF-GLP | 4.0959 | 2.7863 | 11.7074 | 0.8716 | 0.7747 | 0.8959 |
| PanNet | 2.6712 | 1.8804 | 7.6997 | 0.9444 | 0.8673 | 0.9527 |
| DiCNN1 | 2.6529 | 1.8696 | 7.6126 | 0.9466 | 0.8671 | 0.9535 |
| FusionNet | 2.5791 | 1.8169 | 7.4268 | 0.9500 | 0.8687 | 0.9562 |
| TFNet | 3.0225 | 2.2030 | 8.9109 | 0.9298 | 0.8392 | 0.9401 |
| Proposed | 2.5097 | 1.7846 | 7.2815 | 0.9530 | 0.8737 | 0.9578 |
Table 7. Comparison of indicator evaluation results on the reduced-resolution GeoEye-1 dataset.

Indicator values for the sample shown in Figure 4:
| Methods | SAM↓ | ERGAS↓ | RASE↓ | SCC↑ | Q↑ | SSIM↑ |
| GS2-GLP | 3.8439 | 4.1591 | 20.3177 | 0.7741 | 0.8599 | 0.9256 |
| BDSD-PC | 5.3266 | 3.6951 | 17.9216 | 0.8355 | 0.8640 | 0.9230 |
| AWLP-H | 3.6043 | 3.8770 | 18.0543 | 0.8226 | 0.8759 | 0.9313 |
| MTF-GLP | 3.8435 | 4.1684 | 20.2097 | 0.7749 | 0.8594 | 0.9251 |
| PanNet | 2.0523 | 1.9989 | 9.1007 | 0.9469 | 0.9063 | 0.9727 |
| DiCNN1 | 1.8240 | 1.9016 | 8.7054 | 0.9535 | 0.9190 | 0.9761 |
| FusionNet | 1.8441 | 1.8955 | 8.5859 | 0.9535 | 0.9110 | 0.9767 |
| TFNet | 2.1218 | 2.2806 | 10.3625 | 0.9315 | 0.8752 | 0.9675 |
| Proposed | 1.6372 | 1.7813 | 8.1048 | 0.9581 | 0.9216 | 0.9793 |

The average indicator values:
| Methods | SAM↓ | ERGAS↓ | RASE↓ | SCC↑ | Q↑ | SSIM↑ |
| GS2-GLP | 3.2744 | 2.8539 | 13.3318 | 0.8168 | 0.8206 | 0.8929 |
| BDSD-PC | 3.0881 | 2.5465 | 11.8565 | 0.8794 | 0.8601 | 0.9222 |
| AWLP-H | 2.4347 | 2.3470 | 10.8151 | 0.9124 | 0.8789 | 0.9427 |
| MTF-GLP | 2.9326 | 2.7003 | 12.2147 | 0.8766 | 0.8431 | 0.9186 |
| PanNet | 1.5137 | 1.4808 | 6.7133 | 0.9609 | 0.9320 | 0.9691 |
| DiCNN1 | 1.4790 | 1.4567 | 6.7126 | 0.9617 | 0.9326 | 0.9704 |
| FusionNet | 1.4653 | 1.4657 | 6.5833 | 0.9631 | 0.9294 | 0.9708 |
| TFNet | 1.6539 | 1.6889 | 7.6106 | 0.9499 | 0.9137 | 0.9633 |
| Proposed | 1.3642 | 1.3772 | 6.2741 | 0.9663 | 0.9380 | 0.9734 |
Table 8. Comparison of indicator evaluation results on the reduced-resolution WorldView-3 dataset.

Indicator values for the sample shown in Figure 6:
| Methods | SAM↓ | ERGAS↓ | RASE↓ | SCC↑ | Q↑ | SSIM↑ |
| GS2-GLP | 5.2773 | 3.6583 | 10.9792 | 0.9000 | 0.9411 | 0.9060 |
| BDSD-PC | 5.5379 | 3.9007 | 11.9896 | 0.8981 | 0.9402 | 0.9034 |
| AWLP-H | 5.4564 | 3.9955 | 11.3766 | 0.8769 | 0.9401 | 0.9099 |
| MTF-GLP | 5.1129 | 3.6498 | 11.0362 | 0.8957 | 0.9421 | 0.9090 |
| PanNet | 3.4405 | 2.2765 | 6.8848 | 0.9594 | 0.9785 | 0.9598 |
| DiCNN1 | 3.2887 | 2.1864 | 6.6807 | 0.9643 | 0.9803 | 0.9634 |
| FusionNet | 3.2403 | 2.1687 | 6.6682 | 0.9652 | 0.9803 | 0.9642 |
| Proposed | 3.0829 | 2.0869 | 6.2865 | 0.9665 | 0.9824 | 0.9668 |

The average indicator values:
| Methods | SAM↓ | ERGAS↓ | RASE↓ | SCC↑ | Q↑ | SSIM↑ |
| GS2-GLP | 5.2773 | 4.4745 | 10.0756 | 0.8602 | 0.8972 | 0.9125 |
| BDSD-PC | 6.3906 | 4.6631 | 10.5381 | 0.8659 | 0.8937 | 0.9064 |
| AWLP-H | 4.9514 | 4.5022 | 10.2020 | 0.8651 | 0.9094 | 0.9263 |
| MTF-GLP | 5.1142 | 4.4390 | 9.9565 | 0.8587 | 0.8998 | 0.9184 |
| PanNet | 3.3803 | 2.5651 | 6.1025 | 0.9590 | 0.9503 | 0.9681 |
| DiCNN1 | 3.2656 | 2.4916 | 5.9104 | 0.9628 | 0.9520 | 0.9700 |
| FusionNet | 3.1362 | 2.4790 | 5.9783 | 0.9642 | 0.9518 | 0.9709 |
| Proposed | 2.9323 | 2.3374 | 5.6141 | 0.9674 | 0.9556 | 0.9737 |
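As a point of reference for the reduced-resolution indicators reported in Tables 3–8, the NumPy sketch below computes SAM (in degrees, following the spectral angle mapper of [41]) and ERGAS (following the definition in [42]) between a fused image and its reference; the function names and implementation details are ours for illustration and are not code released with the paper, and the remaining indicators (RASE, SCC, Q, SSIM) are omitted here.

```python
import numpy as np

def sam_degrees(fused, ref, eps=1e-12):
    """Mean spectral angle (in degrees) between two (H, W, C) images."""
    dot = np.sum(fused * ref, axis=-1)
    norms = np.linalg.norm(fused, axis=-1) * np.linalg.norm(ref, axis=-1)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)
    return np.degrees(np.arccos(cos)).mean()

def ergas(fused, ref, ratio=4):
    """ERGAS = 100 * (1/ratio) * sqrt(mean_b((RMSE_b / mean_b)^2)) for (H, W, C) images."""
    rmse = np.sqrt(np.mean((fused - ref) ** 2, axis=(0, 1)))
    means = np.mean(ref, axis=(0, 1))
    return 100.0 / ratio * np.sqrt(np.mean((rmse / means) ** 2))
```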
Table 9. Comparison of indicator evaluation results on the full-resolution Ikonos dataset.

Indicator values for the sample shown in Figure 8:
| Methods | SAM↓ | QNR↑ | D_λ↓ | D_S↓ |
| GS2-GLP | 1.7843 | 0.7440 | 0.1313 | 0.1436 |
| BDSD-PC | 2.3185 | 0.8121 | 0.0694 | 0.1274 |
| AWLP-H | 1.7375 | 0.7812 | 0.1114 | 0.1208 |
| MTF-GLP | 1.7896 | 0.7469 | 0.1320 | 0.1395 |
| PanNet | 1.8068 | 0.8655 | 0.0653 | 0.0740 |
| DiCNN1 | 1.9846 | 0.8282 | 0.0597 | 0.1193 |
| FusionNet | 1.3841 | 0.8875 | 0.0283 | 0.0867 |
| TFNet | 1.6332 | 0.9012 | 0.0626 | 0.0385 |
| Proposed | 1.2243 | 0.9114 | 0.0256 | 0.0647 |

The average indicator values:
| Methods | SAM↓ | QNR↑ | D_λ↓ | D_S↓ |
| GS2-GLP | 1.5272 | 0.7301 | 0.1407 | 0.1663 |
| BDSD-PC | 2.0043 | 0.8021 | 0.0823 | 0.1389 |
| AWLP-H | 1.4829 | 0.7583 | 0.1332 | 0.1410 |
| MTF-GLP | 1.5746 | 0.7186 | 0.1510 | 0.1680 |
| PanNet | 1.5322 | 0.8443 | 0.0705 | 0.0999 |
| DiCNN1 | 1.6077 | 0.8337 | 0.0647 | 0.1203 |
| FusionNet | 1.1487 | 0.8436 | 0.0569 | 0.1146 |
| TFNet | 1.2775 | 0.7940 | 0.1057 | 0.1162 |
| Proposed | 1.0702 | 0.8499 | 0.0484 | 0.1144 |
Table 10. Comparison of indicator evaluation results on the full-resolution GeoEye-1 dataset.

Indicator values for the sample shown in Figure 9:
| Methods | SAM↓ | QNR↑ | D_λ↓ | D_S↓ |
| GS2-GLP | 1.4054 | 0.8336 | 0.0548 | 0.1181 |
| BDSD-PC | 2.4208 | 0.9694 | 0.0172 | 0.0137 |
| AWLP-H | 1.2412 | 0.8825 | 0.0417 | 0.0791 |
| MTF-GLP | 1.3570 | 0.8188 | 0.0664 | 0.1229 |
| PanNet | 1.3215 | 0.9465 | 0.0168 | 0.0374 |
| DiCNN1 | 1.2947 | 0.9567 | 0.0034 | 0.0400 |
| FusionNet | 0.7754 | 0.9289 | 0.0310 | 0.0414 |
| TFNet | 0.6595 | 0.9441 | 0.0243 | 0.0322 |
| Proposed | 0.6559 | 0.9679 | 0.0059 | 0.0264 |

The average indicator values:
| Methods | SAM↓ | QNR↑ | D_λ↓ | D_S↓ |
| GS2-GLP | 0.8168 | 0.8373 | 0.0532 | 0.1170 |
| BDSD-PC | 1.2219 | 0.9036 | 0.0287 | 0.0702 |
| AWLP-H | 0.7132 | 0.8837 | 0.0437 | 0.0764 |
| MTF-GLP | 0.7895 | 0.8056 | 0.0701 | 0.1348 |
| PanNet | 0.7997 | 0.9155 | 0.0289 | 0.0575 |
| DiCNN1 | 0.7984 | 0.9210 | 0.0226 | 0.0582 |
| FusionNet | 0.5210 | 0.9235 | 0.0215 | 0.0564 |
| TFNet | 0.5324 | 0.9199 | 0.0262 | 0.0553 |
| Proposed | 0.4390 | 0.9319 | 0.0239 | 0.0459 |
Table 11. Comparison of indicator evaluation results on the full-resolution WorldView-3 dataset.

Indicator values for the sample shown in Figure 10:
| Methods | SAM↓ | QNR↑ | D_λ↓ | D_S↓ |
| GS2-GLP | 1.5944 | 0.8970 | 0.0379 | 0.0677 |
| BDSD-PC | 1.8056 | 0.9322 | 0.0151 | 0.0536 |
| AWLP-H | 1.6647 | 0.9034 | 0.0412 | 0.0578 |
| MTF-GLP | 1.5049 | 0.8812 | 0.0491 | 0.0733 |
| PanNet | 1.6182 | 0.9632 | 0.0106 | 0.0264 |
| DiCNN1 | 1.6209 | 0.9470 | 0.0120 | 0.0416 |
| FusionNet | 1.3064 | 0.9647 | 0.0102 | 0.0254 |
| Proposed | 1.1929 | 0.9548 | 0.0055 | 0.0400 |

The average indicator values:
| Methods | SAM↓ | QNR↑ | D_λ↓ | D_S↓ |
| GS2-GLP | 1.3276 | 0.8405 | 0.0645 | 0.1027 |
| BDSD-PC | 1.9317 | 0.8708 | 0.0483 | 0.0867 |
| AWLP-H | 1.3235 | 0.8404 | 0.0729 | 0.0959 |
| MTF-GLP | 1.3326 | 0.8240 | 0.0763 | 0.1090 |
| PanNet | 1.4054 | 0.8925 | 0.0452 | 0.0673 |
| DiCNN1 | 1.5031 | 0.8784 | 0.0444 | 0.0834 |
| FusionNet | 1.1291 | 0.9173 | 0.0256 | 0.0599 |
| Proposed | 1.0475 | 0.9072 | 0.0284 | 0.0677 |
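For the full-resolution comparisons in Tables 9–11, the no-reference protocol of [47] is used, in which QNR is commonly computed as QNR = (1 − D_λ)^α (1 − D_S)^β with α = β = 1. The snippet below is a consistency check we added (not code from the paper) that reproduces the tabulated QNR of the "Proposed" row for the Figure 8 sample from its D_λ and D_S values.

```python
def qnr(d_lambda, d_s, alpha=1.0, beta=1.0):
    """QNR index of Alparone et al. (2008): higher is better, 1 is ideal."""
    return (1.0 - d_lambda) ** alpha * (1.0 - d_s) ** beta

# 'Proposed' row of Table 9 (Figure 8 sample): D_lambda = 0.0256, D_S = 0.0647
print(round(qnr(0.0256, 0.0647), 4))  # 0.9114, matching the tabulated QNR value
```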
Table 12. Performance comparison of CNN-based methods in terms of the number of parameters and the number of FLOPs.

| | PanNet | DiCNN1 | FusionNet | TFNet | Proposed |
| Parameters | 0.083 M | 0.047 M | 0.079 M | 2.363 M | 0.403 M |
| FLOPs | 20.72 B | 6.12 B | 20.58 B | 18.77 B | 52.76 B |
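For readers comparing model sizes as in Table 12, the PyTorch sketch below counts trainable parameters of a network; the toy two-layer model is purely a stand-in (the real DiTBN architecture is described in the paper), and FLOPs are typically measured with a third-party profiler, as noted in the comment.

```python
import torch
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Number of trainable parameters of a PyTorch model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Toy stand-in for a pansharpening CNN (4 MS bands + 1 PAN band in, 4 bands out).
toy = nn.Sequential(
    nn.Conv2d(5, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 4, kernel_size=3, padding=1),
)
print(f"{count_parameters(toy) / 1e6:.3f} M parameters")

# FLOPs are usually reported via a profiler, e.g. the thop package:
# from thop import profile
# flops, params = profile(model, inputs=(torch.randn(1, 5, 256, 256),))
```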
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
