A Pansharpening Generative Adversarial Network with Multilevel Structure Enhancement and a Multistream Fusion Architecture

Deep learning has been widely used in various computer vision tasks. As a result, researchers have begun to explore the application of deep learning for pansharpening and have achieved remarkable results. However, most current pansharpening methods focus only on the mapping relationship between images, lack overall structure enhancement, and do not fully explore optimization goals and fusion rules. To address these problems, we propose a pansharpening generative adversarial network with multilevel structure enhancement and a multistream fusion architecture. This method first uses multilevel gradient operators to obtain the structural information of the high-resolution panchromatic image. Then, it combines the spectral features with multilevel gradient information and inputs them into two subnetworks of the generator for fusion training. We design a comprehensive optimization goal for the generator, which not only minimizes the gap between the fused image and the real image but also considers the adversarial loss between the generator and the discriminator and the multilevel structure loss between the fused image and the panchromatic image. In addition, we use both the spectral information and the multilevel structure as the input of the discriminator, which makes it easier for the discriminator to distinguish real and fake images. Experiments show that our proposed method is superior to state-of-the-art methods in both the subjective visual and objective assessments of fused images, especially in road and building areas.


Introduction
Owing to technological limitations, a single sensor cannot simultaneously obtain remote sensing images with high resolution in both the spectral and spatial domains. Currently, high-resolution panchromatic (PAN) components and low-resolution spectral components are usually acquired instead [1]. However, in many fields neither component alone can match a remote sensing image that has high resolution in both the spectral and spatial domains. Therefore, in practical applications, it is better to combine the spectral and spatial components [2], that is, to obtain high-resolution multispectral images by fusing low-resolution multispectral images and high-resolution PAN images [3].
A classic and simple pansharpening method is to perform component replacement [4]. It is mainly divided into two categories. The first category is to transform the multispectral (MS) image in the appropriate domain and use the high-resolution PAN image to replace the components in the domain (e.g., principal component analysis (PCA) [5], the intensity-hue-saturation (IHS) transform [6], and the band-dependent spatial-detail (BDSD) Different types of structural information are input into specific subnets to better maintain structural and spectral information.
(3) We design a comprehensive loss function that considers the spectral loss, the multilevel structure loss, and the adversarial loss. The multilevel structure loss combines two types of gradient operators to give a better optimization direction for network training, so that structural information is extracted more fully.
(4) To make it easier for the discriminator to distinguish real and fake images, we provide it with as much spectral and structural information as possible.
The remainder of the paper is organized as follows. Section 2 introduces the related work. Section 3 describes the proposed method. Section 4 presents the experiments and discussion. Section 5 concludes the paper.

Related Work
As pansharpening has attracted much attention, deep learning methods have been widely applied to it. Researchers have proposed many deep learning-based pansharpening methods that follow different strategies and show excellent nonlinear expression ability. Some methods use a simple shallow convolutional network as the training architecture and extract features from the input data using different techniques and strategies. For example, PCNN uses a simple three-layer convolutional network and manually extracts important features such as the normalized difference water index (NDWI) as the input of the network [15]. Other methods introduce modules or architectures that are widely used in other fields of deep learning. For example, PanNet introduces a residual network and uses high-pass filtering to extract high-frequency features from the input images, so that the network only needs to recover high-frequency information and can transfer between satellites with different imaging value ranges [17]. In [22], the authors introduced densely connected convolutional networks [23], which improved the ability to express spectral and spatial characteristics. In [24], the authors proposed a multiscale channel attention mechanism for pansharpening based on the channel attention mechanism originally applied to image classification. This method considers the interdependence between channels and uses the attention mechanism to recalibrate them, so as to represent features more accurately. Both PSGAN and Pan-GAN [25] use a generative adversarial network as the main architecture. PSGAN uses a dual-stream input to allow feature-level image fusion instead of pixel-level fusion. 
Pan-GAN establishes adversarial games between the generator and both a spectral discriminator and a spatial discriminator, so as to retain the rich spectral information of the multispectral image and the spatial information of the panchromatic image. Some other methods improve the loss function to optimize the training direction of the network. For example, in [26], the authors proposed a perceptual loss function and further optimized the model based on high-level features in the near-infrared space. In general, the purpose of pansharpening is to obtain high-resolution multispectral images through fusion while preserving the spectral information of the multispectral images and the spatial information of the panchromatic images to the greatest extent. The methods mentioned above each focus on improving a single aspect or on simply applying a single technique. They do not jointly exploit image preprocessing, feature extraction, attention modules, and loss function improvements. How to reasonably combine several of these techniques in one pansharpening method is therefore a critical question, and it inspired our work.

Pansharpening Based on a Variational Model and a GAN
The purpose of pansharpening is to fuse low-resolution multispectral (LRMS) images and high-resolution PAN images. $P \in \mathbb{R}^{H \times W}$ represents the PAN image, $M = (M_1, M_2, \ldots, M_B) \in \mathbb{R}^{(H/r) \times (W/r) \times B}$ represents the LRMS image, $M{\uparrow} = (M_1{\uparrow}, M_2{\uparrow}, \ldots, M_B{\uparrow}) \in \mathbb{R}^{H \times W \times B}$ represents the LRMS image after upsampling, $X = (X_1, X_2, \ldots, X_B) \in \mathbb{R}^{H \times W \times B}$ represents the image obtained by fusion, and $Y = (Y_1, Y_2, \ldots, Y_B) \in \mathbb{R}^{H \times W \times B}$ represents the real high-resolution multispectral (HRMS) image, where $b = 1, 2, \ldots, B$ indexes the channels and $B$ is the number of channels of the image. The ratio $r$ is 4 in this paper; that is, the resolution ratio of the PAN image to the MS image is 4:1.
Some pansharpening methods based on variational models transform the pansharpening process into an optimization problem with reasonable hypotheses, achieving a good balance between spectral preservation and spatial restoration in fused images. Variational approaches usually assume that the spatial information associated with each band of a fused image is consistent with that in the PAN image and that the spectral information of the downsampled fused image is consistent with that in the LRMS image. For instance, Chen et al. [27] and Zeng et al. [12] use a first-order finite difference operator to extract the sparse spatial structure information from the PAN image. Wang et al. [28] use a second-order finite difference operator to extract spatial positions from the PAN image, such as corners, strongly textured regions, and edges. Inspired by these methods, we use two kinds of finite difference operators to extract the sparse spatial structure information from the PAN image. The first is the first-order finite difference operator, which we call the first-level gradient operator for simplicity. The second is the second-level gradient operator. The second-level gradient operator is in fact also a first-order finite difference operator; the main difference between the two operators is whether the difference is taken across an interval of one pixel. The two operators are shown in Figure 1. We use $\nabla$ and $\nabla\nabla$ to represent the first-level and second-level gradient operators, respectively. $\nabla_h P$ and $\nabla_v P$ represent the gradient information in the two directions obtained by the first-level gradient operator, where the subscript $h$ denotes the horizontal direction and the subscript $v$ denotes the vertical direction. $\nabla\nabla_h P$ and $\nabla\nabla_v P$ represent the gradient information in the two directions obtained by the second-level gradient operator. 
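The two gradient operators described above can be illustrated with a minimal sketch. It assumes a forward difference between adjacent pixels for the first-level operator and a forward difference that skips one pixel for the second-level operator; the exact stencils are those of Figure 1, so the function names, the sign convention, and the zero-filled boundary used here are assumptions for illustration only.

```python
def gradient(image, step, axis):
    """Forward finite difference with a given pixel step.

    axis="h" differences along the horizontal direction (across columns);
    axis="v" differences along the vertical direction (across rows).
    Positions whose neighbor falls outside the image are left at 0.0,
    which is an assumed boundary convention.
    """
    if axis not in ("h", "v"):
        raise ValueError("axis must be 'h' or 'v'")
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            ni, nj = (i, j + step) if axis == "h" else (i + step, j)
            if ni < height and nj < width:
                out[i][j] = image[ni][nj] - image[i][j]
    return out


def first_level(image, axis):
    """Assumed first-level operator: difference of adjacent pixels."""
    return gradient(image, 1, axis)


def second_level(image, axis):
    """Assumed second-level operator: difference across a one-pixel interval."""
    return gradient(image, 2, axis)


# Tiny hypothetical PAN patch (3 rows x 4 columns).
pan = [
    [0.0, 1.0, 4.0, 9.0],
    [0.0, 1.0, 4.0, 9.0],
    [1.0, 2.0, 5.0, 10.0],
]
grad_h1 = first_level(pan, "h")
grad_h2 = second_level(pan, "h")
grad_v1 = first_level(pan, "v")
grad_v2 = second_level(pan, "v")
```

Both operators are first-order differences; only the distance between the two sampled pixels changes, which is why the second-level result responds to slightly coarser structure than the first-level result.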
Through experiments, we found that the structural information of the PAN image extracted by the second-level gradient operator is still rich, so we use this additional structure to enhance the pansharpening performance. The two types of spatial structure information are shown in Figure 2. In the general form of the objective function of the fusion process, $f(\cdot)$ can be regarded as a mapping function that produces $X$ from $(M, \nabla_h P, \nabla_v P, \nabla\nabla_h P, \nabla\nabla_v P)$; that is, it takes $M$, $\nabla_h P$, $\nabla_v P$, $\nabla\nabla_h P$, and $\nabla\nabla_v P$ as the input and, after a series of feature extractions, outputs the pansharpened HRMS image $X$. $\Theta$ represents the parameter set of the model. Therefore, we can reformulate pansharpening as an image generation problem that can be handled with a GAN. We first use a generator to map the joint distribution $p_d(M, \nabla_h P, \nabla_v P, \nabla\nabla_h P, \nabla\nabla_v P)$ to the target distribution $p_r(Y)$. Then, a corresponding discriminator is designed to estimate the probability that a sample comes from the training data rather than from the generator $G$, so that adversarial training drives the generator to produce a pansharpened image $X$ that is closer to the target image $Y$.
To simplify the presentation, we use $h$ and $v$ to represent all the gradient information in the respective directions (detailed expressions are given only when needed). The problem can therefore be expressed as the min-max problem of Equation (2), where $[\cdot]$ represents the concatenation operation, which stacks two or more tensors along the channel dimension, and $\omega_b$ represents the relative weight coefficients of different satellite sensors obtained from the modulation transfer function (MTF) [10]. In addition, we improve the constraints of the generator based on the assumption of prior information and define the generator loss in terms of an adversarial term $L_{adv}$ and a content term $L_c$.
$L_{adv}$ represents the adversarial loss between the generator $G$ and the discriminator $D$.
$L_c$ is used to minimize the gap between the fused image and the real image, measured in terms of both spectra and structure. This design is inspired by the variational model in two respects: first, the use of structural information; second, the use of its energy function to design the loss function. The first term of $L_c$ is the spectral fidelity term. The second and third terms are the structure fidelity terms, which calculate the multilevel structure loss between the fused image and the PAN image. $\lambda$ is a hyperparameter used to balance $L_{adv}$ and $L_c$, and $\mu_1$ and $\mu_2$ weight the information losses of the two sparse structures within $L_c$. In these expressions, $N$ represents the number of training samples. We derive the discriminator loss function based on the GAN principle.
In summary, we propose using multilevel gradient operators to extract different levels of spatial features and design the corresponding loss function. The loss function considers the spectral loss and the adversarial loss and also combines two types of gradient operators into a multilevel structure loss, which gives a better optimization direction for network training.
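As a rough sketch of how the terms described above might fit together, one plausible form is written out below. The choice of norm $\|\cdot\|$, the band-weighted intensity $\tilde{X} = \sum_b \omega_b X_b$ used to compare the fused image with the PAN image, the placement of $\lambda$, and the standard logarithmic GAN form of the discriminator loss are all assumptions made for illustration and are not taken from the paper's equations. The conditioning inputs of $D$ (upsampled MS image and gradient maps) are omitted for brevity.

```latex
\begin{aligned}
\tilde{X} &= \sum_{b=1}^{B} \omega_b X_b, \\
L_G &= L_{adv} + \lambda L_c, \\
L_c &= \frac{1}{N} \sum_{n=1}^{N} \Big(
      \big\| X^{(n)} - Y^{(n)} \big\|
      + \mu_1 \sum_{d \in \{h, v\}} \big\| \nabla_d \tilde{X}^{(n)} - \nabla_d P^{(n)} \big\|
      + \mu_2 \sum_{d \in \{h, v\}} \big\| \nabla\nabla_d \tilde{X}^{(n)} - \nabla\nabla_d P^{(n)} \big\|
      \Big), \\
L_D &= -\frac{1}{N} \sum_{n=1}^{N} \Big[ \log D\big(Y^{(n)}\big) + \log\Big(1 - D\big(X^{(n)}\big)\Big) \Big].
\end{aligned}
```

Under this reading, the first term of $L_c$ is the spectral fidelity term and the $\mu_1$ and $\mu_2$ terms are the first-level and second-level structure fidelity terms.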

Multi-Stream Structure Generator and Discriminator
Following the overall design of the algorithm in Section 3.1, this paper proposes a generator with a multistream structure, as shown in Figure 3. Unlike traditional deep learning algorithms, the proposed method introduces gradient information as a constraint; that is, the first-level and second-level gradient operators are used to extract the structure of the PAN image. We upsample the MS image to the resolution of the PAN image to obtain the MS↑ image. The MS↑ image, the two gradient constraints, and the original MS image are taken together as the input of the generator. Unlike a basic CNN, which directly concatenates the images to be fused as input, we use subnetworks at the bottom of the network to extract hierarchical features from the MS↑ image and the two types of structural information, so that the spectral and spatial information yield rich primary features. Then, after a series of feature extractions and supplementations, the two types of structural information are each combined with the spectral features by concatenation. Finally, the joint features of the two results are mapped, and the required fused image is reconstructed through transposed convolution decoding. A convolution with a kernel size of 2 × 2, a stride of 2, and SAME padding replaces the downsampling operation, which retains features better. We use the leaky rectified linear unit (Leaky ReLU) activation function proposed in [29]. Inspired by U-Net [30], we adjust the network structure through skip connections; that is, the features of lower layers are added to higher layers. The detailed architecture and convolution parameters are shown in Figure 3, where blue boxes represent convolutional layers without downsampling, yellow boxes represent convolutional layers with downsampling, and red boxes represent transposed convolutional (deconvolutional) layers with upsampling. 
To set up a zero-sum game in the process of generating the fused image, this paper designs a discriminator based on a neural network. The discriminator is a simple five-layer convolutional neural network that distinguishes whether each sample is a real HRMS image or a fused MS image. The detailed architecture is shown in Figure 4, where yellow boxes represent convolutional layers with downsampling, blue boxes represent convolutional layers without downsampling, and the purple box represents the sigmoid function. Because the sigmoid function plays a special role in our network, we draw it separately after the final convolutional layer. The spectral information contains the upsampled MS image and either the fused image $X$ or the real image $Y$. The structural information contains the two-level horizontal and vertical gradient information of the PAN image and of either the fused image or the real image. For the Gaofen-2 dataset, the input data has 14 channels. From the first layer to the fourth layer, convolution kernels with a size of 3 × 3 and a stride of 2 are used for feature extraction. To reduce the effects of noise, none of our convolutional filters use padding. The last layer uses a sigmoid function to calculate the probability that the pixels in an image belong to a real image. The remaining convolutional layers are all activated by the Leaky ReLU activation function. In the implementation, each sample is passed through the discriminator twice; the two inputs differ only in whether they contain the fused image $X$ or the real image $Y$. We compute the expectation of the logarithmic terms for the two outputs, and the final result is defined as the loss of the discriminator.
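The effect of the two convolution settings described above on the spatial size of the feature maps can be checked with a short calculation. The sketch below assumes a 128 × 128 input, two downsampling layers in the generator, and the usual SAME (output = ceil(size / stride)) and no-padding (output = floor((size - kernel) / stride) + 1) conventions; the helper names and the layer counts are hypothetical and chosen only for illustration.

```python
import math


def conv_output_size(size, kernel, stride, padding):
    """Output length of a convolution along one spatial dimension.

    padding="same"  -> ceil(size / stride), independent of the kernel size.
    padding="valid" -> floor((size - kernel) / stride) + 1, i.e. no padding.
    """
    if padding == "same":
        return math.ceil(size / stride)
    if padding == "valid":
        return (size - kernel) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")


def trace_sizes(size, layers):
    """Return the spatial size after each (kernel, stride, padding) layer."""
    sizes = [size]
    for kernel, stride, padding in layers:
        size = conv_output_size(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


# Generator-style downsampling: 2 x 2 kernel, stride 2, SAME padding
# (two layers assumed here for illustration).
generator_sizes = trace_sizes(128, [(2, 2, "same")] * 2)

# Discriminator layers 1 to 4: 3 x 3 kernel, stride 2, no padding.
discriminator_sizes = trace_sizes(128, [(3, 2, "valid")] * 4)

print(generator_sizes)
print(discriminator_sizes)
```

With SAME padding each stride-2 layer halves the size exactly, whereas without padding the discriminator shrinks the map slightly faster than a factor of two per layer.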

Experimental Setup
We use the Gaofen-2 and WorldView-2 datasets to verify the effectiveness of the proposed method. The PAN images of the Gaofen-2 and WorldView-2 datasets have only one band, and their resolutions are 0.81 m and 0.5 m, respectively. The MS image of the Gaofen-2 dataset has four bands, namely, the red, green, blue, and near-infrared (NIR) bands, and its resolution is 3.24 m. The MS image of the WorldView-2 dataset has eight bands, namely, the blue, green, red, coastal, yellow, red-edge, and two NIR bands, and its resolution is 1.8 m. Our experiments consist of three parts: a reduced-resolution experiment, an ablation experiment, and a full-resolution experiment [31]. For comparison, we select some of the algorithms introduced in the first section of this article: the multiresolution analysis algorithms ATWT and MTF-Generalized LP (MTF-GLP) [10], the component replacement algorithm BDSD, and the variational approach using spectral consistency and dynamic gradient sparsity (DGSF) [27]. We also include our earlier method [21], a generative adversarial network with structural enhancement and spectral supplement for pansharpening that was inspired by deep learning-based methods and variational approaches (denoted "self-comparison" in our comparative experiments). For the parameters of each method, we use the settings recommended by the authors of the corresponding references so that each method achieves its best results.
All the deep learning-based methods in this paper are trained on an NVIDIA Tesla V100 SXM2 16 GB GPU and an Intel Xeon Gold 6148 CPU at 2.40 GHz. We use the PyTorch framework to implement the deep learning-based methods and compare the computational time of each network. We use the Adam [32] algorithm as the optimizer. For training our proposed network, the batch size is set to 16, the learning rate is set to 0.0002, the parameters of the Adam optimizer are set to 0.5 and 0.99, and training runs for a total of 20 epochs. Based on the results of many experiments, the weight hyperparameters $\lambda$, $\mu_1$, and $\mu_2$ are set to 90, 80, and 40, respectively. The hyperparameters of the other models are consistent with those in the original papers. Once trained, the network can be reused for inference on data from the same source over a long period. The inference time required for our test set is usually within 1-2 s, which is at the same level as traditional component replacement algorithms. Table 1 lists the time cost, number of parameters, and FLOPs for network training in the reduced-resolution experiment on the WorldView-2 dataset.
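To make the role of the optimizer settings concrete, the following minimal sketch applies a single Adam update to one scalar parameter. It assumes that the two reported Adam parameters 0.5 and 0.99 are the exponential decay rates usually written as β1 and β2, and it uses the commonly used default `eps = 1e-8`; both are assumptions, and the function name `adam_step` is hypothetical.

```python
import math


def adam_step(param, grad, m, v, t, lr=2e-4, beta1=0.5, beta2=0.99, eps=1e-8):
    """One Adam update for a scalar parameter.

    m and v are the running first and second moment estimates,
    and t is the 1-based step counter used for bias correction.
    """
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Hypothetical starting point: parameter 1.0 with gradient 2.0.
param, m, v = adam_step(param=1.0, grad=2.0, m=0.0, v=0.0, t=1)
print(param, m, v)
```

On the first step the bias-corrected update has magnitude close to the learning rate regardless of the gradient scale, so the parameter moves by roughly 0.0002.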

Reduced-Resolution Experiment
We conducted a reduced-resolution experiment according to the Wald protocol [33], in which we used the original LRMS images as a reference. We first crop the original LRMS images into patches of size 128 × 128, which serve as the reference HRMS images, and the original PAN images into patches of size 512 × 512. Before downsampling, we smooth all patches using a Gaussian filter that matches the MTF of the sensor [10,15,31]. We then downsample them to sizes of 32 × 32 and 128 × 128, respectively. Finally, we construct the corresponding dataset for training and testing, in which one MS image of size 32 × 32, one PAN image of size 128 × 128, and one HRMS image of size 128 × 128 form a sample pair. We expect the fused images, of size 128 × 128, to be as close as possible to the HRMS images. In our reduced-resolution experiment, we upsample the MS images in the training set using the interpolation kernel proposed in [9] and use the result as input. From the samples obtained from the Gaofen-2 dataset, we selected 12,800 sample pairs as the training samples and 256 sample pairs as the testing samples. Since WorldView-2 has more data than Gaofen-2, we selected 12,800 sample pairs as the training samples and 576 sample pairs as the testing samples. For evaluation, we mainly use the spectral angle mapper (SAM) [31], the relative dimensionless global error in synthesis (ERGAS) [33], the n-band extension of the universal image quality index (UIQI), denoted $Q_n$ [34], and the spatial correlation coefficient (SCC) [35]. From the experimental results obtained on the testing set, we sampled ten small images of size 256 × 256, measured the quality indexes at these ten locations separately, averaged the results, and compared the algorithms. 
For special objects in the fused image, such as land, vegetation areas, buildings, and roads, we conduct local-area experiments. The process of our reduced-resolution experiment is shown in Figure 5.
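Two of the reference-based indexes named above, SAM and ERGAS, have simple closed forms that can be sketched as follows. The sketch assumes their commonly used definitions (mean spectral angle in degrees, and ERGAS = (100 / r) · sqrt(mean over bands of (RMSE_b / mean_b)^2) with r = 4); the image layout (a flat list of pixels, each a list of band values) and the function names are hypothetical choices for illustration, and the evaluation code used in the experiments may differ in details.

```python
import math


def sam_degrees(fused, reference):
    """Mean spectral angle, in degrees, between corresponding pixel spectra."""
    angles = []
    for x, y in zip(fused, reference):
        dot = sum(a * b for a, b in zip(x, y))
        norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
        if norm == 0.0:
            continue  # skip all-zero spectra (assumed convention)
        cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding
        angles.append(math.degrees(math.acos(cosine)))
    return sum(angles) / len(angles) if angles else 0.0


def ergas(fused, reference, ratio=4):
    """ERGAS for a PAN-to-MS resolution ratio of `ratio` (4 in this paper)."""
    n_pixels = len(reference)
    n_bands = len(reference[0])
    total = 0.0
    for b in range(n_bands):
        mse = sum((x[b] - y[b]) ** 2 for x, y in zip(fused, reference)) / n_pixels
        mean_ref = sum(y[b] for y in reference) / n_pixels
        total += mse / (mean_ref ** 2)
    return (100.0 / ratio) * math.sqrt(total / n_bands)


# Hypothetical two-pixel, two-band example.
reference = [[1.0, 2.0], [3.0, 4.0]]
scaled = [[2.0, 4.0], [6.0, 8.0]]  # same spectral direction, double intensity
print(sam_degrees(scaled, reference))
print(ergas(scaled, reference))
```

The example shows why the two indexes are reported together: scaling every spectrum leaves the spectral angle essentially unchanged, while ERGAS penalizes the intensity error.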
It can be seen from Figure 5 that the experimental settings differ slightly between the traditional methods and the deep learning-based methods. The traditional methods perform fusion on the image before cropping. There are two main reasons for this design. The first is that traditional methods, such as BDSD and DGSF, need hyperparameters to be set to generate reasonable pansharpened images. Taking the Gaofen-2 dataset as an example, our testing image size is 2048 × 2048, which can be split into 256 patches of size 128 × 128, and it is hard to adjust the model hyperparameters for each patch. The second is that deep learning-based methods must take memory and time complexity into account and therefore usually cut testing images into small patches for processing. In addition, sharpening image patches and then stitching them into the testing image places higher demands on the pansharpening algorithm: because the image patches lack gradient information around their edges, the fused image is prone to grid artifacts. Therefore, the indexes of the deep learning methods would generally improve if the testing image were sharpened directly rather than patch by patch, whereas the indexes of the traditional methods would generally be lower if image patches were sharpened rather than the whole testing image. Moreover, in our experiments, images generated by deep learning methods are generally better than those generated by traditional methods. Therefore, we only guarantee consistent experimental conditions within the deep learning methods and within the traditional methods, respectively. We then select an area of 512 × 512 pixels for display and zoom in on some details. It can be seen from the displayed results, especially Figure 6, that the four deep learning methods preserve spectral and spatial structure information better than the other methods. 
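The patch-based processing described above (cutting a testing image into non-overlapping patches, sharpening each patch, and stitching the results back together) can be sketched as follows. The helper names are hypothetical, images are represented as plain nested lists, and the sharpening step itself is omitted, so this is only an illustration of the tiling bookkeeping.

```python
def split_into_patches(image, patch):
    """Split an H x W nested list into non-overlapping patch x patch tiles.

    Tiles are returned in row-major order.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tiles.append([row[left:left + patch] for row in image[top:top + patch]])
    return tiles


def stitch_patches(tiles, height, width):
    """Reassemble row-major tiles into an image of size height x width."""
    patch = len(tiles[0])
    per_row = width // patch
    image = [[None] * width for _ in range(height)]
    for index, tile in enumerate(tiles):
        top = (index // per_row) * patch
        left = (index % per_row) * patch
        for r in range(patch):
            image[top + r][left:left + patch] = tile[r]
    return image


# Small demonstration: an 8 x 8 image cut into 4 x 4 patches.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
tiles = split_into_patches(image, 4)
restored = stitch_patches(tiles, 8, 8)

# Number of 128 x 128 patches in a 2048 x 2048 testing image.
n_test_patches = (2048 // 128) ** 2
print(len(tiles), n_test_patches)
```

Because each tile is processed without its neighbors, pixels along tile borders have no surrounding gradient context, which is the source of the grid artifacts mentioned above.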
Traditional methods perform noticeably poorly in areas such as vegetation and soil. The DGSF fusion result has obvious spots. ATWT and MTF-GLP oversharpen some areas and distort texture details. BDSD performs well spectrally, but it shows obvious spatial detail distortion in the form of large blurred areas. Although the deep learning-based methods outperform the traditional methods, PCNN and PanNet still suffer from insufficient feature extraction and structure preservation. It can be observed from the displayed results, especially those for Gaofen-2, that PCNN and PanNet perform poorly with respect to image details. In the hyperspectral region, we can observe that PSGAN removes some high-frequency details as noise, resulting in excessive denoising. For the proposed method, subtle differences can be observed in the enlarged detail areas of the image. In general, the proposed algorithm achieves the best fusion effect in terms of both the performance in the hyperspectral region and the restoration of the overall spectral and spatial information. The details are also visible in the residual images shown in Figure 8 (Gaofen-2) and Figure 9 (WorldView-2). The residual image is obtained by subtracting the fusion result from the original LRMS image. In theory, the less texture the residual image contains, the better the fusion result is.
More detailed comparisons are shown in Table 2 (Gaofen-2) and Table 3 (WorldView-2). We mark the best indicator in bold in each subsequent table. From these tables, we can see that the traditional algorithms perform poorly in terms of the various quality indexes. The indexes of the neural network methods are significantly improved compared with those of the traditional algorithms. Considering the overall effectiveness of the image fusion step, the proposed method outperforms all the methods considered.

Ablation Experiment
For the ablation experiment, we extracted the main functional modules. We use only the first-level gradient operator to extract the structural information, and we remove the loss of the sparse structural information, extracted by the second-level gradient operator, in L c (represented by "Only_spatial1" in the experimental part) to verify the function of the proposed new structural information extraction operator. In addition, we input two kinds of sparse structural information together with the spectral information into the generator with only one subnet to verify the function of the multistream structure compensation generator (represented by "One_subnet" in the experimental part).
We used the Gaofen-2 and WorldView-2 datasets to conduct ablation experiments to demonstrate the effectiveness of each module. Specifically, we use "backbone" to denote the basic network after removing the relevant modules. We use One_subnet to verify the role of the multistream structure generator, that is, the two types of structural information are concatenated and fed into a generator with only one branch. We use Only_spatial1 to verify the effectiveness of the second-level gradient operator, that is, only the first-level gradient operator is fed into the multibranch generator and the constraint on the second-level gradient operator is removed from the loss. In addition, we use the final network model with all modules as a comparison. The same dataset as in the reduced-resolution experiment was used for testing and verification. The experimental results show that the proposed final network achieves good results. In addition, when we enable or disable individual modules, some indicators change significantly. The indicator values are shown in Table 4, and the fused images and residual images are shown in Figures 6-9. The results in Table 4 show that adding the second-level gradient operator significantly improves the SCC index, especially on the WorldView-2 dataset. The multistream structure generator has an obvious effect on the overall result, especially on the $Q_n$ index. The details of the fused images and residual images lead to a conclusion consistent with the indicators.

Full-Resolution Experiment
For the full-resolution experiments, we directly use the original LRMS images and PAN images as input and feed them into the model trained in the reduced-resolution experiment. We still cut the MS image to a size of 32 × 32 and the PAN image to a size of 128 × 128. Unlike the reduced-resolution experiment, the full-resolution experiment has no reference image with which to evaluate the quality of the fusion result. Therefore, we use the quality with no reference (QNR) index [36] to evaluate the results. The QNR index includes an index $D_\lambda$ for evaluating the loss of spectral detail and an index $D_s$ for evaluating the loss of spatial detail. Figure 10 shows the full-resolution image fusion results on the WorldView-2 testing set, where we again zoom in on a 100 × 100 area. Judging from the index results in Table 5, the neural network algorithms, especially the algorithm in this paper, are significantly better than the traditional algorithms. The algorithm in this paper uses gradient operators to extract the structural information of the PAN image, which better preserves the spatial structure information, and the corresponding spatial loss index $D_s$ is greatly improved. Moreover, because the method in this paper uses a well-designed loss function, its spectral loss index $D_\lambda$ is better than that of the other neural network algorithms. The overall indicators show that DGSF and MTF-GLP perform poorly in the full-resolution experiments on both the spectral loss indicator $D_\lambda$ and the spatial loss indicator $D_s$. The spectral loss $D_\lambda$ of ATWT is acceptable, but its $D_s$ is not very good. BDSD achieves good results among the traditional methods, even better than PCNN, but its overall performance is not as good as that of the other neural network-based methods. The details of the zoomed-in images lead to a conclusion consistent with the index results. 
As can be seen from Figure 10, the PanNet fusion result is good overall in terms of structural and spectral information, but it is weaker in highly saturated colors and structural details. Although the PSGAN algorithm achieves advantages on some indicators in the reduced-resolution experiments, it exhibits oversharpening in the full-resolution experiments, where the images are clearer. Compared with the other algorithms, the algorithm proposed in this paper reduces spectral distortion to a greater extent and preserves structural texture information more fully.
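The way the two distortion indexes discussed above combine into a single QNR score can be sketched in a few lines. The sketch assumes the commonly used definition QNR = (1 - D_λ)^α · (1 - D_s)^β with α = β = 1; the exponents, the function name, and the example values are assumptions for illustration and are not results from the tables.

```python
def qnr(d_lambda, d_s, alpha=1.0, beta=1.0):
    """Combine spectral distortion d_lambda and spatial distortion d_s.

    Both distortions are expected to lie in [0, 1]; lower is better.
    The returned score lies in [0, 1]; higher is better.
    """
    if not (0.0 <= d_lambda <= 1.0 and 0.0 <= d_s <= 1.0):
        raise ValueError("distortion indexes must lie in [0, 1]")
    return (1.0 - d_lambda) ** alpha * (1.0 - d_s) ** beta


# Hypothetical distortion values, not taken from Table 5.
example_score = qnr(0.1, 0.2)
print(example_score)
```

Because the score is a product, a method must keep both the spectral and the spatial distortion low to obtain a high QNR; a good value on one index cannot compensate for a poor value on the other.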

Local Area Experiment
To demonstrate the advantages of the proposed method, we perform further experiments on local areas in the WorldView-2 images. We take the land, vegetation areas, buildings, and roads in the fused image as the main objects and test the quality indexes in both the reduced-resolution and the full-resolution experiments. In these experiments, we select ten areas for each main object, each of which mainly contains the corresponding object, and take the average of the ten values as the final quality index value. We select three well-performing pansharpening methods based on deep neural networks for the comparative experiments.
For the final experimental results, we selected one area of each object type for display, as shown in Figures 11-18, and report the index results in Tables 6-13. In the reduced-resolution experiments, the proposed method is better than the compared methods on most quality indexes, in particular the $Q_n$ index, which evaluates overall image quality, and the SCC index, which better characterizes spatial quality. This is due to the design of the multilevel gradient operators, loss function, and generator that we propose. For the $Q_n$ index, the improvement is smallest in the vegetation and land areas and largest in the road areas, where it is about 1.5% higher than the best competing result. In the reduced-resolution image displays, differences are hard to see with the naked eye, but subtle differences can still be observed. For example, in the red part of the building area, the color saturation of PCNN and PanNet is insufficient, and architectural details are rendered poorly. The color depth of PSGAN in the vegetation area is not rich. For the full-resolution experiments, we mainly focus on the index evaluation results, because there is no reference image. The proposed method performs best on the road area indexes; in particular, the QNR index rises by about 5%. In the full-resolution fused images, PCNN performs poorly in areas with high color saturation, and stitching seams are visible in local areas. Although PanNet is superior to PCNN in terms of spectral information, it is still insufficient in preserving the details of building and road areas. PSGAN achieves good visual effects in the full-resolution experimental images, but the images are oversmoothed. 
In addition, its retention of spectral information and its level of structural detail are insufficient.

Conclusions
This paper proposes a pansharpening generative adversarial network with multilevel structure enhancement and a multistream fusion architecture. Unlike other neural network methods, we use multilevel gradient operators to obtain sparse structure information when processing panchromatic images. Moreover, we specifically design a multistream fusion CNN architecture for the GAN generator to better maintain structural information. In addition, we no longer rely solely on minimizing the gap between the fused image and the reference image; on this basis, we also include the GAN loss and the information loss corresponding to the multilevel structure in the optimization objective. As mentioned above, our generator does not use a simple shallow network, and the fusion result retains more spectral information than shallow neural network methods. Owing to our design of the generator and loss function around the multilevel gradient operators, our method retains structural information better than comparable deep neural network methods. In the experimental part, we use the representative remote sensing image datasets Gaofen-2 and WorldView-2 to verify and analyze the proposed method. Experimental results show that our method is better than the state-of-the-art methods, especially in building and road areas. The success of our method shows that extracting as much structural information from the panchromatic image as possible and using a multistream network structure can effectively improve the quality indexes. However, the loss function of our method involves multiple hyperparameters, which complicates its application in different fields. In the future, we will design more innovative network architectures and reduce the reliance on hyperparameters.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available because they have been preprocessed and involve laboratory intellectual property rights.

Conflicts of Interest:
We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that there is no professional or other personal interest of any nature or kind in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled "A Pansharpening Generative Adversarial Network with Multilevel Structure Enhancement and a Multistream Fusion Architecture".