UPGAN: An Unsupervised Generative Adversarial Network Based on U-Shaped Structure for Pansharpening

Pansharpening is the fusion of panchromatic images and multispectral images to obtain images with high spatial resolution and high spectral resolution, which have a wide range of applications. At present, methods based on deep learning can fit the nonlinear features of images and achieve excellent image quality; however, the images generated with supervised learning approaches lack real-world applicability. Therefore, in this study, we propose an unsupervised pansharpening method based on a generative adversarial network. Considering the fine tubular structures in remote sensing images, a dense connection attention module is designed based on dynamic snake convolution to recover the details of spatial information. In the stage of image fusion, the fusion of features in groups is applied through the cross-scale attention fusion module. Moreover, skip layers are implemented at different scales to integrate significant information, thus improving the objective index values and visual appearance. The loss function contains four constraints, allowing the model to be effectively trained without reference images. The experimental results demonstrate that the proposed method outperforms other widely accepted state-of-the-art methods on the QuickBird and WorldView2 data sets.


Introduction
With the rapid development of remote sensing technology, the increasing number of satellites has made access to multiple types of information easier. Remote sensing images provide assistance in research on resources, geological disasters, and the environment [1][2][3], and they are of great help in the investigation of ground objects. Unfortunately, due to the limitations of satellite storage and image transmission technologies, most commercial satellites apply two modalities and capture two kinds of images with complementary characteristics: low-resolution multispectral (LRMS) images are characterized by multiple spectral bands, whereas panchromatic (PAN) images have precise spatial resolution but contain little spectral information. As improving the resolution of remote sensing images through the development of satellite hardware is a difficult task, one approach to obtain ideal high-resolution images is pansharpening, in which the complementary information from LRMS images and PAN images is fused and redundant information is suppressed [4]. The pansharpened results are high-resolution multispectral (HRMS) images, which have come to play increasingly important roles in target detection, semantic segmentation, land object classification, and other downstream tasks [5][6][7].
After the launch of the Earth Observation System SPOT-1 in the 1980s, pansharpening developed rapidly over the following 30 years. The goal of pansharpening is to improve both spectral information and spatial resolution. Generative adversarial networks (GANs) [8] are widely used in image generation tasks and are suitable for pansharpening. On the one hand, the generator learns the feature representation from the training image and generates new images without reference to the training sample. On the other hand, the discriminator distinguishes whether the fused image comes from the generator or the real sample, which can be regarded as a kind of annotation. Meanwhile, the double discriminators satisfy the needs of spectral and spatial constraints, respectively. The adversarial mechanism of the generator and discriminator improves the performance of unsupervised pansharpening.
While unsupervised learning exploits full-scale information, information at different scales still cannot be ignored. The U-shaped structure is a multi-scale structure [9], which utilizes both low-resolution semantic information and high-resolution detail information. For the study of pansharpening, the spectral information of low-resolution images is rich, and the spatial information of high-resolution images is rich. Through the fusion of low-level feature maps and high-level feature maps, the effective understanding of high-level semantic features and low-level spatial information is realized. The symmetrical U-shaped structure makes the feature fusion more thorough and further enhances the performance of the model. Moreover, there are plenty of tubular structures with morphological changes in remote sensing images; as such, if tubular structures can be reconstructed distinctly, the quality of the image can be improved.
In this article, we propose a powerful pansharpening method based on a GAN model whose generator is based on the U-shaped structure. Dual discriminators and a hybrid loss are defined to compensate for the lack of reference images and to balance the performance in spatial and spectral indices. Our major contributions can be summarized as follows:

1. An unsupervised pansharpening method is proposed based on a GAN, called UPGAN, which can be trained without relying on reference images. The model consists of a generator with a U-shaped structure as the backbone and two discriminators for spatial and spectral identification.

2. Dynamic snake convolution is introduced into the multi-scale detail feature learning module in order to learn tubular features in different directions. A cross-scale attention fusion block takes advantage of the rich features of adjacent scales, which are grouped and fused to obtain fine large-scale features.

3. The hybrid loss function constrains the pansharpened images at high resolution and low resolution, respectively. The proposed method greatly improves the performance on the no-reference images and provides state-of-the-art results in both qualitative and quantitative senses.
The remainder of this paper is organized as follows. Section 2 is devoted to the related literature. The proposed network architecture is detailed in Section 3. The experimental results and the related discussions are presented in Section 4. Finally, the conclusions are summarized in Section 5.

Background and Related Works

Pansharpening Methods
Pansharpening methods can be roughly divided into four categories [10]: component substitution (CS) methods, multiresolution analysis (MRA) methods, variational optimization (VO) methods, and machine learning (ML)-based methods. The first two classes have a fundamental role as conventional methods. As the technology has evolved, significant improvements have been made in recent years, with deep learning (DL)- and variational optimization-based methods gradually flourishing.
CS-based methods, also called spectral methods, assume that the spatial details of the image can be separated and replaced. The LRMS image is projected to a suitable transformed domain, such as intensity-hue-saturation (IHS) space. Then, the separated spatial components, either partial or total, are replaced with those of the PAN image. Due to its simplicity and high speed, IHS [11] is well known for the fusion of PAN and LRMS images with three channels. If the data contain four or more channels, the generalized IHS (GIHS) [12] can be applied to pansharpening. Moreover, common methods in this context include the principal component analysis (PCA) transform [13], the Brovey transform [14], and smoothing filter-based intensity modulation (SFIM) [15]. Much effort has also been devoted to improving the injection rules, which focus on exploiting the relationship between the pixel values of PAN and MS images, such as the partial replacement adaptive CS (PRACS) [16] and adaptive GS (GSA) [17] implementations. The hypothesis that a linear combination of MS image bands approximates the PAN image has been widely accepted, although it neglects the inherent spatial and spectral properties. Hence, an inappropriate definition of the weights can lead to serious distortion.
MRA-based methods, referred to as spatial methods, generally decompose PAN and MS images into multiple scales. The extracted spatial information is then injected into the MS image at different scales. Typical decomposition algorithms include wavelet transforms [18,19], Laplacian pyramids [20], and the curvelet transform [21]. Considering the specificity of the acquisition sensor, the authors of [22,23] introduced information about the acquisition sensor into the decomposition scheme. Moreover, the performance was improved through the introduction of a nonlinear method and optimization of the injection coefficient. The advantage of MRA-based methods is that less spectral distortion is produced. However, these methods are sensitive to spatial information. The injection of high-frequency information may result in aliasing effects and the blurring of contours and textures. Synthesizing these two classical fusion methods, hybrid technology using CS and MRA approaches has emerged, including CS followed by MRA (CS+MRA) and MRA followed by CS (MRA+CS). The most important hybrid technology is CS+MRA, which carries out decomposition in the transformation domain and then projects back into MRA classes.
Pansharpening is regarded as an optimization problem in VO-based methods. Specifically, the relationship between the HRMS images and the observed images is established according to the sensor model, which is estimated from PAN and LRMS images. As reconstruction from low- to high-resolution images is ill-conditioned, which can lead to noise amplification, several types of regularization approaches have been introduced to mitigate this ill conditioning. The estimation problem lies in the establishment of a cost function, including a fidelity term that describes the relationship between the HRMS image and the observed image, and a regularization term that incorporates certain prior beliefs about the HRMS image into the optimization process. Ballester et al. [24] first proposed P+XS, which is built on three groundbreaking assumptions. Both sparse regularization [25] and Bayesian [26] methods fall into the VO family. Fasbender et al. [27] hypothesized a joint Gaussian model for the unknown MS image and the PAN image. The earliest work on sparse representation was proposed by Li and Yang [28], whose idea was to represent unknown HRMS images as sparse linear combinations of dictionary elements. The sparse representation theory was introduced in SR-D [29], which involved the development of a signal reconstruction procedure using a reduced number of measurements. However, most VO methods rely on one or more regularization parameters that need to be selected by the user. Moreover, the energy function and prior knowledge require complex and time-consuming calculation, especially when considering images at large scales.
Deep learning is a new milestone in the field of pansharpening research. The promising capability of deep learning models to capture complex nonlinear relationships and extract features based on multi-layer neural networks has resulted in their widespread use in various fields of computer vision [30,31], such as image classification, image super-resolution, and image colorization. The modified sparse denoising autoencoder (MSDA) algorithm [32] was the first attempt to conduct pansharpening leveraging a convolutional neural network. Subsequently, methods based on deep learning for pansharpening have continued to emerge.
Most of the existing DL-based methods follow the supervised learning paradigm, which satisfies the synthesis properties of Wald's protocol [33]. First, once the fused image is degraded to its original resolution, it should be as identical as possible to the original image. Second, the fused images should be as identical as possible to the image observed by the corresponding sensor at the highest resolution. Third, the multispectral set of fused images should be as identical as possible to the multispectral set of images that the corresponding sensor would observe at the highest resolution. The simulated data sets are acquired by degrading the original high-resolution images. Relying on the reference images, the network is trained to update its parameters by minimizing the loss between the fused results and the pseudo-ground truth MS images. Afterwards, the full-resolution data are used to test the pre-trained network. In 2016, a pansharpening method (PNN) [34] consisting of a three-layer convolutional neural network (CNN) received widespread attention. Scarpa et al. [35] introduced residual connections based on the PNN structure and adopted the training mode of target-adaptive fine-tuning to enhance its generalization ability on several data sets. PanNet [36] combines domain-specific knowledge with neural networks to train network parameters in the high-frequency domain. However, the methods mentioned above simply apply a single branch for feature extraction, ignoring the distinct spatial and spectral features of the source images. Liu et al. [37] investigated a network with two branches to carry out fusion in the feature domain, which first encodes input images into high-level feature representations and then reconstructs high-resolution images. A unified two-stage spatial and spectral network, called UTSN [38], has been proposed; it contains a spatial enhancement network, which is trained and shared on hybrid data sets, and a spectral adjustment network, which is used to capture the spectral characteristics of a specific satellite. However, it should be noted that supervised learning models generate simulated results with limited real-world applicability; furthermore, the process of training fails to make full use of the original high-resolution information, potentially resulting in scale mismatches.
As for unsupervised learning frameworks, which are built based on the concept of consistency in Wald's protocol, the problem of the unavailability of reference images can be tackled by designing appropriate loss functions and backbones. Luo et al. [39] designed an unsupervised network that can be modeled by PAN-guided feature fusion. The PAN images serve as the guidance for spatial information construction in order to recover details at high spatial resolution. Due to the complex spectral characteristics of MS images, an unsupervised pansharpening method with a self-attention mechanism [40] was proposed. The stacked self-attention network contains an attention representation layer that naturally identifies the spectral characteristics of mixed pixels with sub-pixel accuracy. Re-blurring blocks and graying blocks are applied in LDP-Net [41], allowing it to learn degradation processes at different resolutions. The speed of inference can be improved through the use of a target-adaptive inference scheme. Therefore, target-adaptive processing has been introduced into many methods, such as Lambda-PNN [42] and Fast Z-PNN [43]. Faced with the challenges associated with limited training data, a zero-shot semi-supervised method for pansharpening (ZS-Pan) [44] was developed, which serves as a plug-and-play module. The SURE loss function, based on Stein's unbiased risk estimate [45], was introduced into an unsupervised network to avoid overfitting. MetaPan [46] solved the problem of setting key hyperparameters manually. The meta-learning stage optimizes for an internal representation of network parameters that is adaptive to specific image pairs.
In addition, there are a large number of unsupervised networks based on generative adversarial networks (GANs) [8]. As a pioneering work, Ma et al. [47] proposed an unsupervised pansharpening method termed PanGAN. MDSSC-GAN SAM [48] focuses on high-frequency information and utilizes dual discriminators: a geometric discriminator, which optimizes image texture and geometry, and a chromaticity discriminator, which preserves the spectral resolution. Motivated by the cycle-consistent adversarial network (CycleGAN [49]), Li et al. [50] proposed a self-supervised framework in which the fused images are successively passed through two generators for improved performance. Zhou et al. [51] also proposed a cycle-consistent generative adversarial network (UCGAN) to bridge the gap between reduced and full resolution. ZeRGAN [52] is a zero-reference generative adversarial network whose structure consists of a set of multi-scale generators and discriminators. The training process involves only a pair of images, and accurate fused results are generated. Ozcelik et al. [53] adopted a new perspective that regarded pansharpening as the task of colorization. The self-supervised mode overcomes the shortcomings of spatial detail loss and ambiguity in CNN-based models.

Attention Mechanism
Due to their limited attention resources, humans generally capture the discriminative regions of an image rather than obtaining its whole information at once. Similarly, it is also necessary to emphasize important parts of the image to obtain critical information in the process of training. It is worth noting that the attention mechanism [54] can suppress irrelevant redundant information and recognize complementary properties between PAN and MS images.
In 2014, Google DeepMind introduced visual attention into the image classification task [55], thus pioneering the attention mechanism. Jaderberg et al. [56] proposed the spatial transformer network (STN) to learn the affine transformations of images and predict important regions of the input. As a plug-and-play module, STN can be seamlessly integrated into a model to improve its robustness to a certain extent. The lightweight squeeze-and-excitation (SE) block [57] can explicitly model the dependencies among channels in order to enhance the capability of feature representation. Unfortunately, the complexity of this module is high due to the use of fully connected layers. Therefore, subsequent studies have proposed improved modules, such as GSoP-Net [58] and SRM [59]. As a representative hybrid attention module, the convolutional block attention module (CBAM) [60] consists of a spatial attention mechanism and a channel attention mechanism to learn contextual information. At the same time, the bottleneck attention module (BAM) [61] utilizes dilated convolution to infer both feature and location information. Self-attention was first utilized in natural language processing tasks, where it showed great potential for development. Subsequently, the vision transformer (ViT) [62] and Swin transformer [63] were designed in succession in order to train models in parallel and capture the global context features of images.

The Proposed Network
In this section, the proposed method is described in detail. First, the overall architecture is introduced, which is followed by the structure of the generator, the discriminators, and the loss function. Let $MS \in \mathbb{R}^{h \times w \times c}$ be a low-resolution MS image with $c$ channels and a size of $h \times w$, and let $PAN \in \mathbb{R}^{H \times W}$ represent a high-resolution PAN image with size $H \times W$. Furthermore, $\widehat{MS} \in \mathbb{R}^{H \times W \times c}$ denotes the pansharpened image.

Overall Network Framework
Figure 1 displays an overview of UPGAN. The detailed architecture can be divided into two parts: the generator and the dual discriminators. The input of the generator is the stack of PAN and MS images. The distinct source images lead to a gap in spatial resolution. Thus, the MS images are interpolated to the same spatial resolution as the PAN images using bicubic interpolation. This alignment helps the network extract features and preserve complete structural information. In the generator, a convolutional layer is applied to extract the shallow features, which maps the input pair into the feature domain. Then, the fused image is progressively generated through the stages of feature extraction and image reconstruction, which are composed of cascaded dense connection attention modules (DCAMs). The DCAM comprises two types of modules: multi-scale detail feature learning (MSDFL) and spatial-spectral attention (SSA) blocks. The feature fusion stage consists of cross-scale attention fusion modules (CSAFs). Finally, a convolution sets the number of output channels to $c$. The flow of the model mentioned above is expressed in Algorithm 1, where $F_i$ represents the feature maps, $i$ represents the number of DCAM modules that the features have passed through, and $\mathrm{CSAF}_{i-3}$ denotes the output of image fusion. The condition $i = 0$ denotes that shallow features are extracted from the input through the convolutional layer. The range $0 < i \le 3$ corresponds to the stage of feature extraction, and $3 < i \le 6$ corresponds to the stage of image reconstruction, in which feature fusion is conducted. Two discriminators are employed to improve the spatial and spectral information, respectively, which judge whether the input is real or fake. As an unsupervised training method, the composition of the loss function without the reference image is crucial. A hybrid loss function with various constraints optimizes the training process, which consists of four parts: index loss, adversarial loss, spectral loss, and spatial loss.
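The stage logic described for Algorithm 1 can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the callables `conv_in`, `dcam`, `csaf`, and `conv_out` are hypothetical placeholders for the corresponding modules, resampling between scales is omitted, and the mirrored skip pairing (stage $i$ fused with $F_{6-i}$) is an assumption based on the symmetric U-shaped structure.

```python
def generator_flow(pan, ms_up, conv_in, dcam, csaf, conv_out):
    """Trace the seven feature stages F_0 ... F_6 of the generator.

    All four callables are hypothetical stand-ins for the real modules.
    """
    feats = [conv_in(pan, ms_up)]              # i = 0: shallow features
    for i in range(1, 7):
        if i <= 3:                             # 0 < i <= 3: feature extraction
            feats.append(dcam(feats[i - 1]))
        else:                                  # 3 < i <= 6: image reconstruction
            # Assumed skip pairing: fuse the deeper feature with F_{6-i}.
            fused = csaf(feats[i - 1], feats[6 - i])
            feats.append(dcam(fused))
    return conv_out(feats[6])                  # map back to c output channels


# Tiny demonstration with stubs that only record the call order.
trace = []

def _conv_in(pan, ms_up):
    trace.append("conv_in")
    return 0

def _dcam(feature):
    trace.append("dcam")
    return feature + 1

def _csaf(deep, skip):
    trace.append("csaf")
    return deep + skip

def _conv_out(feature):
    trace.append("conv_out")
    return feature

result = generator_flow(None, None, _conv_in, _dcam, _csaf, _conv_out)
print(trace)
```

With these stubs, the trace contains one input convolution, six DCAM calls, three CSAF calls, and one output convolution, which matches the stage counts given in the text.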

Generative Network
In remote sensing images, tubular structures with complex morphological changes are present in different regions. As shown in Figure 2, slender and fragile local structures are present in all images, such as straight roads in the city, tubular paths in the country, ridges in cultivated land, and slender paths in the desert. When the images present unknown morphological structures, the model may overfit these features, resulting in weak generalization. Considering the above problems, dynamic snake convolution [64] is introduced in order to adaptively focus on fine local structures, enhance the perception ability of the model, and optimize the features of tubular structures in different directions. Based on standard convolution, deformation offsets are added to the coordinates along the x-axis and y-axis, such that the convolution kernel has more flexibility to adapt to specific complex geometric features. For instance, with a convolution kernel size of 9 and $K_i$ as the center position, a nearby pixel position along the x-axis is expressed as $K_{i \pm s} = (x_{i \pm s}, y_{i \pm s})$, where $s = \{0, 1, 2, 3, 4\}$ represents the horizontal distance between the current location and the center grid. The selection of each grid position is related to the previous pixel position with the offset $\Delta = \{\delta \mid \delta \in [-1, 1]\}$, which can be regarded as a process of dynamic programming. As the offsets gradually accumulate, the convolution kernel follows a linear morphological structure. Specifically, the horizontal spatial locations $K_{i \pm s}$ and the vertical spatial locations $K_{j \pm s}$ are given by Equations (1) and (2), respectively. As the offset is usually not an integer, bilinear interpolation is applied, in which $K$ represents the fractional positions of Equations (1) and (2), $K'$ enumerates all integer spatial positions, and $B$ is the bilinear interpolation kernel, which is separated into two one-dimensional kernels. Due to the variation along the x- and y-axes, the convolution covers a 9 × 9 range during deformation.
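The accumulation of offsets and the interpolation step can be illustrated with a small NumPy sketch. It is a simplified illustration under stated assumptions, not the formulation of [64] verbatim: the function names are hypothetical, only the x-axis branch is shown, and the one-dimensional kernel is assumed to be the usual linear kernel $b(a, a') = \max(0, 1 - |a - a'|)$.

```python
import numpy as np

def snake_positions_x(center, offsets_right, offsets_left):
    """Grid positions of a snake kernel unrolled along the x-axis.

    center: (x_i, y_i); offsets_*: per-step vertical offsets, clipped to [-1, 1].
    Each position adds the offsets accumulated since the center (assumed rule).
    """
    x_i, y_i = center
    d_r = np.clip(np.asarray(offsets_right, dtype=float), -1.0, 1.0)
    d_l = np.clip(np.asarray(offsets_left, dtype=float), -1.0, 1.0)
    right = [(x_i + s, y_i + float(d_r[:s].sum())) for s in range(1, len(d_r) + 1)]
    left = [(x_i - s, y_i + float(d_l[:s].sum())) for s in range(1, len(d_l) + 1)]
    return left[::-1] + [(x_i, float(y_i))] + right

def linear_kernel(a, b):
    """One-dimensional interpolation kernel b(a, a') = max(0, 1 - |a - a'|)."""
    return np.maximum(0.0, 1.0 - np.abs(a - b))

def sample_column(column, y):
    """Sample a 1D signal at a fractional position y by linear interpolation."""
    idx = np.arange(len(column))
    return float((linear_kernel(idx, y) * column).sum())

# Kernel size 9: four steps on each side of the center (10, 5).
positions = snake_positions_x((10, 5), [0.5, 0.5, -1.0, 0.0], [0.0, 0.0, 0.0, 0.0])
print(positions)
print(sample_column(np.array([0.0, 10.0, 20.0, 30.0]), 1.5))
```

Because every position depends on the sum of the preceding offsets, neighbouring sampling points can never drift more than one pixel apart vertically, which keeps the kernel continuous along the structure.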
The overall flowchart of DCAM is graphically illustrated in Figure 3; it consists of MSDFL modules and an SSA block. As illustrated in Figure 4, the MSDFL module is employed to perceive the significant features and extract the precise details. Standard convolution is combined with dynamic snake convolutions along the two axes to learn local deformations. The different local information from the three kinds of convolution is stacked to obtain the feature map on a large scale. Subsequently, convolution is performed to fuse the adaptive features. The dense connections feed the output of the current layer to all other layers that follow in the architecture, thereby maximizing the transfer of information across all layers.
After the feature integration of two cascaded MSDFL blocks, iteratively optimized during the training process, the features are enhanced by the SSA module. Considering the spectral and spatial fidelity in remote sensing images, the dual-branch structure of spatial and channel attention is utilized for attention learning, as shown in Figure 5. In spatial attention, global average pooling and max pooling are combined and applied along the channel dimension. The feature maps output from the pooling layers go through a 3 × 3 convolutional layer followed by a sigmoid function to obtain the weight matrix. Figure 6 shows the channel attention mechanism in the SSA block. Average pooling is conducted to compress the number of channels to 1 and preserve the key features in parallel branches. The channel coefficient is obtained by squeezing and exciting with 1D convolution. Meanwhile, the feature maps are averaged and multiplied by the coefficient matrix. As there is a certain relationship between spectral and spatial information, the two weights are multiplied element-wise to generate adaptive weights and ensure the integrity of fusion. The mixed attention coefficient matrix is integrated into the weight coefficient of the convolution kernel. Finally, the inner product operation is performed on the input features to refine the spectral-spatial features.

As shallow contextual features and deep semantic features are distinct, cross-scale information fusion explores global features from diverse scales. Therefore, CSAF modules are adopted, as shown in Figure 7. First, the attention coefficient matrix is obtained for both high-resolution features and low-resolution features, for which the number of channels is compressed to 4 through global average pooling and the coefficients are normalized using softmax. Due to the mismatch in spatial size, small-scale features are upsampled to the spatial size of the high-resolution features. In the subsequent procedure, the upscaled deep features, the high-resolution features, and the different attention coefficients are divided into four groups for convolution in order to gradually incorporate rich features. Eventually, the concatenation of the four groups effectively enriches the learned representations and maximizes the preservation of detailed information in the image.
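The two attention branches of the SSA block and their element-wise combination can be sketched in NumPy as follows. This is a simplified illustration, not the authors' module: the kernels `k2d` and `k1d` stand in for learned weights, the helper names are hypothetical, and the final step of folding the mixed coefficients into the convolution kernel weights is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv3x3_same(maps, kernel):
    """3x3 cross-correlation with zero padding; maps (C, H, W), kernel (C, 3, 3) -> (H, W)."""
    _, h, w = maps.shape
    padded = np.pad(maps, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            window = padded[:, dy:dy + h, dx:dx + w]
            out += (window * kernel[:, dy, dx][:, None, None]).sum(axis=0)
    return out

def spatial_attention(feat, k2d):
    """Average and max pooling over channels, 3x3 convolution, then sigmoid."""
    pooled = np.stack([feat.mean(axis=0), feat.max(axis=0)])   # (2, H, W)
    return sigmoid(conv3x3_same(pooled, k2d))                  # (H, W)

def channel_attention(feat, k1d):
    """Global average pooling per channel, 1D convolution across channels, then sigmoid."""
    squeezed = feat.mean(axis=(1, 2))                          # (C,)
    padded = np.pad(squeezed, len(k1d) // 2)
    excited = np.convolve(padded, k1d[::-1], mode="valid")     # (C,) for odd kernel sizes
    return sigmoid(excited)

def mixed_attention(feat, k2d, k1d):
    """Element-wise product of channel and spatial weights, shape (C, H, W)."""
    return channel_attention(feat, k1d)[:, None, None] * spatial_attention(feat, k2d)[None]

feat = np.ones((4, 5, 5))
weights = mixed_attention(feat, np.zeros((2, 3, 3)), np.zeros(3))
print(weights.shape, weights[0, 0, 0])
```

With all-zero kernels both branches output sigmoid(0) = 0.5 everywhere, so every mixed coefficient equals 0.25; with learned kernels the product emphasizes positions that both branches consider important.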

Discriminative Network
The use of two discriminators can simultaneously force the spectral and spatial information to be preserved in the result. The discriminator framework is presented in Figure 8. The discriminator of pix2pix is utilized as the spectral discriminator, which takes the downsampled fused images and low-resolution MS images as input. The patch discriminator divides the image into N × N patches for discrimination and evaluates each patch of the image, which is applied for spectral restriction to further refine the intensity and contrast of the image. Five convolution layers increase the number of feature maps from 3 to 256 and the output features are reduced to 1, which are used to capture spectral information and generate the representation feature mapping. A 4 × 4 kernel with a stride of 2 is adopted in all of the convolutional layers, except for the last two layers, to reduce the image size and expand the feature dimensions; this not only effectively simplifies the model but also allows it to learn more distinct and accurate high-frequency features. The LeakyReLU activation function and instance normalization reduce the possibility of gradient vanishing, thus ensuring the stability of training.
The spatial discriminator follows the architecture of the pixel discriminator, which is composed entirely of convolutional layers. Its input is PAN images or single-channel fused images (obtained through maximum pooling over the channels). All layers are equipped with 1 × 1 filters, and the numbers of extracted feature maps are set to 64, 128, 128, and 1. Similarly, instance normalization is also conducted after convolution. The adversarial training process effectively improves the performance of the resulting images.
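To make the patch-wise output of the spectral discriminator concrete, the following sketch computes the size of its output map. The padding of 1 and the use of 4 × 4 kernels with stride 1 in the last two layers are assumptions borrowed from the common pix2pix PatchGAN layout; the text only states the kernel size and stride of the earlier layers.

```python
def conv_output_size(n, kernel=4, stride=2, padding=1):
    """Spatial size after one convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

def patch_map_size(n, strides=(2, 2, 2, 1, 1)):
    """Size of the patch-decision map after five layers (assumed stride pattern)."""
    for s in strides:
        n = conv_output_size(n, stride=s)
    return n

print(patch_map_size(64))   # 64 x 64 low-resolution MS input
print(patch_map_size(256))  # 256 x 256 input, for comparison
```

Under these assumptions a 64 × 64 input yields a 6 × 6 map of patch decisions, so each output element judges one local region rather than the whole image. The 1 × 1 filters of the spatial discriminator, in contrast, leave the spatial size unchanged and judge every pixel separately.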

Loss Functions
Due to the absence of ground truth, the hybrid loss function is designed to constrain the spatial and spectral features between the fused images and the source images. In general, the loss is continuously optimized in the process of training. The smaller the value, the closer the fused image is to the original image. Therefore, an appropriate loss function is also a critical component. The components of the loss function are detailed as follows.

Spectral Reconstruction Loss
As the generated MS image has high resolution and the input MS image has low resolution, the spectral responses differ between the two scales. Spectral constraints are therefore introduced at both scales, which allows the distribution of spectral information to be matched. With regard to the high-resolution image, low-pass information is extracted from both the upsampled MS image and the pansharpened output, and the two are compared. Here, $lp(\cdot)$ denotes the low-pass information extracted from the image, which is calculated using an average pooling operation with a kernel size of 5 and a padding of 2; $\uparrow$ represents the operation of converting the input MS image to the same resolution as the fused image; and, accordingly, $\downarrow$ represents interpolation to degrade the images spatially. In terms of the low-resolution image, the MS image and the downsampled output image are compared directly for constraint learning.
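The two spectral constraints can be sketched in NumPy as follows. Several details are assumptions rather than facts from the text: the L1 (mean absolute) distance, a pooling stride of 1 with zero padding counted in the average, nearest-neighbour upsampling as a stand-in for interpolation, and block averaging as a stand-in for the degradation operator. The function names are hypothetical.

```python
import numpy as np

def low_pass(img, k=5, pad=2):
    """Average pooling with kernel k, padding pad, stride 1; img has shape (C, H, W)."""
    _, h, w = img.shape
    padded = np.pad(img, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros(img.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[:, dy:dy + h, dx:dx + w]
    return out / (k * k)

def upsample(ms, scale=4):
    """Nearest-neighbour upsampling (stand-in for the interpolation operator)."""
    return ms.repeat(scale, axis=1).repeat(scale, axis=2)

def downsample(img, scale=4):
    """Block averaging (stand-in for the spatial degradation operator)."""
    c, h, w = img.shape
    return img.reshape(c, h // scale, scale, w // scale, scale).mean(axis=(2, 4))

def spectral_losses(ms, fused, scale=4):
    """Return (high-resolution, low-resolution) spectral losses, L1 assumed."""
    high = np.abs(low_pass(fused) - low_pass(upsample(ms, scale))).mean()
    low = np.abs(downsample(fused, scale) - ms).mean()
    return float(high), float(low)

ms = np.ones((4, 8, 8))
fused = np.ones((4, 32, 32))
print(spectral_losses(ms, fused))
```

With a kernel size of 5 and a padding of 2 the low-pass image keeps the spatial size of its input, so the upsampled MS image and the fused image can be compared pixel by pixel.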

Spatial Reconstruction Loss
A spatial constraint between the pansharpened image and the PAN image in the high-pass domain is applied for spatial preservation. High-frequency information represents detailed texture information, which facilitates the production of fake images by the generator. Specifically, low-pass information is obtained from the average filter and is subtracted from the input image. In the spatial reconstruction loss, $hp(\cdot)$ denotes the high-frequency information extracted from the image, which is acquired by subtracting the low-frequency feature from the input image. The parameters of the convolution filters are learnable and optimizable, in contrast to MTF-matched filters [22]. Furthermore, $\|\cdot\|_1$ is the L1 norm and $S(\cdot)$ denotes maximum pooling over the channel dimension (in order to compress the number of channels to 1).
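Based on the symbols defined above, one plausible form of the spatial reconstruction loss is given below. It is a reconstruction from the surrounding definitions rather than the authors' verbatim equation; the symbol $\widehat{MS}$ for the pansharpened image and $lp(\cdot)$ for the average-filter output are notational assumptions.

```latex
\begin{aligned}
hp(X) &= X - lp(X), \\
L_{\mathrm{spatial}} &= \left\| \, hp\!\left(S(\widehat{MS})\right) - hp(PAN) \, \right\|_{1}.
\end{aligned}
```

Comparing only the high-pass components means that the PAN image constrains edges and textures without forcing the fused image to copy its overall intensity.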

Index Loss
Image evaluation metrics generally measure the quality of generated images, which are also considered in the training loss. The non-reference image quality assessment index (QNR) [65] measures the performance of fusion. QNR predicts spatial distortion by comparing the spatial similarity between the fused image and PAN image, and it predicts spectral distortion by calculating the difference between the bands of the fused image and the MS image. The combination of the two obtains the final quality prediction value. For the formula of QNR in detail, please refer to the evaluation metrics in Equation (22). When QNR reaches the best value of 1, the value of $L_{qnr}$ is optimal.
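For orientation, the commonly used definition of QNR combines a spectral distortion index $D_\lambda$ and a spatial distortion index $D_s$, both in $[0, 1]$. The loss form shown in the second line is an assumption consistent with the statement that the loss is optimal when QNR equals 1; the exact expression used by the authors is not reproduced here.

```latex
\begin{aligned}
\mathrm{QNR} &= (1 - D_{\lambda})^{\alpha} \, (1 - D_{s})^{\beta}, \\
L_{qnr} &= 1 - \mathrm{QNR} \quad \text{(assumed form)}.
\end{aligned}
```

The exponents $\alpha$ and $\beta$ weight the two distortions and are often both set to 1, in which case QNR reaches 1 only when both distortions vanish.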
The structural similarity (SSIM) index measures the similarity of two images. The desired result is that the fused image at high resolution is as similar as possible to the upsampled MS image in its low-frequency content. Therefore, an additional loss based on SSIM is introduced, in which the means $\mu_x$ and $\mu_y$ represent the brightness of the image patches, the variances $\sigma_x$ and $\sigma_y$ represent contrast, and the covariance serves as a measure of structural similarity. The SSIM is calculated between the low-frequency part of the upsampled MS image and the low-frequency part of the fused image. It is calculated through averaging with a window size of 11.
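The SSIM computation can be illustrated with a simplified NumPy sketch that uses global image statistics instead of the sliding 11 × 11 window mentioned above. The stabilizing constants follow the commonly used values $C_1 = (0.01 L)^2$ and $C_2 = (0.03 L)^2$ for a data range $L$; defining the loss as $1 - \mathrm{SSIM}$ is an assumption, and the function names are hypothetical.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """SSIM from global statistics (simplification of the windowed version)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x = ((x - mu_x) ** 2).mean()
    var_y = ((y - mu_y) ** 2).mean()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)

def ssim_loss(x, y):
    """Assumed loss form: zero when the two images are identical."""
    return 1.0 - global_ssim(x, y)

rng = np.random.default_rng(0)
image = rng.random((16, 16))
print(global_ssim(image, image))
print(ssim_loss(image, 1.0 - image))
```

In the setting described in the text, `x` and `y` would be the low-frequency parts of the upsampled MS image and of the fused image, and the statistics would be averaged over local 11 × 11 windows.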

Adversarial Loss
The generator synthesizes increasingly realistic images to fool the discriminators through continuous adversarial learning, while the discriminators maintain a powerful capability to distinguish between real and synthetic images. Following WGAN-GP, a gradient penalty is introduced into the loss, which avoids the problem of exploding or vanishing gradients. In the adversarial loss of the spatial discriminator, $S(\cdot)$ is the operation of channel max pooling, which converts the multichannel fused image to an image with a single channel; $\nabla$ denotes the derivative operation; and $\lVert\cdot\rVert_2$ denotes the L2-norm for the data on the channel. According to WGAN-GP, the sample on which the penalty is evaluated is produced from the PAN image and the intermediate gray image obtained by applying $S(\cdot)$ to the fused image, mixed using $\varepsilon$, where $\varepsilon$ represents a tensor of the same size as the PAN image.
Correspondingly, the adversarial loss of the spectral discriminator is defined in the same way, with the penalty sample produced from the MS image and the fused image after downsampling.
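For orientation, the generic WGAN-GP critic objective has the form sketched below. The symbols are assumptions introduced for this sketch and are not the notation of this paper: $x$ is a real sample (the PAN image for the spatial discriminator, the MS image for the spectral discriminator), $\tilde{x}$ is the corresponding generated sample, $D$ is the discriminator, and $\lambda_{gp}$ is the penalty weight.

$$
L_D = \mathbb{E}\left[D(\tilde{x})\right] - \mathbb{E}\left[D(x)\right] + \lambda_{gp}\,\mathbb{E}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right], \qquad \hat{x} = \varepsilon x + (1 - \varepsilon)\tilde{x}
$$

In the original WGAN-GP formulation, $\varepsilon$ is sampled uniformly from $[0, 1]$, so $\hat{x}$ lies on the line between a real and a generated sample.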
Based on the loss functions described above, the hybrid loss function is defined as their weighted combination, where each $\lambda$ represents the weight of the corresponding loss term.

Data Sets and Experimental Setup
In this section, the proposed UPGAN is evaluated on two data sets to verify its effectiveness: QuickBird (QB) and WorldView2 (WV2).
QuickBird data set. The QB data set is composed of two types of images, with a spatial resolution of 0.6 m for the PAN images and 2.4 m for the MS images. The data set contains 451 patch pairs for training, 23 patch pairs for validation, and 23 patch pairs for testing. For both the reduced- and full-resolution experiments, the sizes of the MS and PAN patches are 64 × 64 and 256 × 256 pixels, respectively.
WorldView2 data set. In the WV2 case, the spatial resolution is 1.2 m for the MS images and 0.3 m for the PAN images. The 1253 images are divided into 1135, 59, and 59 images for training, validation, and testing, respectively. For both the reduced- and full-resolution data, the sizes of the MS and PAN images are 64 × 64 and 256 × 256 pixels, respectively. Following Wald's protocol, as is common practice, the original images are downsampled to prepare the reduced-resolution samples, and the source MS images serve as the ground truth images. The scale factor between the PAN and MS images was 4, which means H is 4 times h.
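The reduced-resolution preparation described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function names are hypothetical, images are plain nested lists, and simple block averaging is assumed as the degradation, whereas implementations of Wald's protocol usually apply a low-pass filter matched to the sensor before decimation.

```python
def block_mean_downsample(image, ratio=4):
    """Downsample a 2-D image (list of rows) by averaging ratio x ratio blocks.

    Block averaging is an assumption made for this sketch.
    """
    height, width = len(image), len(image[0])
    if height % ratio or width % ratio:
        raise ValueError("image size must be divisible by ratio")
    out = []
    for top in range(0, height, ratio):
        row = []
        for left in range(0, width, ratio):
            block = [image[top + dy][left + dx]
                     for dy in range(ratio) for dx in range(ratio)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def make_reduced_resolution_pair(pan, ms_bands, ratio=4):
    """Return (reduced PAN, reduced MS bands, ground truth) for one sample.

    The original MS bands are returned unchanged as the ground truth.
    """
    reduced_pan = block_mean_downsample(pan, ratio)
    reduced_ms = [block_mean_downsample(band, ratio) for band in ms_bands]
    return reduced_pan, reduced_ms, ms_bands
```

With the scale factor of 4 used here, a 256 × 256 PAN patch becomes 64 × 64 and a 64 × 64 MS band becomes 16 × 16, while the original 64 × 64 MS patch plays the role of the reference.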
In the training phase, the proposed method was implemented in PyTorch 1.12.1 on a single NVIDIA GeForce RTX 3090 with 24 GB of memory, and the model was trained for 30,000 epochs. The same fixed default parameters were used on both data sets. The batch size and the initial learning rate were set to 2 and 0.0001, respectively, and the model was optimized with the AdamW optimizer. The hyperparameters in the loss function were set as $\lambda_{qnr} = 1.0$, $\lambda_{adv} = 0.01$, $\lambda_{spatial} = 0.01$, $\lambda_{ssim} = 0.001$, $\lambda_{spectral-high} = 0.01$, $\lambda_{spectral-low} = 0.01$, and $\lambda = 10$.
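As an illustration of how these weights might enter training, the sketch below combines already computed loss terms into a single value. It assumes that the hybrid loss is a plain weighted sum and that the unsubscripted value of 10 is the gradient penalty weight of WGAN-GP; neither assumption is confirmed by the text, and the dictionary keys and function name are hypothetical.

```python
# Weights as reported in the experimental setup.
LOSS_WEIGHTS = {
    "qnr": 1.0,
    "adv": 0.01,
    "spatial": 0.01,
    "ssim": 0.001,
    "spectral_high": 0.01,
    "spectral_low": 0.01,
}

# Assumption: the unsubscripted weight of 10 scales the gradient penalty
# inside the discriminator loss, so it is kept separate from the sum below.
GRADIENT_PENALTY_WEIGHT = 10.0


def hybrid_loss(terms, weights=LOSS_WEIGHTS):
    """Weighted sum of loss terms; `terms` maps each name to a scalar value."""
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)
```

Keeping the weights in one mapping makes it straightforward to repeat a weight ablation by overriding a single entry, for example `dict(LOSS_WEIGHTS, qnr=0.5)`.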
In the test phase, the proposed method and the comparative methods were evaluated qualitatively and quantitatively. Five reference-based quality metrics were adopted: the peak signal-to-noise ratio (PSNR) [66], the spectral correlation coefficient (SCC) [67], the spectral angle mapper (SAM) [68], the relative dimensionless global error in synthesis (ERGAS) [69], and the universal image quality index (UIQI) [70]. Three representative full-scale metrics were employed for the no-reference experiments: the spectral distortion index $D_\lambda$, the spatial distortion index $D_s$, and the quality with no reference index QNR [65]. However, according to Arienzo et al. [71], these commonly used indicators have some disadvantages; for example, the spatial distortion index used in QNR is not decoupled from the spectral distortion index. Therefore, two full-resolution metrics based on the reprojection protocol [72] were added: $D_\rho$ for spatial consistency assessment and $R\text{-}Q2^n$ for spectral accuracy assessment. All quality assessment metrics used in the experiments are described as follows:

1. The peak signal-to-noise ratio (PSNR) [66] represents the information contained in the fused image compared with the reference image, which can reflect the distortion in the fusion process at the pixel level. It is defined in terms of a, the peak value of the image pixels, and MSE, the mean square error.

2. The spectral correlation coefficient (SCC) [67] is an evaluation index used to measure the degree of spectral correlation between images, where GT stands for the reference image, and $\overline{GT}$ and $\overline{MS}$ denote the average values of GT and the fused image, respectively.

3. The spectral angle mapper (SAM) [68] measures the absolute value of the spectral angle between the MS and fused images. Usually, the global spectral distortion is measured by averaging the angle over the corresponding pixels of the whole image, where each angle is computed between a pixel vector in the fused image and the corresponding pixel vector in the reference image.

4. The relative dimensionless global error in synthesis (ERGAS) [69] calculates the normalized mean error over the bands of the fused image and ranges from 0 to infinity. In its definition, $d_h/d_t$ represents the ratio between the pixel sizes of the PAN and MS images, $\mu(l)$ is the average of the $l$-th band, B is the number of bands, and RMSE($l$) represents the root-mean-square error of the $l$-th band.

5. The universal image quality index (UIQI) [70], also known as the Q index, gives a score for the overall quality of the image and models the distortion in the fused image. The fusion performance is characterized by measuring the covariance, standard deviation, and mean, where $\sigma_1$ and $\sigma_2$ represent the standard deviations of the two images, C denotes the covariance, and K is a constant. Moreover, the Q index has been generalized to multispectral and hyperspectral images based on the computation of the hypercomplex correlation coefficient between the reference and fused images, which is referred to as $Q2^n$ [73].

6. The spectral distortion index $D_\lambda$ [65] is defined using p, a positive integer that emphasizes larger spectral differences, and $Q(\cdot)$, the Q index calculated between two images.

7. The spatial distortion index $D_s$ [65] is defined using $PAN_{LP}$, which represents a low-resolution PAN image with spatial degradation.

8. Considering the spectral distortion and spatial distortion comprehensively, the quality with no reference index (QNR) [65] is the main non-reference index for evaluating full-resolution images, where $\alpha$ and $\beta$ are parameters used to balance the $D_\lambda$ and $D_s$ indices.

9. The correlation-based spatial consistency index $D_\rho$ [72] evaluates the preservation of the full-resolution spatial structure and calculates the average local correlation between the pansharpened image and the PAN image, where $PAN^{\sigma}_{ij}$ represents the region of size $\sigma \times \sigma$ centered on pixel position $(i, j)$ in the original PAN image, and corrcoef(·) calculates the local correlation coefficient between the original PAN image and each band b of the fused image.

10. The reprojection spectral distortion index $R\text{-}Q2^n$ [72] is an evaluation index used to calculate spectral errors. The fused results are reprojected to the spatial resolution of LRMS, and the errors are calculated using the Q index, where $MS_{LP}$ represents the fused image reprojected to the low resolution.
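Several of the metrics listed above can be sketched compactly in plain Python. The definitions below follow the commonly used forms of PSNR, SAM, ERGAS, and QNR rather than the authors' evaluation code; the flat-list data layout, the function names, the use of the reference band mean in ERGAS, and the defaults `alpha = beta = 1` are assumptions made for illustration.

```python
import math


def psnr(reference, fused, peak):
    """PSNR in dB between two equally sized flat pixel lists."""
    mse = sum((r - f) ** 2 for r, f in zip(reference, fused)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


def sam_degrees(reference_pixels, fused_pixels):
    """Mean spectral angle in degrees; each pixel is a vector of band values."""
    angles = []
    for v_ref, v_fus in zip(reference_pixels, fused_pixels):
        dot = sum(a * b for a, b in zip(v_ref, v_fus))
        norm = (math.sqrt(sum(a * a for a in v_ref))
                * math.sqrt(sum(b * b for b in v_fus)))
        if norm == 0:
            continue  # the angle is undefined for all-zero pixels
        cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding
        angles.append(math.degrees(math.acos(cosine)))
    return sum(angles) / len(angles) if angles else 0.0


def ergas(reference_bands, fused_bands, ratio=4):
    """ERGAS over bands given as flat pixel lists.

    `ratio` is the MS-to-PAN pixel size ratio, so 1 / ratio is the
    PAN-to-MS ratio that appears in the usual definition.
    """
    total = 0.0
    for ref, fus in zip(reference_bands, fused_bands):
        rmse = math.sqrt(sum((r - f) ** 2 for r, f in zip(ref, fus)) / len(ref))
        mean = sum(ref) / len(ref)
        total += (rmse / mean) ** 2
    return 100.0 / ratio * math.sqrt(total / len(reference_bands))


def qnr(d_lambda, d_s, alpha=1.0, beta=1.0):
    """Combine the spectral and spatial distortion indices into QNR."""
    return (1.0 - d_lambda) ** alpha * (1.0 - d_s) ** beta
```

Higher values are better for PSNR and QNR, whereas lower values are better for SAM and ERGAS, which matters when reading the comparisons in the following subsections.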

Experimental Results on QB Data Set
Tables 1 and 2 present the quantitative results on the QB data set. In the experiments with reference, the proposed UPGAN was superior to the other methods in terms of most indices. In particular, its PSNR was almost 1 dB higher than that of UCGAN, which suggests that the images generated by UPGAN are closer to the reference and achieve a better visual effect. Although the UPGAN method was slightly inferior to Brovey in SAM, its advantage was clear when compared with the other unsupervised methods based on deep learning. In Brovey, the pixels of each band are comprehensively considered to determine the proportion of the gray value of each feature in the panchromatic band. Thus, the prominent differences among the characteristics of ground objects can be highlighted, which improves the SAM.
As shown in Figure 9, the MTF_GLP_HPM and SFIM methods suffered from severe ringing artifacts. The images obtained with the ZeRGAN, LDPNet, and PanGAN methods show significant spectral degradation, especially LDPNet, which also introduced significant spatial blurring. These results indicate the shortcomings of the methods mentioned above with respect to reconstructing spatial detail while maintaining spectral performance. Although the UCGAN and UPGAN methods present slight spatial blurring, both obtained impressive visual results. However, the result of UCGAN shows serious spectral distortion in the zoomed region marked by the red box, where the color appears gray. Compared with the GT image, the roof generated with the ZS-Pan method is light blue in the enlarged area. Although Brovey had the best performance in SAM, artifacts are still present in its image.

Figure 10 shows the mean absolute error (MAE) of the fused images in order to distinguish the visual differences in detail. The error maps of MTF_GLP_HPM, SR-D, and ZS-Pan were much brighter than those of the other methods, which indicates larger errors with respect to the ground truth. Furthermore, the Brovey, PCA, and LDPNet methods presented disadvantages in reconstructing high-frequency details. The error of the image produced with UPGAN was minor, as can be seen from the yellow enlarged area. Compared with UCGAN, the spatial details extracted by UPGAN are superior, indicating the excellent performance of UPGAN.

With respect to the full-resolution metrics, the UPGAN method outperformed most comparative methods and obtained optimal values, while UCGAN ranked second in most indicators. The results obtained with the competitive methods are shown in Figure 11. The conventional methods yielded less distortion than the deep learning methods, but most of the pansharpened images presented artifacts. The fused image produced by SR-D suffers from obvious spatial distortion, resulting in jagged edges. None of the five methods based on deep learning achieved great spectral performance; in particular, the visual results of ZeRGAN, UCGAN, and LDPNet present gray areas, possibly due to insufficient information extraction. In the result of the UPGAN method, the overall color is harmonious and the light orange color of the enlarged area is reconstructed, which is consistent with the best results obtained by the proposed model in the quantitative evaluation.

Experimental Results on WV2 Data Set
The average values of the experimental results on the WV2 data set are summarized in Tables 3 and 4, which further verify the effectiveness of the proposed UPGAN.
It can be seen that UPGAN performed best in terms of most metrics. Although UPGAN showed relatively poor performance in SAM, it still had advantages over the other unsupervised methods based on deep learning. In particular, the PSNR of the images generated by UPGAN was 0.61 dB higher than that of UCGAN, which is consistent with the better visual effects of UPGAN in the qualitative evaluation. Figure 12 shows that the images obtained with PCA and SFIM present significant spectral distortion. SR-D led to a fuzzy structure at the edges of objects. According to the qualitative analysis, UPGAN was the only method that reconstructed the light green river similar to that in the reference, while the other pansharpened images show a darker color. In particular, the ZS-Pan method restored the color as an obviously dark blue rather than green. Furthermore, ZeRGAN and UCGAN introduced distortion, in that the road displays an abnormal green color on the left side of the image. The enlarged red rectangle shows the varying degrees of spatial distortion in the results of the compared methods. In contrast, the images of UPGAN and Brovey maintained great spectral and spatial fidelity. The MAE is presented in Figure 13 for each fused result, which shows that the results produced by the UPGAN method had only minor errors.

From Table 4, it can be seen that UPGAN achieved second place in $D_\lambda$ and performed the best in terms of the other three metrics. As with the results on the QB data set, its $D_\rho$ was not the best, which indicates that our method may still have limitations related to the insufficient utilization of spatial information. Moreover, Figure 14 shows that the evaluation indices of some conventional methods were better than those of the deep learning methods. On the one hand, the lack of reference images limits the performance of deep learning models, which can also be observed in the simulated experiment. On the other hand, the effectiveness of deep learning methods heavily depends on the training data. It is difficult to perfectly simulate the mapping relationship among PAN images, LRMS images, and HRMS images at reduced resolution. The images produced by SFIM and LDPNet contain significant spectral distortion, and obvious artifacts can be observed in the pansharpened results of MTF_GLP_HPM and PanGAN. As can be observed from the lawn on the right of the image, the Brovey, PCA, ZS-Pan, and ZeRGAN methods recovered unusual colors. UPGAN and UCGAN produced the images with the best quality among all methods. It can be seen from the enlarged area that UPGAN can better restore the regular shape of cultivated land; however, it should be noted that these methods still do not solve the ambiguity problem. In conclusion, the UPGAN method had the best performance in terms of recovering texture details and spectral information.

Ablation Study
Next, ablation experiments were implemented on the QB data set, in which various modules were tested separately to verify their validity within the UPGAN structure.

Effectiveness of the Discriminator
UPGAN with LSGAN serves as a variant structure, whose objective evaluation is listed in Table 5. The variant leads to a certain degree of performance degradation. Compared with the least-squares loss used in LSGAN, the WGAN-GP loss applied in UPGAN is based on the Wasserstein distance, which measures the distance between two probability distributions. This not only makes the training process more stable but, through the gradient penalty term, also avoids gradient explosion and mode collapse problems.

In Figure 15, the modules in the generator are simplified as orange and gray circles. Additional skip connections, represented by dotted lines, were added to the original structure. These skip connections transmit the deepest features of the feature extraction stage to each layer in the image reconstruction stage to recover the salient features. The results are denoted as "UPGAN with SC" in Table 5 and indicate that the additional skip connections do not improve performance. The reason may be that the original connections of the U-shaped structure already supply sufficient information, so that transmitting information through superfluous connections causes redundancy, thus slightly degrading the performance.

Effectiveness of the SSA Module
Next, the effectiveness of the SSA module and the channel attention mechanism was verified. The quantitative indices are reported in Table 5 and indicate that UPGAN exhibits performance degradation when the SSA in the DCAM module is removed and replaced by a convolutional layer. Moreover, the designed channel attention in SSA was replaced by the classical SE block. It can be seen that all indicators were worse than the results obtained with UPGAN. Compared with the squeeze-and-excitation mechanism of the SE block, the proposed module employs a parallel structure and average value calculation to acquire its significant characteristics.

Effectiveness of the CSAF Module
Two experiments were conducted with respect to the CSAF module: in one, the CSAF module was replaced with the concatenation operation, and in the other, the structure was modified by removing the grouping operation and stacking the features directly. The results are shown in rows 5 and 6 of Table 5. Although the ERGAS index was optimal for UPGAN without CSAF, the other four indices were not the best. For CSAF without grouping, the PSNR clearly decreased, leading to distortion of the pansharpened results, which indicates that feature grouping can realize a better trade-off between spatial structure and spectral fidelity.

Effectiveness of the Dynamic Snake Convolution
As shown in row 7 of Table 5, ordinary convolutions replace the dynamic snake convolutions in the MSDFL module; this variant is referred to as MSDFL w/o DSC. It can be seen that each metric is worse without snake convolutions. In particular, the PSNR is reduced by 0.3 dB, which means that the fused results obtained with only ordinary convolutions are more distorted. The reason may be that the learned offsets of the spatial locations focus on tubular regions. Therefore, dynamic snake convolution is suitable for improving the fusion performance.

Effectiveness of the Number of Sub-Modules in DCAM
As the DCAM is composed of two sub-modules, the MSDFL module with residual dense connections and the SSA module with the attention mechanism, the number of each sub-module was varied to explore the associated influence on the performance of the model. As shown in Figure 16, M2S2 indicates that two MSDFL modules and two SSA modules are adopted in DCAM. It can be concluded that the model consisting of two MSDFL modules and one SSA module (M2S1) achieved optimal performance in three indices. Although both SCC and UIQI improve as the number of sub-modules increases, the number of network parameters increases as well. As shown in Table 6, an increase in the number of MSDFL modules leads to a sharp increase in the number of parameters. Moreover, the training time exceeded 7 h when four MSDFL modules were used. Therefore, M2S1 was chosen for the UPGAN model in order to balance model performance and training resources.

Effectiveness of the Hyperparameters in UPGAN
Several hyperparameter ablation experiments were performed, the results of which are shown in Table 7. The batch size determines the smoothness of the gradient among iterations and the time required to complete each epoch in the training process. When the batch size was adjusted, all indicators decreased. After that, an experiment on the learning rate was conducted. The smaller the batch size, the smaller the learning rate needs to be; otherwise, the convergence time will be long and the results will be poor. Therefore, the initial learning rate was set to 0.001 for training. Moreover, two weights for the index loss were tested. If $\lambda_{qnr} = 0.5$, only the SCC metric presents a small increase, while all other metrics decrease. When the weight of the SSIM loss is adjusted, all indicators decrease to different degrees.

The composite loss function is composed of several parts, so an ablation experiment was performed for each part on the QB data set. It can be seen from Table 8 that the index loss function is very important for model training, and the absence of $L_{qnr}$ seriously affects the relevant indicators of the fused images. The spatial constraints and spectral constraints were also verified separately, which showed that the performance of the model is reduced in the absence of either the spatial or the spectral constraints. When the spectral loss was constrained at both high and low spatial resolution, the results were better than those obtained with the constraint at only one spatial resolution. Although the UIQI index reached its optimal value in the absence of $L_{spectral-high}$, the other four indicators all deteriorated to a certain extent. Therefore, the spectral constraint of the loss function in UPGAN adopts both kinds of constraints.

Training Time
The training time of each deep learning method is shown in Table 9. The training mode of ZeRGAN is special, as its test set is also its training set. Only one pair of test set images was input for training, and the total training time was 20.98 h. For ZS-Pan, the total training time of the three training phases was calculated. For the other methods, the total time of processing the training set is shown. In general, complex structures can obtain better performance, and the more model parameters there are, the longer it takes to generate a single fused image. Our proposed method, UPGAN, mainly optimizes the model structure from the perspective of improving the fusion effect. The structures used in UPGAN, such as the dynamic snake convolution and dense connections, are time consuming, but the cost is acceptable.

Conclusions
This study proposed an unsupervised pansharpening network for remote sensing images, called UPGAN. The model consists of a generator and two discriminators. In the generator, the DCAM module combines dynamic snake convolution and attention mechanisms to extract and reconstruct the features of images. The CSAF module fuses the feature groups at different scales to improve spectral fidelity and spatial resolution. Due to the lack of reference images, a loss function with four constraints was designed to optimize the model training process. The proposed method was compared with five traditional methods and five unsupervised methods based on deep learning on the QB and WV2 data sets. The results demonstrated the superiority of UPGAN in terms of both visual quality and objective index values.

Figure 2. Slender tubular structures in remote sensing images.

Figure 15. The connection structure in the generator.

Figure 16. Average quantitative results for the number of sub-modules in DCAM on the QB data set. (a) PSNR. (b) SCC. (c) SAM. (d) ERGAS. (e) UIQI.

Table 1. Quantitative results of compared methods on QB data set in reduced resolution. The best values are shown in bold, and second place is underlined.

Table 2. Quantitative results of compared methods on QB data set in full resolution. The best values are shown in bold, and second place is underlined.

Table 3. Quantitative results of compared methods on WV2 data set in reduced resolution. The best values are shown in bold, and second place is underlined.

Table 4. Quantitative results of compared methods on WV2 data set in full resolution. The best values are shown in bold, and second place is underlined.

Table 5. Ablation study of UPGAN on QB data set.

Table 6. Comparison of network parameters and training times under different numbers of DCAMs.

Table 7. Ablation study for the hyperparameters on the QB data set.

Table 8. Ablation study of the loss function on QB data set.

Table 9. Comparison of training times for different deep learning methods.