PercepPan: Towards Unsupervised Pan-Sharpening Based on Perceptual Loss

Abstract: In the literature on pan-sharpening based on neural networks, high resolution multispectral images are generally unavailable as ground-truth labels. To tackle this issue, a common method is to degrade the original images into a lower resolution space for supervised training under Wald's protocol. In this paper, we propose an unsupervised pan-sharpening framework, referred to as "perceptual pan-sharpening" (PercepPan). This novel method is based on an auto-encoder and perceptual loss, and it does not need the degradation step for training. To boost performance, we also suggest a novel training paradigm, called "first supervised pre-training and then unsupervised fine-tuning", to train the unsupervised framework. Experiments on the QuickBird dataset show that the framework with different generator architectures achieves results comparable with the traditional supervised counterpart, and that the novel training paradigm performs better than random initialization. When generalizing to the IKONOS dataset, the unsupervised framework still achieves competitive results over the supervised ones.


Introduction
Pan-sharpening is generally described as an image fusion problem aiming to generate a high resolution multispectral (HRMS) image based on a low resolution multispectral (LRMS) image and a panchromatic (PAN) counterpart. Classical pan-sharpening methods include component substitution [1][2][3], multi-resolution analysis [4,5], and variational optimization [6,7]. Comprehensive reviews of these methods can be found in [8,9].
With the boom of deep learning, more and more researchers use neural networks to solve the pan-sharpening problem and achieve promising results. Inspired by image super-resolution [10][11][12][13], Masi et al. [14] construct a three-layer convolutional neural network for pan-sharpening. In contrast, Shao et al. [15] design a deep convolutional network with two branches, one for LRMS images and the other for PAN images. To make full use of domain knowledge, Yang et al. [16] integrate a specially designed structure for spectral and spatial information preservation. To further improve image quality, Liu et al. [17] use generative adversarial networks (GAN) [18] to build a pan-sharpening network, called PSGAN, in which a two-stream generator is designed to receive LRMS images and PAN images simultaneously. Different from other methods [19][20][21], (deep) neural network-based methods can efficiently extract multi-level abstract features [22,23] for performance boosting with standard backpropagation.

These considerations raise three questions:
• How can the framework and loss be designed to train a pan-sharpening model G directly?
• Could supervised pre-training offer gains in the SPUF (supervised pre-training and unsupervised fine-tuning) training paradigm?
• Could the unsupervised perspective outperform its supervised counterpart?
The contributions of this paper could be summarized as follows:

1. A novel unsupervised learning framework, "perceptual pan-sharpening (PercepPan)", is proposed, which no longer needs the degradation step. The framework consists of a generator, a reconstructor, and a discriminator. The generator is responsible for generating HRMS images, the reconstructor takes advantage of prior knowledge to imitate the observation model from HRMS images to LRMS-PAN image pairs, and the discriminator extracts features from LRMS-PAN image pairs to compute the feature loss and the GAN loss.
2. A perceptual loss is adopted as the objective function. The loss consists of three parts: one computed in pixel space, another in feature space, and the last in GAN space. The hybrid loss is beneficial for improving the perceptual quality of generated HRMS images.
3. A novel training paradigm, called SPUF, is adopted to train the proposed PercepPan. Experiments show that SPUF usually outperforms random initialization.
4. Experiments show that PercepPan can cooperate with several different generators. Experiments on the QuickBird dataset show that the unsupervised results are comparable to the supervised ones. When generalizing to the IKONOS dataset, similar conclusions still hold.
The rest of this paper is organized as follows. Section 2 introduces the perceptual loss in previous works. Section 3 describes the proposed PercepPan in detail. Section 4 experimentally verifies the effectiveness of the proposed PercepPan. Finally, Section 5 concludes this paper.

Perceptual Loss
Basically, the proposed PercepPan is trained with perceptual loss. Perceptual loss mainly depends on high-level features extracted by (convolutional) neural networks [47] rather than on image pixel values. After being introduced into image super-resolution [48], the loss has received more and more attention.
The most striking example of perceptual loss is real-time style transfer and image super-resolution in [48], where the perceptual loss is computed as the Euclidean distance between real and reconstructed features. The loss can diminish, to some extent, the ambiguity between high resolution and low resolution images.
Perceptual loss can also be combined with GAN loss for better performance. In the variational auto-encoder/generative adversarial network (VAE/GAN) [49], feature loss and GAN loss are combined for similarity metric learning, which can be treated as an extension of perceptual loss. It also inspires our perceptual loss for pan-sharpening. Specifically, VAE/GAN uses three different losses for training. The first one is the prior loss, $\mathrm{KL}(z = \mathrm{Enc}(x) \,\|\, z_p)$, which constrains the latent representation $z$ learned from a data point $x$ to follow the same distribution as $z_p$ drawn from a prior distribution. The second one is the feature loss, $\|\mathrm{Dis}^{(l)}(x) - \mathrm{Dis}^{(l)}(\tilde{x} = \mathrm{Dec}(z))\|_2^2$, which is based on hidden representations from the $l$-th layer of the discriminator in VAE/GAN. The last one is the GAN loss, $\log(\mathrm{Dis}(x)) + \log(1 - \mathrm{Dis}(\tilde{x})) + \log(1 - \mathrm{Dis}(x_p))$, which can improve image sharpness. Here, $\mathrm{KL}$ denotes the Kullback-Leibler divergence; $\mathrm{Enc}$, $\mathrm{Dec}$, and $\mathrm{Dis}$ denote the encoder, decoder, and discriminator, respectively; and $\tilde{x}$ and $x_p$ denote the reconstructed and generated images, respectively.
The proposed PercepPan adopts a loss computation similar to VAE/GAN but with some differences. Specifically, PercepPan treats HRMS images directly as the latent representation, which means that the dimensionality of the representation is higher than that of the input; moreover, PercepPan introduces a loss computed in pixel space as an alternative to the prior loss.
Another example leveraging perceptual loss with GAN is the enhanced super-resolution GAN (ESRGAN) [50], in which the residual-in-residual dense block (RRDB) is introduced as the basic unit, together with relativistic generative adversarial networks [51] and perceptual loss [48]. These tricks help ESRGAN generate high resolution images with better perceptual quality and win first place in the PIRM2018-SR Challenge [52]. Mathematically, ESRGAN can be simply expressed as
$$\hat{y} = G(x), \qquad (1)$$
where $x$ and $\hat{y}$ denote low resolution (LR) and high resolution (HR) images with three channels, respectively. Figure 2 shows the generator architecture of ESRGAN. The proposed PercepPan also adopts the architecture of ESRGAN as a generator for pan-sharpening, except for minor adaptations. Specifically, the images used in PercepPan are multispectral (MS) images, which usually have more channels/bands, such as four for IKONOS and QuickBird, and eight for WorldView-2, so the number of channels of the filters in the first convolutional layer needs to be changed. Moreover, PercepPan uses ESRGAN for "residual learning" rather than generating HR images directly,
$$(\mu_x, \sigma_x) = G(x), \qquad (2)$$
where $x$ denotes an MS image, and $\mu_x$ and $\sigma_x$ are residuals, both of which have the same number of channels as $x$. This means that the number of channels of the filters in the last convolutional layer also needs to be changed. As an example, Figure 2 also illustrates the adaptation for MS images with four bands. These learned residuals are then fused with PAN images in a manner similar to style transfer [53,54].
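As an illustration of these channel adaptations, the following is a minimal sketch, not the authors' exact code: the RRDB trunk is abbreviated to a placeholder stack, and only the first and last convolutions, the parts that actually change for MS input and residual output, are spelled out. All layer and class names here are ours.

```python
# A minimal sketch of adapting an ESRGAN-style generator to MS input:
# first conv takes `ms_bands` channels instead of 3; last conv emits
# 2 * ms_bands channels, i.e., the residuals (mu_x, sigma_x) of Eq. (2).
import torch
import torch.nn as nn

class MSResidualGenerator(nn.Module):
    def __init__(self, ms_bands=4, feat=64):
        super().__init__()
        self.conv_first = nn.Conv2d(ms_bands, feat, 3, padding=1)
        # Stand-in for the RRDB trunk + 4x upsampling of ESRGAN.
        self.trunk = nn.Sequential(
            *[nn.Sequential(nn.Conv2d(feat, feat, 3, padding=1),
                            nn.LeakyReLU(0.2, inplace=True))
              for _ in range(3)],
            nn.Upsample(scale_factor=4, mode="nearest"),
            nn.Conv2d(feat, feat, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
        )
        self.conv_last = nn.Conv2d(feat, 2 * ms_bands, 3, padding=1)

    def forward(self, x):                    # x: (B, C, H, W) LRMS
        out = self.conv_last(self.trunk(self.conv_first(x)))
        mu_x, sigma_x = out.chunk(2, dim=1)  # each (B, C, rH, rW)
        return mu_x, sigma_x
```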
It should be noted that the proposed PercepPan can cooperate with different generators. The architecture constructed above is only an example, and it is not a crucial part of the PercepPan framework.

Methodology
In this section, we first formulate pan-sharpening as a supervised learning problem, and then present our unsupervised PercepPan framework.

Pan-Sharpening Formulation
Given a training dataset with $N$ samples, $\{(x^{(n)}, p^{(n)}, y^{(n)})\}_{n=1}^{N}$, where $x^{(n)} \in \mathbb{R}^{W \times H \times C}$, $p^{(n)} \in \mathbb{R}^{rW \times rH}$, and $y^{(n)} \in \mathbb{R}^{rW \times rH \times C}$ denote the LRMS image, the PAN image, and the HRMS image, respectively. $W$, $H$, and $C$ denote the width, height, and number of bands of an LRMS image, respectively, and $r$ is the spatial resolution ratio between an LRMS image and a PAN image.
When the ground-truth HRMS image $y^{(n)}$ is known, the pan-sharpening problem can be expressed as the following supervised learning problem:
$$G^{*} = \arg\min_{G \in \mathcal{G}} \sum_{n=1}^{N} L(\hat{y}^{(n)}, y^{(n)}), \qquad (3)$$
where $\mathcal{G}$ denotes a set of pan-sharpening models/generators; $L$ is a loss function, such as MSELoss (mean squared error loss) or L1Loss/MAELoss (mean absolute error loss) in pixel space; and $\hat{y}^{(n)}$ denotes an HRMS image generated by a pan-sharpening generator $G \in \mathcal{G}$,
$$\hat{y}^{(n)} = G(x^{(n)}, p^{(n)}). \qquad (4)$$
However, the ground-truth HRMS image is in fact unavailable. In this case, the loss in Equation (3) cannot be calculated, nor can the generator $G$ be trained.
In this paper, we introduce auto-encoders [49,55] to deal with the absence of HRMS images. Usually, an auto-encoder consists of an encoder learning a latent representation of an input, and a decoder (or reconstructor) reconstructing the input from the learned representation. It is usually trained with a reconstruction loss in pixel space and does not need any labels. For pan-sharpening, the generator $G$ plays the role of the encoder, and, in this case, the latent representation is exactly the fused HRMS image $\hat{y}^{(n)}$. An extra architecture $R = (R_x, R_p)$ is introduced to reconstruct LRMS-PAN image pairs from $\hat{y}^{(n)}$, that is,
$$(\tilde{x}^{(n)}, \tilde{p}^{(n)}) = R(\hat{y}^{(n)}) = (R_x(\hat{y}^{(n)}), R_p(\hat{y}^{(n)})), \qquad (5)$$
where $\tilde{x}^{(n)}$ and $\tilde{p}^{(n)}$ denote the reconstructed LRMS and PAN images, respectively. Based on the reconstructed images, the loss computation can be moved from the HRMS image space to the LRMS-PAN image pair space. Hence, Equation (3) can be reformulated as
$$G^{*}, R^{*} = \arg\min_{G \in \mathcal{G},\, R \in \mathcal{R}} \sum_{n=1}^{N} L\big((\tilde{x}^{(n)}, \tilde{p}^{(n)}), (x^{(n)}, p^{(n)})\big), \qquad (6)$$
where $\mathcal{R}$ stands for a set of reconstructors. However, computing the loss only in pixel space might introduce blurring, especially when MSELoss is used [49,56]. To prevent blurring and obtain better perceptual quality, a hybrid loss is introduced. Generally, the loss computation can be expressed as
$$L\big(M(\tilde{x}^{(n)}, \tilde{p}^{(n)}), M(x^{(n)}, p^{(n)})\big), \qquad (7)$$
in which $M$ is an arbitrary function. When $M$ is the identity function, Equation (7) is equivalent to a loss computed only in pixel space,
$$L\big(M(\tilde{x}^{(n)}, \tilde{p}^{(n)}), M(x^{(n)}, p^{(n)})\big) := L_{\text{pixel}}(\tilde{x}^{(n)}, x^{(n)}) + L_{\text{pixel}}(\tilde{p}^{(n)}, p^{(n)}), \qquad (8)$$
where $L_{\text{pixel}}$ is MSELoss or L1Loss. When $M$ is a more complicated function extracting features from LRMS-PAN image pairs, the loss can be expressed as
$$L\big(M(\tilde{x}^{(n)}, \tilde{p}^{(n)}), M(x^{(n)}, p^{(n)})\big) := L_{\text{feat}}\big(F(\tilde{x}^{(n)}, \tilde{p}^{(n)}), F(x^{(n)}, p^{(n)})\big), \qquad (9)$$
in which $F$ is written in place of $M$ for clarity, and $L_{\text{feat}}$ is MSELoss or L1Loss. When $M$ is a discriminator $D$ of a GAN [18], the loss can be expressed as
$$L\big(M(\tilde{x}^{(n)}, \tilde{p}^{(n)}), M(x^{(n)}, p^{(n)})\big) := L_{\text{GAN}}\big(D(\tilde{x}^{(n)}, \tilde{p}^{(n)}), D(x^{(n)}, p^{(n)})\big), \qquad (10)$$
where $L_{\text{GAN}}$ can be BCELoss (binary cross entropy loss). These three kinds of losses represent LRMS-PAN image pairs at different levels of abstraction.
Combining Equations (8)-(10), the optimization objective for pan-sharpening can be expressed as
$$G^{*}, R^{*} = \arg\min_{G \in \mathcal{G},\, R \in \mathcal{R}} \max_{D \in \mathcal{D}} \sum_{n=1}^{N} \left[ \alpha L_{\text{pixel}}^{(n)} + \beta L_{\text{feat}}^{(n)} + \gamma L_{\text{GAN}}^{(n)} \right], \qquad (11)$$
where $\mathcal{D}$ denotes a set of discriminators, and $\alpha$, $\beta$, and $\gamma$ are hyper-parameters controlling the importance of the different loss terms. Equation (11) can be treated as an extension of perceptual loss, which is usually used for style transfer and image super-resolution [12,48]. This is why we call the proposed model "perceptual pan-sharpening", or PercepPan for short. It is a fully unsupervised learning formulation and does not need ground-truth HRMS images at all. It should be noticed that $F$ is implemented as a part of $D$ in this paper rather than as an individual neural network. Figure 3 shows the structure of our PercepPan, where $G$, $R$, and $D$ are all implemented by neural networks. $F$ is a part of $D$, and it is split into two streams, $F = (F_x, F_p)$, with $F_x$ extracting features from LRMS images and $F_p$ extracting features from PAN images. These features are first concatenated along the channel axis and then processed by a VGG-style network [57].
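To make Equation (11) concrete, the following is a sketch of the three loss terms, not the authors' exact implementation. It assumes a discriminator of the form feats, logit = D(x, p), returning both the fused intermediate features F(x, p) and a real/fake logit; all names are illustrative.

```python
# Sketch of the hybrid loss of Eq. (11), assuming D(x, p) returns
# (features, real/fake logit). Defaults mirror the paper's weights.
import torch
import torch.nn as nn

mse = nn.MSELoss()
bce = nn.BCEWithLogitsLoss()

def generator_loss(x, p, x_rec, p_rec, D, alpha=1.0, beta=1.0, gamma=0.01):
    # Pixel loss, Eq. (8): reconstruction error in LRMS-PAN pixel space.
    l_pixel = mse(x_rec, x) + mse(p_rec, p)
    # Feature loss, Eq. (9): distance between discriminator features.
    feats_real, _ = D(x, p)
    feats_rec, logit_rec = D(x_rec, p_rec)
    l_feat = mse(feats_rec, feats_real.detach())
    # GAN loss, Eq. (10): push reconstructed pairs towards "real".
    l_gan = bce(logit_rec, torch.ones_like(logit_rec))
    return alpha * l_pixel + beta * l_feat + gamma * l_gan

def discriminator_loss(x, p, x_rec, p_rec, D):
    # D itself is trained with the GAN term alone (see Training Strategy).
    _, logit_real = D(x, p)
    _, logit_rec = D(x_rec.detach(), p_rec.detach())
    return (bce(logit_real, torch.ones_like(logit_real))
            + bce(logit_rec, torch.zeros_like(logit_rec)))
```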

Network Architecture
As shown in Figure 3, the proposed PercepPan consists of three parts:
• A generator $G$, which takes as input an LRMS-PAN image pair $(x, p)$ to generate an HRMS image $\hat{y}$;
• A reconstructor $R$, which takes as input a generated HRMS image $\hat{y}$ to reconstruct the corresponding LRMS-PAN image pair, with the outputs denoted as $\tilde{x}$ and $\tilde{p}$, respectively;
• A discriminator $D$, which takes as input real/reconstructed LRMS-PAN image pairs to calculate the feature loss and the GAN loss.
Generator. The generator $G$ needs to fuse spectral details from LRMS images and spatial details from PAN images. Existing generators that feed LRMS-PAN image pairs into networks directly to extract those details [14,17], or that learn residual details from LRMS images [15,16], can play the role of $G$. We also try the ESRGAN-style generator with residual learning according to PAN images,
$$\hat{y}_c = \sigma_c \odot p + \mu_c, \qquad (12)$$
where $\hat{y}_c \in \mathbb{R}^{rH \times rW}$ is the $c$-th band of $\hat{y}$, and $\sigma_c \in \mathbb{R}^{rH \times rW}$ and $\mu_c \in \mathbb{R}^{rH \times rW}$ are residuals learned from $x$. For simplicity, they are also denoted as $\sigma_x = (\sigma_1, \sigma_2, \ldots, \sigma_C)$ and $\mu_x = (\mu_1, \mu_2, \ldots, \mu_C)$, where the subscript indicates that both of them are related to $x$. It should be noted that the multiplication and addition here are element-wise. The residual learning is inspired by a well-known style transfer method called "adaptive instance normalization (AdaIN)" [54]. Specifically, $x$ is treated as the style image, and the corresponding style features $\mu_x$ and $\sigma_x$ are learned by the ESRGAN-style generator, while $p$ is treated as the content image, and the content features $\mu_p$ and $\sigma_p$ are simply assigned as the zero matrix and the identity matrix, respectively.
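A minimal sketch of this AdaIN-style fusion step, with shapes as defined above; the function name is ours:

```python
# Fuse the PAN image with band-wise residuals, Eq. (12):
# y_hat_c = sigma_c * p + mu_c (element-wise, per band c).
import torch

def fuse(p, mu_x, sigma_x):
    """p: (B, 1, rH, rW) PAN image; mu_x, sigma_x: (B, C, rH, rW)
    residuals learned from the LRMS image x."""
    return sigma_x * p + mu_x   # broadcasting repeats p over the band axis
```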
Reconstructor. The reconstructor $R = (R_x, R_p)$ aims at reconstructing LRMS-PAN image pairs from the generated HRMS images. It can be implemented by a neural network. Inspired by [58], we design a shallow architecture for $R$ to simulate the observation process by which satellites acquire LRMS-PAN image pairs.
Because LRMS images are spatially degraded versions of the corresponding HRMS images, the first part of the reconstructor, $R_x$, is treated as a combination of blurring and downsampling,
$$\tilde{x}_c = S(H_c(\hat{y}_c)), \qquad (13)$$
where $\tilde{x}_c \in \mathbb{R}^{W \times H}$ and $\hat{y}_c \in \mathbb{R}^{rW \times rH}$ are the $c$-th spectral bands of $\tilde{x}$ and $\hat{y}$, respectively. $H_c$ is a blurring operator for the $c$-th spectral band, which can be implemented as a convolutional layer, and $S$ is a downsampling operator. Because the PAN image generally covers all the wavelengths of the MS image spectral bands (see Section 4.1 for more detail), the PAN image can be approximated by a linear combination of the HRMS image bands [59]. In other words, the second part of the reconstructor, $R_p$, can be defined as
$$\tilde{p} = \sum_{c=1}^{C} w_c \hat{y}_c, \qquad (14)$$
where $\hat{y}_c \in \mathbb{R}^{rW \times rH}$ is the $c$-th spectral band of $\hat{y}$ and $w_c$ is the corresponding weight. The linear map can be implemented by a $1 \times 1$ convolution.

Discriminator. The discriminator $D$ is responsible for computing the feature loss and the GAN loss. Feature loss computation needs LRMS-PAN image pairs as input. To receive the two kinds of images simultaneously, $D$ contains two input branches, $F = (F_x, F_p)$, with $F_x$ for LRMS images and $F_p$ for PAN images. The extracted features are then fused together.
To compute the GAN loss, $D$ further sends these features into a VGG-style neural network [57]. For each input, the VGG-style architecture outputs a scalar, which represents the probability that the input features come from real data rather than generated data.
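As a concrete illustration of the reconstructor described above, here is a minimal PyTorch sketch, assuming band-wise blur via a grouped convolution for $H_c$, stride-$r$ decimation for $S$, and a $1 \times 1$ convolution for the spectral mixing of Equation (14); the class and layer names are ours.

```python
# A shallow, fixed reconstructor R = (R_x, R_p): per-band blur +
# decimation for R_x, and a 1x1 spectral mixing convolution for R_p.
import torch
import torch.nn as nn

class Reconstructor(nn.Module):
    def __init__(self, bands=4, ratio=4, kernel_size=7):
        super().__init__()
        # R_x: one blur kernel H_c per band (groups=bands), no bias.
        self.blur = nn.Conv2d(bands, bands, kernel_size,
                              padding=kernel_size // 2,
                              groups=bands, bias=False)
        self.ratio = ratio
        # R_p: linear combination of HRMS bands, Eq. (14).
        self.spectral = nn.Conv2d(bands, 1, kernel_size=1, bias=False)

    def forward(self, y_hat):
        x_rec = self.blur(y_hat)[..., ::self.ratio, ::self.ratio]  # R_x
        p_rec = self.spectral(y_hat)                               # R_p
        return x_rec, p_rec
```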

Initialization
Initialization is crucial to training neural networks [60]. The most common strategy is random initialization according to a specific probability distribution [34,35]. Another strategy is pre-training initialization, where weights from a pre-trained network are used. The latter has been leveraged by more and more works recently [38,50].
To initialize the generator G, both random initialization and pre-training initialization are used. For random initialization, a Gaussian distribution is used [61], denoted as the Random style. For pre-training initialization, two pre-trained neural networks are used (https://github.com/xinntao/BasicSR), one called the PSNR style, which is trained with pixel loss, and the other called the ESRGAN style, which is fine-tuned with GAN loss based on the former.
To initialize the reconstructor $R = (R_x, R_p)$, we develop a novel initialization strategy, called prior initialization, in which specific satellite characteristics are used. On the one hand, the blurring operators $H_1, H_2, \ldots, H_C$ in $R_x$ are commonly implemented as Gaussian filters, whose weights are derived from the Nyquist cutoff frequencies of the satellites [62,63]. On the other hand, the linear weights in $R_p$ can be calculated from the normalized spectral response curves of the satellites [58,64]. These characteristic parameters comprise the prior knowledge for initialization and are shown in Table 1 for reference. This prior knowledge plays a role similar to a regularization term, which helps to reduce the uncertainty of $\hat{y}$.
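For concreteness, here is a hedged sketch of this prior initialization, continuing the `Reconstructor` sketch above. One standard way to derive the Gaussian width: the frequency response of a Gaussian with standard deviation $\sigma$ is $\exp(-2\pi^2\sigma^2 f^2)$; equating it to the sensor gain $G_{\mathrm{Nyq}}$ at the Nyquist frequency of the low-resolution grid, $f = 1/(2r)$, gives $\sigma = r\sqrt{-2\ln G_{\mathrm{Nyq}}}/\pi$. The per-band gains and spectral weights below are placeholders, not the values in Table 1.

```python
# Prior initialization of the fixed reconstructor R (sketch).
import math
import torch

def gaussian_kernel(g_nyq, ratio=4, size=7):
    # sigma chosen so the Gaussian's gain at f = 1/(2*ratio) is g_nyq.
    sigma = ratio * math.sqrt(-2.0 * math.log(g_nyq)) / math.pi
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    k1d = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k2d = torch.outer(k1d, k1d)
    return k2d / k2d.sum()                     # unit-sum kernel

# Assumes the Reconstructor class from the earlier sketch.
reconstructor = Reconstructor(bands=4, ratio=4, kernel_size=7)

# Blur kernels H_c from per-band Nyquist gains (placeholder values):
g_nyq_per_band = [0.34, 0.32, 0.30, 0.22]
kernels = torch.stack([gaussian_kernel(g) for g in g_nyq_per_band])
reconstructor.blur.weight.data = kernels.unsqueeze(1)    # (C, 1, k, k)

# Spectral weights w_c from normalized spectral response curves
# (placeholder: a uniform average over the four bands):
w = torch.tensor([0.25, 0.25, 0.25, 0.25])
reconstructor.spectral.weight.data = w.view(1, -1, 1, 1)  # (1, C, 1, 1)
```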
To initialize the discriminator D, a common random initialization is enough [50], and again, a Gaussian distribution is used [61].

Training Strategy
Different parts of the proposed PercepPan need to be trained differently.
When training G, all of the losses take effect; as Equation (11) shows, G is affected by all three kinds of losses. Pixel loss alone would result in blurring [49], while the combination of feature loss and GAN loss might introduce undesired artifacts [50]. The hybrid of the three losses might diminish their respective drawbacks and lead to a better G.
When training D, only the GAN loss takes effect. The reason is that the pixel loss is computed before D, so it does not affect D at all, and the feature loss would easily make D collapse to 0 in practice [49]. Therefore, D is trained by the GAN loss alone.
In contrast, R is fixed during training. An ideal R reflects the quality of its input faithfully through the quality of its output. When the output is terrible, the input generally is terrible as well, and so is the generator G; in this case, the error signal is triggered only by G, and the signal helps to train G properly. However, when R itself is terrible, it can output terrible reconstructions even if G is good enough; in this case, the error signals might lead G in a wrong direction. Because R is constructed and initialized with prior knowledge, it can be supposed that R is good enough and does not need further training. Therefore, the final objective function of the proposed PercepPan for unsupervised pan-sharpening with perceptual loss is
$$G^{*} = \arg\min_{G \in \mathcal{G}} \max_{D \in \mathcal{D}} \sum_{n=1}^{N} \left[ \alpha L_{\text{pixel}}^{(n)} + \beta L_{\text{feat}}^{(n)} + \gamma L_{\text{GAN}}^{(n)} \right], \quad \text{with } R \text{ fixed.} \qquad (15)$$

Another issue is how to balance the training procedures of G and D [65,66]. Inspired by the two time-scale update rule for training GANs [67], we use individual learning rates for G and D to balance the procedure. Intuitively, imagine that G and D are two learners, and the error signals are knowledge that they have to learn. In general, different learners should have different learning capabilities, which can be controlled by the learning rate: a large learning rate means a strong learning capability, and a small one means a weak capability. In the experiments, G and D are trained with different learning rates.
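As a sketch of this alternating procedure, the two time-scale update rule reduces to two Adam optimizers with different learning rates. It assumes the `generator_loss` and `discriminator_loss` helpers from the loss sketch above, a generic generator `G` mapping an LRMS-PAN pair to a fused HRMS image, the fixed reconstructor `R`, a discriminator `D`, and a data `loader`; none of these names are the authors' own code.

```python
# Alternating G/D updates with separate learning rates; R is fixed,
# so it gets no optimizer and receives no weight updates.
import torch

opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)  # "strong" learner
opt_D = torch.optim.Adam(D.parameters(), lr=1e-5)  # "weak" learner

for x, p in loader:                       # full-scale LRMS-PAN pairs
    # G step: hybrid loss (pixel + feature + GAN), Eq. (15).
    x_rec, p_rec = R(G(x, p))
    opt_G.zero_grad()
    generator_loss(x, p, x_rec, p_rec, D).backward()
    opt_G.step()

    # D step: GAN loss alone, on a fresh forward pass after the G step.
    x_rec, p_rec = R(G(x, p))
    opt_D.zero_grad()
    discriminator_loss(x, p, x_rec, p_rec, D).backward()
    opt_D.step()
```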

Datasets and Algorithms
MS-PAN image pairs from satellites are large in size and coordinate-aligned. They are usually first cropped into small patches of proper size to construct the datasets. The constructed datasets are summarized in Table 2.
The first dataset, the full-scale dataset, is composed of these cropped patches. Compared with the original MS-PAN image pairs, these patches are smaller only in size but preserve the original spatial resolution. These patches are treated as LRMS-PAN image pairs, i.e., input to our PercepPan with the SPUF training paradigm. Models trained on this dataset are then assessed by no-reference indices. The related training procedure is shown in Algorithm 1.

Algorithm 1: Unsupervised training on the full-scale dataset
1: Initializing the weights w_G and w_D
2: for each training iteration do
3:   Sampling a mini-batch of image pairs, {x^(n), p^(n)}_{n=1}^{bs}, from the full-scale training dataset
4:   Computing loss l_G = Σ_{n=1}^{bs} (α l_pixel^(n) + β l_feat^(n) + γ l_GAN^(n))
5:   Computing gradient g_G = ∇_G l_G
6:   Updating weights w_G ← w_G − η_G · Adam(w_G, g_G)
7:   Sampling a mini-batch of image pairs from the full-scale training dataset
8:   Computing loss l_D = Σ_{n=1}^{bs} l_GAN^(n)
9:   Computing gradient g_D = ∇_D l_D
10:  Updating weights w_D ← w_D − η_D · Adam(w_D, g_D)
11: end for

The second dataset, the reduced-scale dataset, is constructed under Wald's protocol [26], as is traditional. The original MS-PAN image pairs are first cropped into small patches and then degraded by blurring and downsampling. These degraded patches are treated as LRMS-PAN image pairs, i.e., input to our PercepPan with the SPSF training paradigm. The original non-degraded MS patches are treated as the ground-truth HRMS images, both for loss computation and for full-reference image quality assessment. This training procedure is similar to the full-scale one and is shown in Algorithm 2. It should be noted that, for SPSF training, the reconstructor R is removed, and the discriminator is adjusted to take HRMS images. These changes make Algorithm 2 slightly faster than Algorithm 1, at about 0.510 s versus 0.525 s per iteration in our experiments.

Algorithm 2: Supervised training on the reduced-scale dataset
1: Initializing the weights w_G and w_D
2: for each training iteration do
3:   Sampling a mini-batch of image pairs, {x^(n), p^(n)}_{n=1}^{bs}, from the reduced-scale training dataset
4:   Computing loss l_G = Σ_{n=1}^{bs} (α l_pixel^(n) + β l_feat^(n) + γ l_GAN^(n))
5:   Computing gradient g_G = ∇_G l_G
6:   Updating weights w_G ← w_G − η_G · Adam(w_G, g_G)
7:   Sampling a mini-batch of image pairs, {x^(n), p^(n)}_{n=1}^{bs}, from the reduced-scale training dataset
8:   Computing loss l_D = Σ_{n=1}^{bs} l_GAN^(n)
9:   Computing gradient g_D = ∇_D l_D
10:  Updating weights w_D ← w_D − η_D · Adam(w_D, g_D)
11: end for

Experiments
In this section, the proposed PercepPan with different training strategies is evaluated on different datasets. It is also compared with other deep learning methods for pan-sharpening.

Experiment Settings
Datasets. Images come from two different satellites, QuickBird and IKONOS. Table 3 summarizes the spectral and spatial information of these two satellites [68,69]. As described in Section 3.5, MS-PAN image pairs are first cropped (and/or degraded) into small patches, and these patches are then randomly split into three groups for training, validation, and test in the proportion 6:2:2. The dataset information is summarized in Table 4. It should be noticed that the scale factor is r = 4 in this paper, which is consistent with the spatial resolution ratio between MS and PAN images from either QuickBird or IKONOS.
Other Generators. Only neural network-based methods are used as the generator G, namely PNN [14], RSIFNN [15], PanNet [16], and PSGAN [17]. These methods are trained in a supervised manner with the recommended settings from the corresponding papers but on our reduced-scale dataset, and are then generalized to the full-scale dataset directly. Classical methods, such as [1,4,7], are not taken into consideration.

Hyper-parameters. As stated in Section 3.3, the ESRGAN-style generator G is initialized by one of the Random, PSNR, and ESRGAN styles. The hyper-parameters in Equation (11) are given in advance, (α, β, γ) ∈ {(1, 0, 0), (0, 1, 0.01), (1, 1, 0.01)}, in which (1, 0, 0) means only the pixel loss takes effect, (0, 1, 0.01) means the feature loss and GAN loss take effect, and (1, 1, 0.01) means all three losses take effect at the same time. It should be noticed that the weight 0.01 is used to give the GAN loss the same order of magnitude as the other two losses in the early training stage. The batch size is 4, and the number of training iterations is 5000. The whole network is trained by Adam [70]. Inspired by the two time-scale update rule [67], the learning rates η_G and η_D are chosen individually from {1 × 10^{-4}, 1 × 10^{-5}}.
All experiments are implemented with the deep learning framework PyTorch [71], version 0.4.0. All code runs on an Nvidia GTX 1080Ti GPU and an Intel Core i7 6700 CPU. The code is available at our GitHub homepage (https://github.com/wasaCheney/PercepPan).

Image Quality Assessment
The proposed PercepPan, together with the other methods, is evaluated by common quality assessment indices, including full-reference indices for the reduced-scale experiments, namely the spectral angle mapper (SAM) [72], peak signal-to-noise ratio (PSNR), spatial correlation coefficient (SCC) [73], universal image quality index (Q-index) [74], structural similarity (SSIM) [75], and erreur relative globale adimensionnelle de synthèse (ERGAS) [76]; and no-reference indices for the full-scale experiments, namely $D_\lambda$, $D_s$, and QNR [77]. For convenience, we briefly describe them here for reference. Denote a generated image by $\hat{I} \in \mathbb{R}^{H \times W \times C}$ and the corresponding ground-truth image by $I \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ are the height, width, and number of channels, respectively.

• SAM measures spectral distortion. Denote $\hat{I}_{i,j}, I_{i,j} \in \mathbb{R}^{C}$ as the vectors at pixel position $(i, j)$ of $\hat{I}$ and $I$, respectively; then
$$\mathrm{SAM}(\hat{I}, I) = \frac{1}{HW} \sum_{i,j} \arccos \frac{\langle \hat{I}_{i,j}, I_{i,j} \rangle}{\|\hat{I}_{i,j}\|_2 \, \|I_{i,j}\|_2},$$
where $\langle \cdot, \cdot \rangle$ is the inner product operator.

• PSNR is a commonly used image quality assessment method,
$$\mathrm{PSNR}(\hat{I}, I) = 20 \log_{10} \frac{\mathrm{MAX}_I}{\mathrm{RMSE}(\hat{I}, I)},$$
where $\mathrm{MAX}_I$ is the maximum possible pixel value of $I$, and $\mathrm{RMSE}(\cdot, \cdot)$ stands for the root mean squared error. The same holds below.

• SCC is a spatial quality index. Denote $\hat{I}_c, I_c \in \mathbb{R}^{H \times W}$ as the $c$-th bands of $\hat{I}$ and $I$, respectively. Then,
$$\mathrm{SCC}(\hat{I}, I) = \frac{1}{C} \sum_{c=1}^{C} \frac{\mathrm{Cov}(\hat{I}_c, I_c)}{\sigma(\hat{I}_c) \, \sigma(I_c)},$$
where $\mathrm{Cov}(\cdot, \cdot)$ is the covariance, $\sigma(\cdot)$ is the standard deviation, and both are computed on high-pass filtered bands. The same holds below.

• Q-index gathers image luminance, contrast, and structure for quality assessment. After dividing $\hat{I}$ and $I$ into small patches, the Q-index is computed patch-wise as
$$Q(\hat{I}, I) = \frac{4 \, \mathrm{Cov}(\hat{I}, I) \, \mu(\hat{I}) \mu(I)}{\big(\sigma^2(\hat{I}) + \sigma^2(I)\big)\big(\mu^2(\hat{I}) + \mu^2(I)\big)},$$
where $\mu(\cdot)$ stands for the mean value. The same holds below.

• SSIM is a famous image quality assessment method and an extension of the Q-index,
$$\mathrm{SSIM}(\hat{I}, I) = \frac{\big(2\mu(\hat{I})\mu(I) + c_1\big)\big(2\,\mathrm{Cov}(\hat{I}, I) + c_2\big)}{\big(\mu^2(\hat{I}) + \mu^2(I) + c_1\big)\big(\sigma^2(\hat{I}) + \sigma^2(I) + c_2\big)},$$
where $c_1 = (0.01\,\mathrm{MAX}_I)^2$, $c_2 = (0.03\,\mathrm{MAX}_I)^2$, and $c_3 = c_2/2$.

• ERGAS is another common image quality assessment method. Denote the spatial resolution ratio between MS images and the corresponding PAN images by $r$. Then,
$$\mathrm{ERGAS}(\hat{I}, I) = \frac{100}{r} \sqrt{\frac{1}{C} \sum_{c=1}^{C} \left( \frac{\mathrm{RMSE}(\hat{I}_c, I_c)}{\mu(I_c)} \right)^2}.$$

• QNR is a no-reference image quality assessment method. It consists of a spectral distortion index $D_\lambda$ and a spatial distortion index $D_s$. Denote an LRMS image with $C$ spectral bands as $I^{\mathrm{LRMS}}$, the corresponding generated HRMS image as $I^{\mathrm{HRMS}}$, the PAN image with a single band as $I^{\mathrm{PAN}}$, and its degraded counterpart as $I^{\mathrm{LRPAN}}$; then
$$D_\lambda = \sqrt[u]{\frac{1}{C(C-1)} \sum_{c=1}^{C} \sum_{d \neq c} \left| Q(I^{\mathrm{HRMS}}_c, I^{\mathrm{HRMS}}_d) - Q(I^{\mathrm{LRMS}}_c, I^{\mathrm{LRMS}}_d) \right|^u},$$
$$D_s = \sqrt[v]{\frac{1}{C} \sum_{c=1}^{C} \left| Q(I^{\mathrm{HRMS}}_c, I^{\mathrm{PAN}}) - Q(I^{\mathrm{LRMS}}_c, I^{\mathrm{LRPAN}}) \right|^v},$$
$$\mathrm{QNR} = (1 - D_\lambda)^a (1 - D_s)^b,$$
where $u = v = 1$ and $a = b = 1$, as usual.

Figure 5 shows the score trend of the different indices with respect to the level of noise. We choose two additive noises, Gaussian noise (gauss) and Laplace noise (laplace), which result in both spectral and spatial distortion, as well as two multiplicative noises, average blur (avg_blur) and Gaussian blur (gauss_blur), which result in spatial distortion only. As the figure shows, among the full-reference indices (the first and second rows), all are sensitive to the additive noises, but only the Q-indices are sensitive to the multiplicative noises; among the no-reference indices (the third row), $D_\lambda$ is more sensitive to the additive noises but hardly sensitive to the multiplicative noises, as its expression shows, while $D_s$ and QNR are sensitive to both kinds of noise. Please refer to Figure A1 in Appendix A for the noised images.
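To make two of these definitions concrete, here is a small NumPy sketch of SAM and ERGAS under the conventions above; images are assumed to be arrays of shape (H, W, C), and the eps guard is ours.

```python
# Sketch of two full-reference indices: SAM (mean spectral angle,
# in radians) and ERGAS (band-wise relative RMSE, scaled by 100/r).
import numpy as np

def sam(pred, ref, eps=1e-8):
    # Angle between the per-pixel spectra of pred and ref, averaged.
    dot = (pred * ref).sum(axis=-1)
    norms = np.linalg.norm(pred, axis=-1) * np.linalg.norm(ref, axis=-1)
    return np.arccos(np.clip(dot / (norms + eps), -1.0, 1.0)).mean()

def ergas(pred, ref, ratio=4):
    # Band-wise RMSE normalized by the band mean, then 100/r scaling.
    rmse = np.sqrt(((pred - ref) ** 2).mean(axis=(0, 1)))
    rel = rmse / ref.mean(axis=(0, 1))
    return 100.0 / ratio * np.sqrt((rel ** 2).mean())
```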
In summary, different indices react differently to the level of noise, so it is necessary to assess image quality with a combination of different indices. More generally, assessment scores cannot reflect image quality perfectly, so visual analysis is also necessary, especially for the full-scale experiments.

Table 5 shows the results of the proposed PercepPan with the ESRGAN-style generator on the QuickBird dataset. In the table, "Random" means that G is initialized by a Gaussian distribution [61], while "PSNR" and "ESRGAN" mean that G is initialized by the corresponding pre-trained models [50]. "Reduced-scale" means networks are trained and evaluated on the reduced-scale dataset; results before "/" in the "Full-scale" columns mean networks are trained on the reduced-scale dataset but evaluated on the full-scale dataset, as is traditional; and results after "/" in the "Full-scale" columns mean networks are trained and evaluated on the full-scale dataset directly.
As is traditional, models trained on the reduced-scale dataset are then evaluated on the full-scale dataset (results before "/" in the "Full-scale" columns). Again, pre-training initialization generally performs better than random initialization. The hybrid loss outperforms the pixel loss, especially on the D_s and QNR indices. The learning rates (η_G, η_D) = (1 × 10^{-4}, 1 × 10^{-5}) still cooperate better with the hybrid loss. All of these conclusions are consistent with those on the reduced-scale dataset. Finally, models are trained and evaluated on the full-scale dataset (results after "/" in the "Full-scale" columns). Again, conclusions similar to the above can be drawn. Moreover, when compared with the traditional perspective (results before "/" in the "Full-scale" columns), these unsupervised results are clearly better. This indicates that unsupervised training directly on the full-scale dataset is more suitable for pan-sharpening.
In summary, these results suggest that pre-training initialization and the hybrid loss, together with the learning rates (η_G, η_D) = (1 × 10^{-4}, 1 × 10^{-5}), are preferable for the pan-sharpening problem on both the reduced-scale and full-scale datasets. Furthermore, it is better to train networks directly on the full-scale dataset.
These results indicate that the answers to our second and third questions are both positive; that is, pre-training can offer gains in the SPUF training paradigm, and the proposed unsupervised perspective is superior to the traditional supervised perspective.

Generalization: Generator
As stated above, any neural network-based pan-sharpening generator could play the role of G in the proposed PercepPan framework.
In this part, the four compared pan-sharpening models are employed as the generator G in the unsupervised framework with (α, β, γ) = (1, 1, 0.01) and (η_G, η_D) = (1 × 10^{-4}, 1 × 10^{-5}). As a comparison, these models are also trained in a supervised manner with the recommended settings from the corresponding papers, and all of them use random initialization. Table 6 shows the comparison results on the QuickBird dataset.

Table 5. Quality assessment of the proposed PercepPan under different settings on the QuickBird dataset. "-" means the corresponding entry is invalid, and the best value of each index is shown in parentheses.

When trained and evaluated on the reduced-scale dataset (the "Reduced-scale" columns), all of these models work as well as PercepPan with the ESRGAN-style generator. Specifically, on the SAM, PSNR, and SCC indices, these new generators outperform the ESRGAN-style generator, while on the Q-index, SSIM, and ERGAS indices, the opposite holds.
When trained on the reduced-scale dataset but evaluated on the full-scale dataset (results before "/" in the "Full-scale" columns), these models outperform PercepPan with the ESRGAN-style generator. According to the QNR index, RSIFNN, PanNet, and PSGAN are better than the best result of PercepPan with the ESRGAN-style generator, that is, 0.805, 0.794, and 0.779 versus 0.757 (with ESRGAN initialization, (α, β, γ) = (1, 1, 0.01), and (η_G, η_D) = (1 × 10^{-4}, 1 × 10^{-5}) in Table 5). However, it should be noticed that this paper aims at an unsupervised perspective for pan-sharpening rather than striving for high assessment scores under the traditional perspective.
When trained and evaluated on the full-scale dataset (results after "/" in the "Full-scale" columns), these models also outperform PercepPan with the ESRGAN-style generator. More importantly, these unsupervised results are better than the supervised ones (results before "/" in the "Full-scale" columns). This means our unsupervised perspective can improve these models' performance; that is to say, the unsupervised perspective for pan-sharpening is effective.
In summary, the proposed framework cooperates well with different generators. The results show that the unsupervised perspective is comparable with the traditional supervised one. For perceptual intuition, Figure 6 shows the fused results of two randomly selected samples from the test set, with the top two rows corresponding to one sample and the bottom two rows to the other. For each sample, the first row shows the supervised results and the second row the unsupervised results. In each image, the top-left box shows a zoomed-in version of a selected location for detailed comparison. It can be seen that the perceptual quality of our unsupervised manner is better than that of the corresponding supervised one, especially with PNN (the second column) as the generator.

Generalization: Dataset
How does the proposed unsupervised perspective generalize to another dataset? In this part, we evaluate the trained models on a new dataset, the IKONOS dataset. Compared with the QuickBird dataset, the IKONOS dataset has different characteristics, as Tables 1 and 3 show, and it is also much smaller, as Table 4 shows.

Table 6. Quality assessment of different methods under the supervised/unsupervised manner on the QuickBird dataset. The best value of each index is shown in parentheses.
The results show that, on both the reduced-scale and the full-scale datasets, all trained models do work, but perform somewhat worse. This phenomenon might be caused by the different characteristics of the IKONOS and QuickBird satellites. Moreover, it can still be observed that the proposed unsupervised perspective outperforms the traditional one when evaluated on the full-scale dataset, but the superiority is mainly driven by the spatial index D_s, which needs further study.
For perceptual intuition, Figure 7 shows the fused results of two randomly selected samples from the test set, with the top two rows corresponding to one sample and the bottom two rows to the other. For each sample, the first row shows the supervised results and the second row the unsupervised results. In each image, the top-left box shows a zoomed-in version of a selected location for detailed comparison. Overall, the generalization results are worse than those on the QuickBird dataset, which is consistent with the quantitative scores. Moreover, our unsupervised results still outperform the supervised results, especially with PNN (the second column) and PSGAN (the fifth column) as the generator.

Conclusions
The pan-sharpening problem always faces the issue that high resolution multispectral images are unavailable. Traditional methods follow Wald's protocol and degrade the original images for network training in a supervised manner.
In this paper, we find that the degradation step is not necessary for network training, and we propose an unsupervised pan-sharpening framework, PercepPan, by combining an auto-encoder and perceptual loss. The novel framework works not only on reduced-scale datasets in the traditional supervised manner, but also on full-scale datasets in an unsupervised manner. Experiments on the QuickBird dataset show that the unsupervised framework cooperates well with different pan-sharpening generators, and that the unsupervised results are comparable with the supervised counterparts. When generalizing to the IKONOS dataset, the unsupervised framework is still competitive.
However, it is still far from completely unsupervised pan-sharpening. As the experiments show, without pre-training initialization, the proposed PercepPan performs poorly. Moreover, the reconstructor needs to be initialized according to a specific satellite, which might worsen its generalization performance. Both of these issues are worthy of further consideration.

Acknowledgments:
The authors would like to thank Zengjie Song, Junying Hu, Kai Sun, and Guang Shi for their insightful comments and thank the developers and maintainers of PyTorch.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A

Figure A1. Noised images and the corresponding image quality assessment scores. A greater level value means stronger noise.