While both high spatial and high spectral resolution are desired in remote sensing, imaging systems must sacrifice resolution in one domain to achieve the required performance in the other. As a result, the majority of spaceborne remote sensing imaging systems acquire two types of observations: high spatial resolution single-band panchromatic (PAN) images, and moderate to high spectral resolution multispectral (MS) or hyperspectral (HS) observations at much lower spatial resolution. For instance, the SPOT 6 and 7 satellites acquire PAN imagery at
m spatial resolution and 6 m spatial resolution for the 4 band MS, while the WorldView-4 acquires PAN imagery at
m and 4 bands MS at
m. Pan-sharpening methods explore how the two types of observations can be combined to produce observations with both high spatial and high spectral resolution [96]. Several DNN approaches have recently been proposed for this problem.
4.1. AE-Based Approaches
Huang et al. [97] are among the first to consider DNNs for pan-sharpening remote sensing observations. Specifically, they propose a modified Sparse Denoising Autoencoder (MSDA) architecture that learns the patch-level relationship between low and high spatial resolution MS and PAN images. The authors pre-train a series of MSDAs, stack them into a stacked MSDA, and then fine-tune the entire network to infer high spatial resolution image patches for each MS band. The inferred patches are averaged where they overlap to produce the final image/cube. The method is evaluated on data from the IKONOS and QuickBird satellites, where it produces higher quality imagery than conventional algorithms.
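The patch-averaging step described above can be sketched as follows. This is a minimal illustration and not the implementation of [97]: the function name `average_overlapping_patches`, the patch size, and the stride are assumptions chosen for the example, and the "inferred" patches are simply cut from a known image so the result can be checked.

```python
import numpy as np

def average_overlapping_patches(patches, positions, out_shape):
    """Merge 2-D patches into one image, averaging wherever patches overlap.

    patches   : list of (ph, pw) arrays (e.g. network outputs for one band)
    positions : list of (row, col) top-left corners, one per patch
    out_shape : (H, W) of the reconstructed band
    """
    acc = np.zeros(out_shape, dtype=float)    # running sum of patch values
    count = np.zeros(out_shape, dtype=float)  # how many patches cover each pixel
    for patch, (r, c) in zip(patches, positions):
        ph, pw = patch.shape
        acc[r:r + ph, c:c + pw] += patch
        count[r:r + ph, c:c + pw] += 1.0
    if np.any(count == 0):
        raise ValueError("patches do not cover the whole image")
    return acc / count

# Illustrative use: 4x4 patches with stride 2 (both values are assumptions).
rng = np.random.default_rng(0)
band = rng.random((8, 8))
positions = [(r, c) for r in range(0, 5, 2) for c in range(0, 5, 2)]
patches = [band[r:r + 4, c:c + 4] for r, c in positions]
merged = average_overlapping_patches(patches, positions, band.shape)
```

Because every patch here is an exact crop of `band`, averaging the overlaps recovers `band`; with real network outputs the averaging would instead smooth out disagreements between neighbouring patches.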
To extract the most representative features from each source, a multi-level neural network approach to the pan-sharpening problem is proposed in [98]. The authors propose a network composed of four individual Stacked Autoencoder (SAE) networks, specifically a variation called Denoising SAE, which are coupled together to derive the desired high-resolution MS image. Regarding the role of each network, two of them represent the high-resolution PAN image and the upsampled MS image in a new feature space via SAE, while the third builds the connection (i.e., mapping) between these two representations. Finally, a fourth, fine-tuning network is introduced at the highest level to refine the learned parameters of all networks via joint optimization through backpropagation. The proposed method is evaluated on the QuickBird dataset, where it outperforms not only conventional pan-sharpening techniques but also deep learning-based ones (i.e., [99]).
A multi-resolution analysis framework for the pan-sharpening problem is proposed in [100], where a two-stage DNN scheme based on the SAE framework is investigated. In the first stage, a low-resolution PAN image is constructed from its high-resolution counterpart with the help of two autoencoders (the second is trained on the features derived by the first). In the second stage, the desired high-resolution MS image is reconstructed from its low-resolution version according to the relationship learned between the PAN images. Once both AEs are trained separately, the whole network (i.e., both together) is fine-tuned to optimize its parameters. The proposed approach is tested on two separate datasets (namely QuickBird and WorldView-3), where it outperforms the competing algorithms. A possible shortcoming of the work is the absence of a comparison with other deep learning-based methods, which have been the state of the art in the field in recent years.
A recent approach using AEs for pan-sharpening is proposed by Azarang et al. [101], where the authors consider a convolutional AE architecture, composed of an encoding stage and the associated decoding stage, for improving the spatial information of the low-resolution MS bands. The objective is achieved by learning the nonlinear relationship between a PAN image and its spatially degraded version at the patch level, and then using the trained model to increase the spatial resolution of each MS band independently. Evaluation on data from QuickBird, Pleiades and GeoEye-1 demonstrates some gains compared to non-DL-based methods.
4.2. CNN-Based Approaches
In addition to AEs, CNNs have also been considered for pan-sharpening. One of the first attempts to employ CNNs in pan-sharpening is reported in [102], where the authors consider a CNN based on the SRCNN [34] architecture for increasing the spatial resolution of MS observations in a band-by-band fashion. The spatially enhanced band corresponding to the PAN image is then used to adjust the original PAN image, before a Gram–Schmidt transform is applied to fuse the MS with the PAN observations. The model is trained on natural images from the 2013 version of the ImageNet dataset, while observations from the QuickBird satellite are used for validation, demonstrating higher quality estimation compared to competing non-DL methods.
Another early attempt at using CNNs for pan-sharpening is explored in [76], which also borrows ideas from the SRCNN super-resolution architecture and extends them to the pan-sharpening problem. The key novelty of this work is the upsampling of the low spatial resolution 4-band MS observations to the resolution of the PAN image and the stacking of both observations into a 5-component input. The authors further propose the injection of nonlinear radiometric parameters (typical in remote sensing applications) as additional inputs to the network, which also contributes to an increase in output image quality. The performance of the proposed approach, termed PNN, is validated using three different datasets from the IKONOS, GeoEye-1, and WorldView-2 satellites, where it outperforms the competing non-DL algorithms with respect to various full-reference and no-reference metrics.
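The 5-component input described above can be illustrated with a short sketch. It is not the code of [76]: the function name `build_stacked_input`, the resolution ratio of 4, and the use of nearest-neighbour upsampling are assumptions made to keep the example small, and the additional radiometric inputs are omitted.

```python
import numpy as np

def build_stacked_input(ms, pan, ratio=4):
    """Stack an upsampled MS cube with the PAN band into one network input.

    ms    : (bands, h, w) low-resolution multispectral cube
    pan   : (h * ratio, w * ratio) panchromatic image
    ratio : PAN/MS resolution ratio (assumed to be 4 here)
    """
    bands, h, w = ms.shape
    if pan.shape != (h * ratio, w * ratio):
        raise ValueError("PAN size must equal MS size times the resolution ratio")
    # Nearest-neighbour upsampling is an assumption made for brevity;
    # a smoother interpolator (e.g. bicubic) could be substituted here.
    ms_up = ms.repeat(ratio, axis=1).repeat(ratio, axis=2)
    # Channel order: the MS bands first, the PAN band last.
    return np.concatenate([ms_up, pan[np.newaxis]], axis=0)

rng = np.random.default_rng(0)
ms = rng.random((4, 16, 16))   # 4-band MS patch
pan = rng.random((64, 64))     # co-registered PAN patch
x = build_stacked_input(ms, pan)
print(x.shape)  # (5, 64, 64)
```

Placing the PAN band in the same array as the upsampled MS bands lets an ordinary single-input CNN see both sources at every pixel from the first layer onwards.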
Another early application of CNNs to pan-sharpening is proposed by Li et al. [103], who, inspired by the VDSR architecture [31], introduce convolution/deconvolution layers and residual connections. The proposed method, called DCNN, estimates the full-resolution output not by averaging different patches but in a band/channel-wise fashion, achieving promising results with respect to the competing methods on two different datasets from the QuickBird and GeoEye-1 satellites.
Wei et al. [99] also propose a residual CNN architecture, in order to evaluate the performance of deeper networks as opposed to shallow architectures. To overcome the vanishing gradient problem that characterizes very deep architectures, the proposed approach, called DRPNN, considers a residual architecture with skip connections, inspired by the VDSR method for super-resolution, allowing the network to accept low-resolution images at the input, i.e., without the need for an upsampling pre-processing step. The method is tested on two separate datasets, namely QuickBird and WorldView-2, and demonstrates significant gains compared to traditional non-DL approaches, as well as a CNN-based one [76]. A similar residual CNN approach is also considered in [105], which is shown to outperform conventional algorithms, as well as the CNN-based method [76], on observations from the LANDSAT 7 ETM+.
In a similar vein, i.e., using residual connections for training deeper CNNs, the authors in [106] consider an architecture based on ResNet [30]. The proposed scheme incorporates domain-specific knowledge to preserve both spatial and spectral information, by adding the upsampled MS images to the output and training the network in the high-frequency domain. Experimental results demonstrate the superiority of the proposed method compared to conventional pan-sharpening methods (as well as the CNN-based method [76]) on the WorldView-3 dataset, as well as greater generalization capabilities in new datasets, namely WorldView-2 and WorldView-3, where the proposed approach is capable of producing high-quality results without the need to be retrained.
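The two ingredients described above, working on high-frequency content and adding the upsampled MS image back to the output, can be sketched as below. This is an illustration under stated assumptions rather than the method of [106]: the box filter used as a low-pass filter, the kernel size, and `placeholder_network` (a hypothetical stand-in for the trained CNN that simply copies the PAN details into every band) are all choices made for the example.

```python
import numpy as np

def box_blur(img, k=5):
    """Mean filter with edge padding; img is (H, W) and k is odd."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def high_pass(img, k=5):
    """High-frequency content: the image minus its low-pass version."""
    return img - box_blur(img, k)

def placeholder_network(hp_stack):
    """Hypothetical stand-in for a trained CNN.

    hp_stack holds the high-passed MS bands followed by the high-passed PAN.
    Here the PAN details are copied to every MS band.
    """
    pan_hp = hp_stack[-1]
    return np.broadcast_to(pan_hp, (hp_stack.shape[0] - 1,) + pan_hp.shape)

def sharpen(ms_up, pan, network=placeholder_network):
    """Predict details from high-frequency inputs, then add the upsampled MS."""
    hp_stack = np.stack([high_pass(b) for b in ms_up] + [high_pass(pan)])
    details = network(hp_stack)
    # Skip connection: the upsampled MS carries the spectral content,
    # so the network only has to supply the missing spatial details.
    return ms_up + details

rng = np.random.default_rng(0)
ms_up = rng.random((4, 32, 32))   # MS already upsampled to the PAN grid
pan = rng.random((32, 32))
fused = sharpen(ms_up, pan)
```

With this structure, a PAN image without spatial detail (for example a constant image) leaves the upsampled MS unchanged, which is one way to see why the spectral content is preserved.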
Another CNN approach for pan-sharpening is proposed in [107], where the authors introduce multiscale feature extraction by using convolutional filters of three different sizes at each layer to build a multiscale network, inspired by the inception architecture [33]. By concatenating the derived features across the spectral dimension, they retain as much spatial information as possible while minimizing spectral distortion. To build a deeper network architecture, they adopt residual connections, leading to sparse and informative features. The proposed architecture is tested on two different datasets from the QuickBird and WorldView-2 satellites, where it outperforms competing approaches, including the CNN baseline [76].
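A multiscale layer of the kind described above applies filters of several sizes in parallel and concatenates the responses along the channel axis. The sketch below shows only that structure. The kernel sizes 3, 5 and 7, the random kernel weights, and the function names are assumptions for illustration, since the exact configuration of [107] is not given here.

```python
import numpy as np

def conv2d_same(img, kernel):
    """Same-size 2-D correlation with zero padding; kernel sides must be odd."""
    kh, kw = kernel.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(img.shape, dtype=float)
    for dy in range(kh):
        for dx in range(kw):
            out += kernel[dy, dx] * padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out

def multiscale_block(img, kernels):
    """Apply every kernel to the same input and stack the feature maps."""
    return np.stack([conv2d_same(img, k) for k in kernels], axis=0)

rng = np.random.default_rng(0)
# Three filter sizes; 3, 5 and 7 are assumed values, not taken from the paper.
kernels = [rng.standard_normal((s, s)) for s in (3, 5, 7)]
img = rng.random((32, 32))
features = multiscale_block(img, kernels)
print(features.shape)  # (3, 32, 32)
```

Each output channel sees a different receptive field of the same input, so the following layer can combine fine and coarse spatial context.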
Building upon the method proposed in [76], which became a baseline for most of the aforementioned deep learning-based approaches for pan-sharpening, the same authors propose an updated approach in [109
]. More specifically, they modified their initial approach by taking into account research that had taken place after their initial publication, by adopting the following changes: (i) use of
loss function-instead of
; (ii) processing and learning in the image residual domain rather than in the raw image domain; (iii) use of deeper network architectures (in [76], the network consisted of only three convolutional layers). These modifications are tested on the GeoEye-1 dataset and compared to the initial setup proposed in [76], and prove beneficial both in terms of performance and in terms of reduced training time.
Capitalizing on these observations, the authors extend the approach in [110] by making it adaptive to every new image to be pan-sharpened by the trained network. To do so, the pre-trained deep neural network is first fine-tuned on each new image (until convergence), and the image is then passed through the fine-tuned model to be pan-sharpened. To validate the updated approach, the authors test it on four different datasets, namely IKONOS, GeoEye-1, WorldView-2 and WorldView-3, and compare it with their initial approach [76] as well as with conventional pan-sharpening techniques. The reported results on all datasets confirm the merits of the new approach, both in terms of performance metrics and of visual inspection.
Another approach involving CNNs for pan-sharpening is the RSIFNN architecture [111], which poses the problem as a fusion of PAN and MS inputs. Specifically, RSIFNN introduces a two-stream fusion paradigm in which features from the PAN and MS inputs are first extracted independently and subsequently fused to produce high spatio-spectral observations, while a residual connection between the low-resolution MS and the output is also introduced. The authors consider PAN and MS (4 bands) observations from the QuickBird and Gaofen-1 satellites and generate synthetic examples through appropriate downsampling (Wald’s protocol) to compare the performance of the proposed method with various state-of-the-art methods, including the CNN-based method by Masi et al. [76
] as well as the SRCNN method. The authors report a gain of around
dB compared to [76
] and significant gains on additional metrics such as SAM, while also demonstrating a dramatic decrease (
) in running time for inference compared to other CNN-based approaches. In a similar spirit, Zhang et al. [112] consider two-stream bidirectional pyramid networks which can extract spatial information from PAN imagery and inject it into the MS observations at multiple scales. Validation using GF2, IKONOS, QuickBird, and WorldView-3 observations demonstrates some minor improvements in performance.
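The synthetic-example generation mentioned above (Wald's protocol) degrades both inputs by the resolution ratio so that the original MS image can serve as the reference. The sketch below illustrates the idea only. Block averaging is an assumption used in place of the sensor-matched low-pass filtering that careful implementations apply, and the function names and the ratio of 4 are likewise illustrative.

```python
import numpy as np

def block_average(img, ratio):
    """Downsample the last two axes by averaging ratio x ratio blocks."""
    *lead, h, w = img.shape
    if h % ratio or w % ratio:
        raise ValueError("image size must be a multiple of the ratio")
    blocks = img.reshape(*lead, h // ratio, ratio, w // ratio, ratio)
    return blocks.mean(axis=(-3, -1))

def reduced_resolution_example(ms, pan, ratio=4):
    """Return (degraded MS, degraded PAN, reference MS).

    The network is given the two degraded inputs and its output is
    compared against the original MS, which acts as ground truth.
    """
    return block_average(ms, ratio), block_average(pan, ratio), ms

rng = np.random.default_rng(0)
ms = rng.random((4, 16, 16))   # original MS cube
pan = rng.random((64, 64))     # original PAN image
ms_lr, pan_lr, reference = reduced_resolution_example(ms, pan)
print(ms_lr.shape, pan_lr.shape, reference.shape)  # (4, 4, 4) (16, 16) (4, 16, 16)
```

After degradation the PAN image has the same grid as the original MS, so full-reference metrics can be computed even though no true high-resolution MS image exists.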
Yao et al. [113] propose a pixel-level regression for encoding the relationship between the pixel values in the low- and high-resolution observations. To this end, they employ a CNN inspired by the U-Net architecture [35], adapted to the problem at hand and trained with image patches. The proposed method obtains improved performance with respect to localization accuracy as well as feature semantic levels, compared to other DNN-based methods. At the same time, the proposed approach outperforms other sophisticated pan-sharpening methods on two different remote sensing datasets across various quality indices, producing sharpened results that are visually better both in terms of spatial details and of spectral characteristics.
Increasing the spatial resolution of MS observations with PAN imagery using CNNs is also explored in [114], where a novel loss function penalizing the discrepancy in the spectral domain is proposed. The proposed scheme employs two networks: the first, called the ‘Fusion Network’, introduces high spatial resolution information to the upsampled MS images, and the second, the ‘Spectral compensation’ network, tries to minimize spectral distortions. Validation on observations from Pleiades, WorldView-2 and GeoEye-1 demonstrates that the proposed scheme achieves superior performance compared to [76] in almost all cases.
In the recently proposed method by Guo et al. [115], the authors use recent CNN design innovations, including single and multiscale dilated convolutions, weight regularization (weight decay) and residual connections, for pan-sharpening WorldView-2 and IKONOS imagery. Compared to state-of-the-art methods, including [76], the proposed method achieves both higher quality (an improvement of more than 0.5 in SAM compared to competing CNN approaches) and lower computational requirements for inference.
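The SAM (spectral angle mapper) metric quoted above measures the angle between the spectral vectors of the reference and the estimated image at each pixel, averaged over the image; lower is better. The sketch below is one common formulation. Reporting the angle in degrees and the small `eps` guard against division by zero are assumptions of this example, and the papers cited may use a different convention.

```python
import numpy as np

def spectral_angle_mapper(reference, estimate, eps=1e-12):
    """Mean spectral angle, in degrees, between two (bands, H, W) cubes."""
    if reference.shape != estimate.shape:
        raise ValueError("cubes must have the same shape")
    ref = reference.reshape(reference.shape[0], -1)  # (bands, pixels)
    est = estimate.reshape(estimate.shape[0], -1)
    dot = (ref * est).sum(axis=0)
    norms = np.linalg.norm(ref, axis=0) * np.linalg.norm(est, axis=0)
    # Clip to the valid arccos domain to absorb rounding errors.
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

rng = np.random.default_rng(0)
reference = rng.random((4, 8, 8)) + 0.1
scaled = 2.0 * reference            # same spectra, different brightness
noisy = reference + 0.05 * rng.standard_normal(reference.shape)
sam_scaled = spectral_angle_mapper(reference, scaled)
sam_noisy = spectral_angle_mapper(reference, noisy)
```

Because the angle ignores the length of the spectral vectors, `sam_scaled` is essentially zero even though the pixel values differ, which is why SAM is usually reported alongside metrics that are sensitive to intensity.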
Building on the GAN framework, the authors in [116] propose a two-stream CNN architecture (one stream for the PAN images and one for the MS images) as the generator of high-quality pan-sharpened images, accompanied by a fully convolutional discriminator. In this way, the fusion of images is performed at the feature level (within the generator) rather than at the pixel level, which reduces spectral distortion. The approach is tested on two different datasets (namely QuickBird and GaoFen-1), where it outperforms conventional and deep learning-based [76] pan-sharpening methods, even when only the generator component is used. A similar idea is also explored in [117], where additional constraints related to spectral features are imposed.