1. Introduction
Due to the increasing number of remote sensing satellites and to renewed data sharing policies, e.g., the European Space Agency (ESA) Copernicus program, the remote sensing community calls for new data fusion techniques for such diverse applications as cross-sensor [1,2,3,4], cross-resolution [5,6,7,8] or cross-temporal [4,9,10] fusion, serving analysis, information extraction or synthesis tasks. In this work, we target the pansharpening of remotely sensed images, which amounts to the fusion of a single high-resolution panchromatic (PAN) band with a set of low-resolution multispectral (MS) bands to provide a high-resolution MS image.
A recent survey [11] gathered the available solutions into four categories: component substitution (CS) [12,13,14,15], multiresolution analysis (MRA) [16,17,18,19], variational optimization (VO) [20,21,22,23], and machine/deep learning (ML) [24,25,26,27,28]. In the CS approach, the multispectral image is transformed into a suitable domain, one of its components is replaced by the spatially rich PAN, and the image is transformed back into the original domain. For example, in the simple case where only three spectral bands are involved, the Intensity-Hue-Saturation (IHS) transform can be used for this purpose; the same method has been straightforwardly generalized to a larger number of bands in [13] (a minimal sketch of this generalized IHS idea is given at the end of this paragraph). Other examples of this approach, to mention a few, are whitening [12], Brovey [14] and the Gram–Schmidt decomposition [15]. In the MRA approach, instead, pansharpening is addressed by resorting to a multiresolution decomposition, such as decimated or undecimated wavelet transforms [16,18] and Laplacian pyramids [17,19], for proper extraction of the detail component to be injected into the resized multispectral component. VO approaches leverage suitable acquisition or representation models to define a target function to optimize. This can involve the degradation filters mapping high-resolution to low-resolution images [22], sparse representations of the injected details [29], probabilistic models [21] and low-rank PAN-MS representations [23]. Needless to say, the paradigm shift from model-based to ML approaches registered in the last decade has also heavily impacted such diverse remote sensing image processing problems as classification, detection, denoising and data fusion. In particular, the first pansharpening convolutional neural network (PNN) was introduced by Masi et al. (2016) [24], and it was rapidly followed by many other works [25,27,28,30,31,32].
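As anticipated, the following minimal sketch illustrates the generalized IHS idea behind [13]: a crude intensity component (here, a plain unweighted band average) is compared with the PAN, and the resulting detail is injected into every band. The function name, the array layout and the unweighted average are illustrative assumptions; practical CS methods typically use sensor-dependent band weights and histogram-match the PAN to the intensity first.

```python
import numpy as np

def gihs_pansharpen(ms_up: np.ndarray, pan: np.ndarray) -> np.ndarray:
    """Generalized IHS component substitution (illustrative sketch).

    ms_up : (H, W, B) multispectral image, already interpolated to the PAN grid
    pan   : (H, W)    panchromatic band
    """
    intensity = ms_up.mean(axis=-1, keepdims=True)  # crude intensity component
    detail = pan[..., np.newaxis] - intensity       # spatial detail carried by the PAN
    return ms_up + detail                           # same detail injected into every band
```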
It seems safe to say that deep learning is currently the most popular approach for pansharpening. Nonetheless, it suffers from a major problem: the lack of ground truth data for supervised training. Indeed, multiresolution sensors can only provide the original MS-PAN data, each degraded either in space or in spectrum, never their high-resolution counterparts, which remain to be estimated. The solution to this problem introduced in [24], and still adopted by many others, consists in a resolution shift. The resolution of the PAN-MS data is properly downgraded by a factor equal to the PAN-MS resolution ratio in order to obtain input data whose ground truth (GT) is given by the original MS. Any network can therefore be trained in a fully supervised manner, although in a lower-resolution domain, and then be used on full-resolution images at inference time. The resolution downgrade paradigm is not new, as it stems from Wald's protocol [33], a procedure employed in the context of pansharpening quality assessment, and it presents two main drawbacks:
- i. It requires knowledge of the point spread function (also referred to as the sensor Modulation Transfer Function, MTF, in the pansharpening context), which characterizes the imaging system and must be applied before decimation to obtain the reduced-resolution dataset (a minimal sketch of this downgrade step follows the list);
- ii. It relies on a sort of scale-invariance assumption: a method optimized at reduced resolution is expected to work equally well at full resolution.
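The sketch below illustrates the downgrade step at the core of this protocol under the common Gaussian approximation of the MTF; the function name, the resolution ratio and the MTF gain at Nyquist (`mtf_gain`) are illustrative assumptions, since the actual filters are sensor- and band-dependent and this does not reproduce the exact pipeline of any specific method.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def wald_downgrade(ms: np.ndarray, pan: np.ndarray, ratio: int = 4, mtf_gain: float = 0.3):
    """Build a reduced-resolution training pair; the original MS acts as GT.

    ms  : (H, W, B)           original multispectral image
    pan : (H*ratio, W*ratio)  original panchromatic band
    """
    # Gaussian stand-in for the sensor MTF: std chosen so that the filter
    # response equals `mtf_gain` at the Nyquist frequency of the decimated grid.
    sigma = ratio * np.sqrt(-2.0 * np.log(mtf_gain)) / np.pi
    ms_lr = np.stack(
        [gaussian_filter(ms[..., b], sigma)[::ratio, ::ratio] for b in range(ms.shape[-1])],
        axis=-1,
    )
    pan_lr = gaussian_filter(pan, sigma)[::ratio, ::ratio]
    return ms_lr, pan_lr, ms  # the original MS serves as ground truth
```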
In particular, the latter limitation has recently motivated several studies aimed at circumventing the resolution downgrade [31,32,34,35]. These approaches resort to losses that do not require any GT, being oriented to consistency rather than to synthesis assessment. During training, the full-resolution samples feed the network, whose output is then compared to the two input components, MS and PAN, once suitably reprojected into their respective domains. The way such reprojections are realized, in combination with the measurement employed, i.e., the consistency check, has been the object of intense research since Wald's seminal paper [33], and it is still an open problem. In addition, a critical issue is also represented by the lack of publicly available datasets that are sufficiently large and representative to ensure the generality of the trained networks. A solution to this, based on the target-adaptivity principle, was proposed in [36] and later adopted in [34,35] too. On the downside, target-adaptive models pay a computational overhead at inference time, which increases when operating at full resolution, as occurs in [35].
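For illustration, the sketch below shows one possible form of a GT-free spectral consistency term: the network output is brought back to the MS domain and compared with the original MS. The average-pooling reprojection, the L1 distance and the function name are simplifying assumptions (actual methods such as [35] rely on MTF-matched filtering); a companion spatial term based on the comparison with the PAN is sketched in Section 4.

```python
import torch
import torch.nn.functional as F

def spectral_consistency(fused: torch.Tensor, ms: torch.Tensor, ratio: int = 4) -> torch.Tensor:
    """GT-free spectral consistency: reproject the fused image to the MS scale
    and compare it with the original MS.

    fused : (N, B, H, W)               pansharpened output at PAN resolution
    ms    : (N, B, H//ratio, W//ratio) original multispectral input
    """
    # Simple low-pass + decimation stand-in for the MTF-based reprojection
    fused_lr = F.avg_pool2d(fused, kernel_size=ratio, stride=ratio)
    return F.l1_loss(fused_lr, ms)
```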
Motivated by the above considerations, and following the research line drawn in [35], which combines full-resolution training and target-adaptivity, in this work we introduce a new target-adaptive scheme that reduces the computational overhead while preserving the accuracy of the pansharpened products. Experiments carried out on GeoEye-1, WorldView-2 and WorldView-3 images demonstrate the effectiveness of the proposed solution, which achieves computational gains of about one order of magnitude (∼10 times faster on average) for fixed accuracy levels.
In Section 2, we describe the related work and the proposed solution. Then, we provide experimental results in Section 3 and a related discussion in Section 4, before drawing conclusions in Section 5.
3. Results
To prove the effectiveness of the proposed adaptation schedule, we carried out experiments on 2048 × 2048 WorldView-3, WorldView-2 and GeoEye-1 images, parts of larger tiles covering the cities of Adelaide (courtesy of DigitalGlobe), Washington (courtesy of DigitalGlobe) and Genoa (courtesy of DigitalGlobe, provided by European Space Imaging), respectively. The image size (a power of 2) simplifies the analysis of the proposed solution but is by no means a limitation for the validation of the basic idea. For each of the three cities/sensors, we had four images at our disposal: three for validation and one for testing, as summarized in Table 2. An RGB version of the MS component of each test image is shown in Figure 4, Figure 5 and Figure 6, respectively. Overlaid on the images are the crops involved in the several adaptation phases, as well as some crops (A, B, C) that will be recalled later for the visual inspection of the results.
Table 3 summarizes the distribution of the number of iterations planned for each crop size and the corresponding computational time, per iteration and per phase. As can be seen, the adaptation time for the proposed solution (consider Fast Z-PNN on WV-∗ as a reference) was about 12 s, against 40 or 100 s for the baseline scheme when using 100 (default choice) or 256 iterations, respectively. Moreover, it is worth noticing that most of the time (6.42 s) was spent on the last iterations, run on the full image. Similar considerations apply to GeoEye-1, as well as to the other two models. In particular, notice that for Z-DRPNN all time figures scale up, since this model is heavier (it has more parameters) than Z-PNN and Z-PanNet.
A closer inspection of Table 3 reveals that the time per iteration tended to increase by a factor of about 4 when moving from one crop size to the next, at least for the larger crops. In fact, Fast Z-PNN obtained the following time multipliers on WV-∗ when moving from each crop size to the next: 1.17, 3.14, 3.51 and 3.84. This should not be surprising, since the crop area quadruples when its linear dimensions double, and because the computational time on parallel computing units, such as GPUs, does not always scale linearly with the image size, particularly when the input images are either very small or so large as to cause memory swaps. Assuming a linear regime, where the iteration cost grows linearly with the image area, and considering that the number of iterations halves from one phase to the next, the time consumption per phase roughly doubles from one phase to the next. Asymptotically, each new phase takes a time comparable to the time accumulated by all previous phases combined. Consequently, by skipping the last phase, one would save approximately half of the computational burden, at the price of some accuracy.
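To make the argument explicit, denote by $c_k$ the time spent in the $k$-th adaptation phase (notation introduced here only for illustration). With the crop area growing by a factor of 4 and the iteration count halving at each phase,

$$ c_{k+1} = \tfrac{4}{2}\,c_k = 2\,c_k \qquad\Longrightarrow\qquad \sum_{k=1}^{K} c_k = (2^{K}-1)\,c_1 \approx 2\,c_K , $$

so the final full-image phase alone accounts for roughly half of the total adaptation time, which is precisely the share saved by skipping it.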
Based on the above considerations, we also tested a lighter configuration of the proposed method, referred to as “Faster”, in which the last tuning phase is skipped through an early stop, as indicated in Table 1.
The experimental evaluation was split into two parts. On the one hand, the proposed Fast and Faster variants were compared to the corresponding baselines (Z-∗) directly in terms of the loss achieved during the target-adaptation phase. This was done separately for the spectral and spatial loss components, using the validation images only. The results of this analysis for the whole validation dataset are summarized in Table 4 and Table 5, while Figure 7 displays the loss curves for a sample image. Moreover, Figure 8 compares the different target-adaptive schemes for Z-PNN using some sample visual results.
On the other hand, the test images were used for a comparative quality assessment in terms of both numerical pansharpening indexes and subjective visual inspection. For a more robust quality evaluation, we resorted to the pansharpening benchmark toolbox [11], which provides the implementation of several state-of-the-art methods and several quality indexes, e.g., the spectral and spatial consistency indexes. We integrated the benchmark with the Machine Learning toolbox proposed in [39], which provides additional CNN-based methods. Furthermore, the evaluation was carried out using the additional indexes recently proposed in [40], including R-SAM and R-ERGAS. All comparative methods are summarized in Table 6. The results are gathered in Table 7, Table 8 and Table 9, whereas sample visual results are given in Figure 9, Figure 10 and Figure 11. A deeper discussion of these results is left to Section 4.
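For reference, the sketch below recalls the classical SAM measure (mean spectral angle between corresponding pixel vectors, in degrees), on which the R-SAM index of [40] builds within a GT-free consistency protocol; the function name and array layout are illustrative and do not reproduce the toolbox implementation.

```python
import numpy as np

def sam_degrees(reference: np.ndarray, test: np.ndarray, eps: float = 1e-12) -> float:
    """Mean Spectral Angle Mapper (degrees) between two (H, W, B) images."""
    ref = reference.reshape(-1, reference.shape[-1]).astype(np.float64)
    tst = test.reshape(-1, test.shape[-1]).astype(np.float64)
    # Cosine of the angle between corresponding spectral vectors, pixel by pixel
    cos = np.sum(ref * tst, axis=1) / (
        np.linalg.norm(ref, axis=1) * np.linalg.norm(tst, axis=1) + eps
    )
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean())
```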
4. Discussion
Let us now analyze the results in depth, starting from a comparison with the baseline models Z-∗.
In Figure 7, we show the loss decay during the adaptation phase, separately for the spectral (top) and spatial (bottom) terms, as a function of the running time rather than of the iteration count. This experiment refers to a WorldView-3 image processed by Z-PNN models, but similar behaviors have been observed for the other images, regardless of the employed base model. The loss terms for the baseline (Z-PNN) and the proposed Fast Z-PNN are plotted in green and blue, respectively. Although the loss is a per-pixel loss (a spatial average), so that these curves are dimensionally consistent, the latter refers to the current crop while the former is an average over the whole image. For this reason, for the proposed (Fast/Faster) solution, we also computed the value of the loss on the whole image at each iteration (red dashed line). Surprisingly, the “global” loss tightly follows the “local” (crop-wise) loss, showing an even more regular decay thanks to the wider averaging support. These plots clearly show the considerable computational gain achievable for any fixed target loss level. It is worth noticing, for example, that the loss levels achieved by the proposed Fast scheme in 12 s (256 iterations) are reached by the baseline approach in about 90 s (∼200 iterations on the full image). Besides, the Faster version reaches almost the same loss levels as the Fast version in about 5.7 s.
In Table 4, for each validation image, we quantify the computational gains achieved by Faster and Fast Z-PNN in terms of time consumption. For each image, we report, separately for the spectral (top table) and spatial (bottom table) loss components, the loss value without adaptation and the values achieved by the Faster and Fast adaptation schemes (these values refer to the loss assessed on the whole image, not to the one computed on the crops used for gradient descent), whose run times do not depend on the specific image (they drop when working with GeoEye-1 instead of WV-∗ because of the smaller number of spectral bands). We then report the times needed by the baseline Z-PNN to reach the Faster and Fast loss levels, respectively, which can fairly be read as the times needed by Z-PNN to achieve the same spectral (top table) or spatial (bottom table) target levels as its faster versions. Consequently, the ratios between these baseline times and the Faster and Fast adaptation times represent the computational gains of the proposed solution. Similar results, not shown for brevity, have been registered for Z-PanNet and Z-DRPNN. It is worth noticing that, whereas for the spectral loss the gain is always (well) larger than 1, for the spatial component there are cases where the loss is already quite low without adaptation (see Table 4); in these cases, only the spectral loss decays, while the spatial one either remains constant or even grows slightly (see bold numbers in Table 4), so the corresponding gains cannot be computed. It actually makes sense there to focus on the spectral behavior only, as no adaptation is needed on the spatial side.
In general, from all the experiments that we carried out, it emerges that the adaptation is needed mostly to reduce the spectral distortion rather than the spatial one. This is a peculiar feature of the Z-∗ framework, which leverages a spatial consistency loss based on correlation (Equation (4)) that shows quite a robust behavior (a minimal sketch of such a correlation-based term is given after this paragraph). On this basis, it makes sense to focus on the spectral loss only when checking the quality alignment between Z-∗ and the proposed variants. Therefore, we can look at Table 5, which provides the gains for all models, averaged by sensor, but limited to the spectral part. Overall, it can be observed that the Fast models obtain gains ranging from 4.03 to 15.6 (9.6 on average), whereas the Faster models provide gains between 6.33 and 26.8 (17.9 on average), at the price of a small increase in the loss components (compare the loss levels in Table 4).
Let us now move to the analysis of the results obtained on the test images, for a comparison with state-of-the-art solutions. Starting from the numerical results gathered in Table 7 (WV-3), Table 8 (WV-2) and Table 9 (GE-1), we can observe that the most important achievement is that the proposed Faster and Fast solutions are exactly where they are expected to be, i.e., almost always between Z-∗ (100 it.) and Z-∗ (256 it.), coherently with the loss levels shown in Figure 7 and Table 4, with only a few exceptions. In some cases, the proposals even outperform the baseline (e.g., Fast Z-PNN against Z-PNN on GeoEye-1, Table 9). Moreover, it is worth noticing that Faster Z-∗ tightly follows Fast Z-∗ and Z-∗ (256 it.), confirming our initial guess that the need for tuning is mostly due to geometric or atmospheric misalignments between the training and testing datasets rather than to ground-content mismatches.
Concerning the overall comparison, it must be underlined that all indexes are computed on the full-resolution image without any ground truth. They assess consistency rather than synthesis properties, and each reflects some arbitrary assumption. Two of them deal with spatial consistency, while the remaining ones relate to spectral consistency. While the latter group shows a good level of agreement, the former looks much less correlated. For a deeper discussion of this issue, which is a little out of scope here, the reader is referred to [35]. In addition, the goal of the present contribution is efficiency rather than accuracy; therefore, we leave these results to the reader without further discussion.
Besides the numerical assessment, visual inspection of the results is a fundamental complementary check to gain insight into the behavior of the compared methods. Let us first analyze the impact of the tuning phase by comparing the pretrained model Z-∗ (0 it.) with the proposed target-adaptive solutions in their Faster and Fast versions. Figure 8 shows a few crops from the WV-3 Adelaide image (Figure 4), together with the related pansharpened images obtained with the Z-PNN variants. Remarkable differences can be noticed between the pretrained model and the two target-adaptive options. The reader may notice the visible artifacts introduced by the pretrained model, occurring, for example, in the pool of crop A or on several building roofs (spotted artifacts), which are removed by the proposed solutions thanks to the tuning. On the other hand, there is no noticeable difference between the proposed Faster and Fast configurations. It is also worth noticing that this alignment between the Fast and Faster configurations holds indistinguishably on crop A (always involved in the tuning) and on crops B and C, which are not involved in the tuning process of the Faster version (see Figure 4). Similar considerations emerge from the experiments, not shown for brevity, carried out on the other datasets and/or using different baseline models.
Figure 9, Figure 10 and Figure 11 show, again on some selected crops, the pansharpening results obtained by the proposed solutions and by the best performing among all the comparative methods listed in Table 6, for the WorldView-3, WorldView-2 and GeoEye-1 images, respectively. In particular, these results confirm the most relevant observation that Fast Z-PNN, Faster Z-PNN and the baseline Z-PNN are nearly indistinguishable, in line with the numerical results discussed above, further underlining that the registered computational gain has been achieved without sacrificing accuracy. Regarding the comparison with the other methods, from the visual point of view, the proposed solutions are aligned with the best ones, showing no appreciable spectral or spatial distortion and a very high contrast level.
Finally, in order to further prove the robustness of the proposed approach, we ran a cross-image experiment. Taking two different WV-3 sample images whose content is sufficiently different (see
Figure 12), we ran Fast Z-PNN on the first one (a), which plays as a target image. On the same image, we also tested Z-PNN without adaptation (0 it.) and the model adapted on image (b) using the Fast configuration. The numerical results, shown in (c), provide further confirmation that the actual content of the image has a minor impact on pansharpening, in the tuning phase, with respect to acquisition geometry and atmospheric conditions.