1. Introduction
Image pansharpening is the process of merging two observations of the same scene, a low-resolution multispectral (MS) component and a high-resolution panchromatic (PAN) component, to generate a new multispectral image that displays both the rich spectral content of the MS and the high resolution of the PAN. Following the taxonomy proposed in [1], pansharpening methods can be roughly grouped into four main categories: component substitution (CS) [2], multiresolution analysis (MRA) [3], variational optimization (VO) [4,5], and machine/deep learning (ML) [6,7].
In the CS approach, the multispectral image is transformed into a suitable domain, where one of its components is replaced with the PAN. In the particular case of three spectral bands, the Intensity–Hue–Saturation (IHS) transform is an option, as its intensity component can be replaced with the PAN band [8]. This method has been generalized in [9] (GIHS) to handle a larger number of bands. Other useful transforms for implementing a CS solution include principal component analysis [10], the Brovey transform [11], and the Gram–Schmidt (GS) decomposition [12]. More recently, adaptive CS methods have also been proposed, such as the advanced versions of GIHS and GS [13], the partial replacement CS method (PRACS) [14], and the band-dependent spatial detail (BDSD) injection method and its variants [15,16,17].
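The substitution scheme just described can be sketched in a few lines (a minimal sketch, assuming the generalized intensity is a plain band average; practical implementations use sensor-dependent band weights and histogram-match the PAN to the intensity):

```python
import numpy as np

def gihs_pansharpen(ms_up, pan):
    """Simplified GIHS fusion: the spatial detail, i.e., the difference
    between the PAN and a crude intensity (plain band average) of the
    upsampled MS, is injected equally into every spectral band.

    ms_up: (H, W, B) MS image, already interpolated to the PAN grid
    pan:   (H, W) panchromatic band
    """
    intensity = ms_up.mean(axis=2)      # generalized intensity component
    detail = pan - intensity            # spatial detail to inject
    return ms_up + detail[..., None]    # same injection for all bands
```

By construction, the band average of the fused product coincides with the PAN, which is exactly the substitution at the core of CS methods.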
In MRA approaches [3], the pansharpening task is addressed from the perspective of a pyramidal decomposition that separates low-frequency content from detail components. The high-frequency spatial details are extracted by means of a multiresolution decomposition, such as decimated or undecimated wavelet transforms [3,18,19,20], Laplacian pyramids [21,22,23,24,25], or other nonseparable transforms, e.g., contourlets [26]. The extracted details are then properly injected into the upscaled MS component.
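A minimal MRA-style counterpart can be sketched as follows (a simplified sketch: a separable binomial filter stands in for the actual multiresolution or MTF-matched decomposition, and details are injected additively without any gain model):

```python
import numpy as np

def lowpass(img):
    """Separable binomial low-pass filter (a crude stand-in for a wavelet
    or MTF-matched decomposition)."""
    k = np.array([1., 4., 6., 4., 1.]) / 16.
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 0, img)
    return np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, out)

def mra_pansharpen(ms_up, pan):
    """Additive MRA-style fusion: the high-frequency content of the PAN,
    obtained by removing its low-pass component, is injected into each
    band of the upscaled MS."""
    details = pan - lowpass(pan)        # high-frequency PAN content
    return ms_up + details[..., None]   # inject into every band
```

Unlike the CS scheme, here the spectral content of the MS is touched only by the injected high-pass residual, which is why MRA methods are usually regarded as more spectrally faithful.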
A further set of methods addresses the pansharpening problem through the variational optimization of suitable models of the fused image. In [4], the optimization target involves the degradation filters mapping high-resolution to low-resolution images, while [27] leverages sparse representations for detail injection. Palsson et al. proposed several methods of this class: a total-variation-regularized least squares formulation is provided in [28]; the same research team framed pansharpening as a maximum a posteriori problem in [29] and, more recently, explored the use of low-rank representations of the joint PAN-MS [5]. Other methods do not fit the above categories and can be roughly classified as statistical [30,31,32,33,34], dictionary-based [35,36,37,38,39,40], or matrix factorization approaches [41,42,43]. The reader is referred to [1,44] for a more comprehensive review.
In recent years, a paradigm shift from model-based to data-driven approaches has revolutionized all fields of image processing, from computer vision [45,46,47,48,49] to remote sensing [7,50,51,52]. In pansharpening, the first method based on convolutional neural networks (CNN) was proposed by Masi et al. in 2016 [6], and many more followed in the span of a few years [7,53,54,55,56,57,58,59,60,61,62,63,64]. It seems safe to say that deep learning is currently the most popular approach to pansharpening. Nonetheless, it suffers from a major problem: the lack of ground truth data for supervised training. In fact, multi-resolution sensors can only provide the original PAN-MS data, downgraded in space or spectrum, and never their high-resolution versions, which remain to be estimated.
Based on this brief and certainly not exhaustive overview, it appears that pansharpening is a very active research field, with many new methods proposed every year. Reliable quality assessment procedures are of critical importance for correctly advancing the state of the art, and an incorrect evaluation paradigm may negatively impact the design or tuning of any new solution. Unfortunately, by the very nature of pansharpening, no ground truth (GT) data are available to perform a reference-based assessment. As a consequence, two kinds of quality assessments are usually employed:
- (i) reference-based reduced-resolution assessment (synthesis check);
- (ii) no-reference full-resolution assessment (consistency check).
Lacking GTs, the synthesis capabilities of any pansharpening method can only be assessed on “synthetic” data. In particular, Wald’s protocol [65] suggests taking the real PAN-MS data and applying a proper resolution downgrade process for scale reduction. The downscaled PAN-MS pair is then pansharpened, and the original MS component serves as the GT for the synthesized image. How the resolution downgrade should be performed has been the object of intense research over the last two decades and has a non-negligible impact on the reliability of the consequent quality assessment. However, no matter how a GT is obtained, there exist plenty of reference-based image quality indicators. The spectral angle mapper (SAM), introduced in [66], assesses the balance among the spectral bands. The spatial correlation coefficient [67], instead, computes the correlation coefficient across the high-pass-filtered bands of the pansharpened image and of the GT, and is therefore oriented to the assessment of spatial quality. The Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS) [68] generalizes the root mean squared error, introducing band-wise correction weights. The universal image quality index [69] compares image statistics to take into account local correlation, intensity, and contrast. Based on this general-purpose image quality index, domain-specific variants suitably adapted to pansharpening have been proposed in [70,71]. In addition to these indexes, other popular general-purpose options are sometimes considered for pansharpening, e.g., the peak signal-to-noise ratio and the structural similarity (SSIM) index.
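For concreteness, the two classical reference-based measures just mentioned, SAM and ERGAS, can be computed as follows (a sketch using common conventions: SAM averaged over all pixels and reported in degrees, ERGAS parameterized by the PAN/MS resolution ratio):

```python
import numpy as np

def sam_degrees(ref, est, eps=1e-12):
    """Spectral Angle Mapper: mean angle (degrees) between the spectral
    vectors of the reference (GT) and of the fused image, both (H, W, B)."""
    dot = (ref * est).sum(axis=2)
    norms = np.linalg.norm(ref, axis=2) * np.linalg.norm(est, axis=2)
    ang = np.arccos(np.clip(dot / (norms + eps), -1.0, 1.0))
    return float(np.degrees(ang.mean()))

def ergas(ref, est, ratio=4):
    """ERGAS: band-wise RMSE normalized by the band mean, averaged over
    bands and scaled by the resolution ratio (4 for WorldView-2/3)."""
    mse = ((ref - est) ** 2).mean(axis=(0, 1))   # per-band MSE
    mu2 = ref.mean(axis=(0, 1)) ** 2             # squared band means
    return float(100.0 / ratio * np.sqrt((mse / mu2).mean()))
```

Note how SAM is insensitive to a per-pixel rescaling of the spectral vector, whereas ERGAS penalizes any radiometric deviation: the two indexes indeed encode different notions of quality.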
Furthermore, spectral and spatial consistency indexes suited to full-resolution images, which do not require any GT, have also been developed. A first spectral distortion index, proposed by Zhou et al. [67], compares a rescaled version of the pansharpened image with the MS image. Other spectral distortion indexes follow a similar procedure [72,73]. In [74], the Quality for Low Resolution (QLR) index based on SSIM was proposed; it was later updated by exchanging SSIM with the composite image quality measure CMSC, based on means, standard deviations, and the correlation coefficient [75]. Basically, most of the known spectral distortion indexes follow this protocol: degradation of the pansharpened image and comparison with the MS image. Their different behaviors stem from the degradation model and the error measure. For example, [74] is based on structural similarity indexes, while [75] leverages a composite image quality measure based on statistics such as mean, variance, and correlation coefficient. On the other hand, spatial consistency can be checked through the cross-scale invariance of some statistics [72]. A similar approach is proposed in [73], involving only high-frequency components. In [74], a Quality for High Resolution (QHR) index is proposed, which computes the SSIM index between the panchromatic band and a projection of the pansharpened image in the PAN domain. A variant of this approach with a different error term was proposed in [75]. Another similar approach is proposed in [76], where the coefficient of determination is used to compare the PAN image with its projection from the pansharpened image. It is also worth mentioning other no-reference quality indexes, such as the Natural Image Quality Evaluator (NIQE) [77,78], which seeks to assess the image quality by itself, rather than checking consistency.
Finally, the spectral and spatial consistency indexes are often combined, e.g., through geometric means, to provide a single hybrid consistency index representing the overall no-reference quality assessment of the fused images [72,73,74,75,78,79,80,81,82,83].
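The widely used QNR follows exactly this scheme, combining a spectral distortion D_λ and a spatial distortion D_S into a single product-of-qualities score (a sketch; the exponents α and β are commonly both set to 1):

```python
def qnr(d_lambda, d_s, alpha=1.0, beta=1.0):
    """QNR-style hybrid index: product of the spectral and spatial
    'qualities' (1 - distortion), each raised to a tunable exponent.
    Both distortions lie in [0, 1]; 1 is the ideal overall score."""
    return (1.0 - d_lambda) ** alpha * (1.0 - d_s) ** beta
```

Such a product form implies that a fused image can reach a high overall score only if both distortions are simultaneously small.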
Both reference-based and no-reference indexes present inherent limitations, which are analyzed in the next section.
In this work, we propose new full-resolution quality indexes that overcome some of these problems and provide more reliable guidance for the development of ever more accurate pansharpening methods. To this purpose, the developed indexes have been made available to the community through a web repository at
https://github.com/matciotola/fr-pansh-eval-tool/ (accessed on 5 April 2022).
The remainder of the paper is organized as follows.
Section 2 provides a brief critical survey on pansharpening assessment.
Section 3 introduces the proposed approach.
Section 4 discusses the experimental results and, finally,
Section 5 draws conclusions.
4. Experimental Results and Discussion
In this section, we present several experimental analyses. Preliminarily, a summary of the employed datasets and of the involved methods is provided. The first experiment focuses on the dependence of the spectral consistency indexes on the alignment of the MS bands with the PAN. Next, we move to the reduced-resolution domain to cross-check reference-based and no-reference indexes, thanks to the availability of the ground truth. An experimental analysis focused on the assessment of spatial consistency follows, before closing with an overall comparison.
4.1. Datasets and Methods
The experimental validation relies on 25 methods provided by the benchmark toolbox [1]: 24 belonging to the four main categories recalled in Section 1, namely CS (8), MRA (9), VO (3), and ML (4), plus an ideal interpolator (EXP). The dataset is composed of two WorldView-2 (WV2) and two WorldView-3 (WV3) large images, courtesy sample products of DigitalGlobe©. In total, twenty 512 × 512 tiles were extracted from the WV2 images (Washington and Stockholm) and twenty from the WV3 images (Adelaide and Mexico City) for the experiments at full resolution. Likewise, 20 + 20 tiles of 2048 × 2048 pixels were extracted and downscaled to 512 × 512 for the experiments at reduced resolution.
Table 1 summarizes the main spectral and spatial characteristics of the WV2 and WV3 sensors.
4.2. Spectral Distortion Dependence on PAN-MS Misalignment
The impact of band misregistration on the quality of the fused products has been already recognized in the past [
93,
94,
95]. Here, we propose an ad hoc experimental analysis to show the robustness of the proposed reprojection indexes to the data misregistration. The starting point is a WV3 dataset composed of ten 2048 × 2048 tiles extracted from a larger image of Adelaide (DigitalGlobe
© sample product). This dataset presents band misalignment that we have corrected to produce a companion aligned dataset. In
Table 2, we summarize the average spectral distortion scores for each method for both datasets. For a convenient reading of these numbers, in
Figure 2, we show the impact of data misregistration on D_λ and D_λ^(K) in differential terms.
Each bar indicates the difference between the values of the given indicator computed on the aligned and misaligned datasets, respectively. As can be clearly noticed, traditional CS methods such as BT-H, GS, GSA, C-GSA, and PRACS, which, by construction, provide fused images strongly anchored to the PAN geometry, show a considerable spectral loss according to D_λ or D_λ^(K). These results are not aligned with our expectations, as these CS methods are expected to be more robust to misregistration. The reader can refer to [93] for a theoretical comparison between CS and MRA methods in the presence of misregistration, with the former category being superior to the latter.
In Figure 3, instead, we compare Khan’s index D_λ^(K) with its alignment-corrected variant. Again, the bars indicate their variations due to dataset misalignment.
By inspecting the figure, it can be noticed that misregistered data generally cause an increase in the alignment-corrected variant, except for some CS methods, such as BT-H, GS, GSA, C-GSA, and PRACS, which tend to preserve the PAN geometry, thereby operating an intrinsic alignment. On the contrary, the other methods, notably those belonging to the MRA, VO, and ML categories, which are oriented to spectral preservation but do not operate any alignment, register an increase in spectral distortion when the metric takes the misalignment into account.
In conclusion, the proposed experiment suggests that Khan’s index with alignment correction, which is the complement of the proposed reprojection index R-Q2n, is more robust to misalignment or, at least, more consistent with the theoretical expectation [93] than the original Khan’s index D_λ^(K) or its unaligned variants.
4.3. Reference vs. No-Reference Index Cross-Checking in the Reduced-Resolution Space
In order to assess the consistency between no-reference and reference-based indicators, we have designed a set of experiments in the reduced-resolution space. Although no-reference indexes are conceived to work in the full-resolution framework, their use on (simulated) reduced-resolution datasets allows us to study their correlation with objective error measurements (reference-based indexes), thanks to the availability of GTs. In particular, for this experiment, we have resorted to our WV2 dataset composed of twenty 2048 × 2048 full-resolution images (for the sake of brevity, the analogous results obtained on WV3 images are not presented). These images come from two larger images of Washington (13) and Stockholm (7), respectively, courtesy samples of DigitalGlobe©. Each tile was resized to 512 × 512 pixels using the usual Wald downgrading protocol. This dataset was already well coregistered; we have therefore created a misregistered counterpart by a simple modification of the downgrading process: a 1-pixel shift (in both directions) was introduced in the decimation (after low-pass filtering) of the MS bands.
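The shifted decimation used to build the misregistered counterpart can be sketched as follows (the `lpf` argument is a hypothetical placeholder: any MTF-matched low-pass filter can be plugged in):

```python
import numpy as np

def downgrade_ms(ms, lpf, ratio=4, shift=0):
    """Wald-style downgrade of the MS bands: low-pass filtering followed by
    decimation. A nonzero `shift` offsets the decimation grid, so shift=1
    reproduces the 1-pixel shift (in both directions) used to build the
    misregistered dataset.

    ms:  (H, W, B) multispectral image
    lpf: callable applying the (MTF-matched) low-pass filter to `ms`
    """
    smoothed = lpf(ms)
    return smoothed[shift::ratio, shift::ratio, :]
```

The aligned and misaligned datasets then differ only in the `shift` value, so any score gap between them can be attributed to the misregistration alone.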
Then, the 25 pansharpening algorithms provided by the toolbox [1] were run on each reduced-resolution sample image, generating 500 results for each of the two datasets (registered and not), for which all the indexes of interest were computed. Eventually, we obtained hundreds of points in a multi-dimensional evaluation space, enabling plenty of analyses, with some dimensions corresponding to the reference-based indexes (SAM, ERGAS, Q2n, SSIM, and CMSC) and the remaining ones associated with the proposed indexes (R-SAM, R-ERGAS, R-Q2n, and D_ρ) and other state-of-the-art no-reference indexes (D_λ, Khan’s D_λ^(K) with and without alignment, QLR_SSIM, and QLR_CMSC). In particular, by construction, we expect to observe a good correlation between Khan’s index variants and Q2n. Therefore, we start from the scatter plots shown in Figure 4, which report such variants, in turn, vs. Q2n.
Both the “aligned” (top) and the “misaligned” (bottom) datasets are considered. For the aligned case, visual inspection shows that the alignment-corrected Khan index correlates better than the original D_λ^(K) with Q2n. Moreover, the two variants behave similarly to each other because, on aligned data, no actual alignment is operated. For both of them, the GTs (black dots), for which Q2n is always 1, obtain the ideal value (0) of spectral distortion. Only rarely does the alignment process (based on the maximization of the correlation) embedded in the corrected variant fail for some spectral band, giving rise to non-zero results for the GT even when the test image is aligned. Moreover, D_λ is not minimized by the GT, coherently with its definition (Equation (6)), based on the comparison between the smoothed GT and the upscaled MS. It is also worth noticing that the degree of correlation between Q2n and the different variants of spectral distortion grows when assessed by category (by colors), supporting the idea that no-reference indexes are more reliable for intra-class assessment [1]. Moving to the misaligned dataset (bottom scatters), a bias can be clearly recognized for both D_λ and D_λ^(K), but not for the alignment-corrected variant; it is revealed by the shift of the GT scores, which no longer achieve the ideal value. In general, the distortion scores in terms of D_λ and D_λ^(K) register a degradation, with CS methods (blue) much more penalized than the others. This last observation further supports the interpretation provided above for the results of Figure 2 and Figure 3. A similar behavior is also registered for other state-of-the-art indexes. For example, Figure 5 shows the score scatters in the QLR_SSIM-SSIM (left) and QLR_CMSC-CMSC (right) planes, for the registered (top) and misaligned (bottom) datasets.
Let us now set the registration problem aside, considering aligned datasets only and looking at the relationship between the spectral consistency indexes and the three most used reference-based indexes, i.e., SAM, ERGAS, and Q2n, with the help of Figure 6. From top to bottom, we can see how the different spectral consistency indexes under comparison agree with the three (column-wise distributed) reference-based indexes.
From the top-row scatters, it clearly appears that Khan’s index agrees with SAM and ERGAS much less than with Q2n. This is because it is based on the same Q2n index, but also because SAM and ERGAS encode a different concept of quality. Similar considerations apply to QLR_SSIM and QLR_CMSC, which, like Q2n, are based on the comparison of local statistics. For these reasons, we believe that, in addition to R-Q2n, it makes sense to also provide R-SAM and R-ERGAS for a more comprehensive evaluation of pansharpening methods. On the bottom row, we show the (SAM, R-SAM), (ERGAS, R-ERGAS), and (Q2n, R-Q2n) scatter plots, which all speak in favor of a good level of agreement between each objective index and its reprojected counterpart.
Besides the qualitative interpretation of the score scatters, we can quantify the level of agreement between the reference-based indexes and the compared no-reference indexes in terms of correlation coefficients. These are shown in the usual matrix form for the aligned (Table 3) and misaligned (Table 4) datasets, respectively. As expected, the reprojected indexes show a relatively high correlation with their non-reprojected counterparts, both for aligned and misaligned datasets. D_λ and D_λ^(K) also correlate well with Q2n, but only in the aligned case. Likewise, QLR_SSIM and QLR_CMSC correlate well with different reference-based indexes under ideal conditions but register a drop on misaligned data. It is also worth remarking that the GT scores were discarded so as not to penalize the competitors excessively on the misaligned dataset (see the GT score distribution in Figure 4 and Figure 5).
4.4. A Qualitative Assessment of the Proposed Spatial Distortion Index
While objective synthesis quality indexes such as Q2n and ERGAS account for both spectral and spatial inaccuracies, their reprojected versions clearly limit their scope to the spectral component, as they are computed on low-pass versions of the fused products, ignoring image “details”. It is therefore necessary to complement these indexes with some spatial quality indicator.
The assessment of the spatial quality of a pansharpened image in the full-resolution framework is very difficult and somewhat controversial, lacking a degradation model that describes the relationship between the full-resolution spatial–spectral data cube (fused image) and the panchromatic image. In fact, while the spectral degradation can be reasonably modeled through the sensor MTF, allowing for a spectral consistency check, the spatial degradation cannot be modeled through a simple global weighted band average [91]. Indeed, in addition to the obvious spatial dependency of the spectral mixing process that produces the PAN image, there is also a mismatch between the PAN spectral coverage and the MS coverage [91]. This means that there could be details “seen” by the PAN but not by the virtual full-resolution MS counterpart, and vice versa. This is the origin of the ill-posedness of the pansharpening problem, mostly residing in the spatial reconstruction, which makes the spatial consistency assessment subjective to some extent. On the basis of this observation, we provide here an interpretation of some sample results through visual inspection, comparing the proposed index D_ρ with the state-of-the-art spatial distortion index D_S. Of course, given the subjective nature of this kind of analysis, we will focus on clearly visible phenomena, leaving to the reader the final say based on their visual perception of more subtle patterns, if any. In particular, we propose two experiments: a “horizontal” comparison among the pansharpening toolbox methods, and a “vertical” comparison where a single machine learning method is optimized using the proposed D_ρ (varying the related scale parameter σ) jointly with a term controlling the spectral consistency.
Let us start with the first experiment.
Figure 7 shows some full-resolution pansharpening results for crops extracted from a single tile of the WV3 Adelaide image (for visualization purposes, hereinafter, all displayed crops extracted from multispectral images are obtained by combining the red, green, and blue channels (see Table 1)). The PAN component, used as a reference, is shown in the middle of the different groups of results. In the top half of the figure, the top four solutions according to D_ρ (left) and the top four according to D_S (right) are shown, with D_ρ and D_S computed on the whole tile. Similarly, in the bottom half, the top four results according to QHR and NIQE are shown with the related scores. It clearly appears that the images selected according to the D_ρ index ensure better agreement with the reference in terms of spatial layout and, in general, better quality, with sharp contours, accurate textures, and a lack of the annoying patterns present in some top-D_S and top-NIQE images. It is interesting to observe that, among the top four D_ρ results, the most convincing one upon visual inspection also corresponds to the relatively lowest D_S (AWLP). A certain degree of agreement between D_ρ and QHR is also worth noting. Similar phenomena are observed on all other tiles.
In the previous experiment, we set the scale parameter σ according to the theoretical motivations discussed above (Section 3.2). To gain insight into the effectiveness of this choice, we have designed an additional ad hoc experiment that provides pansharpening results with controlled spectral and spatial qualities, the latter quantified through D_ρ configured with different scale settings. This is achieved by leveraging a CNN model working in the target-adaptive modality [101] and using a combination of the desired consistency measures as a loss. In this context, the choice of the specific CNN model is not critical, as we work in the adaptive modality, running unsupervised tuning iterations on each test image until the loss terms reach the desired level (overfitting). In particular, the optimized loss (Equation (13)) combines a spectral term and a spatial term based on D_ρ, each driven toward a prefixed target level for the corresponding quality indicator. In practice, the network parameters are pushed to overfit the test image so that the loss terms reach the target qualities. These two threshold values could ideally be set to zero, but this would lead to extremely long tuning and might also generate instability because of the conflicting interaction between the two loss terms. We have therefore set both thresholds to quite low values, the spectral one with respect to the dynamic range of the pixel values and the spatial one according to our experiments. By doing so, each test requires a different number of tuning iterations; we have therefore oversized the number of iterations (experimentally set to 5000) to ensure convergence in all cases. This process is repeated for several choices of the scale parameter σ,
ranging from 2 to 64. In Figure 8, we show some crops from the WV3 dataset. These sample results reflect a general behavior observed in a wide range of experiments, that is, a relatively good response for intermediate values of σ. For smaller or larger scale values, noisy patterns arise, particularly noticeable on the roofs and roads of the selected samples. The above observations provide experimental validation of the choice of σ proposed in this work on the basis of the theoretical motivations discussed in Section 3.2.
The present experiment also gives us the opportunity to make some considerations about the computational load of the proposed indexes. In general, in the context of quality assessment, computation time is not a critical issue, as the evaluation is carried out offline. However, with the advent of deep learning, researchers have started to use such quality indexes as loss functions for training purposes. In this regard, it is worth distinguishing pixel-based indexes (e.g., SAM, ERGAS, ℓp-norms), which are relatively light to compute, from indexes that involve local statistics computed at a certain scale for each location (e.g., Q2n, SSIM, D_ρ). When using the latter as a loss, training may slow down. In the particular case of D_ρ, which has been included in the loss function to fine-tune the sample CNN employed in the experiment, we have actually registered a moderate impact on the training time. In fact, using a training batch composed of a single 2048 × 2048 image, the time consumption per iteration shifts from 1.27 s to 1.6 s without and with the additional D_ρ loss component (the second term in Equation (13)), on an NVIDIA P6000 GPU.
4.5. Comparative Results
To conclude, we present an overall comparison among the pansharpening methods provided by the toolbox [1], using the proposed quality indicators and supported by the visual inspection of some sample results. Figure 9 and Figure 10 gather the average numerical results obtained on our WV3 and WV2 datasets, respectively.
As expected, the numerical results clearly show a good level of agreement among the three reprojection error indexes, all essentially linked to the spectral consistency. However, some exceptions can be observed, particularly for the WV2 dataset (Figure 10), where some CS methods register a performance loss in terms of R-Q2n that is not observed on R-SAM and R-ERGAS. We also recognize good agreement with some literature findings, particularly on WV3 (Figure 9), such as the spectral accuracy gap between MRA and CS methods [44] or the competitiveness of some MRA, VO, and ML solutions. Moreover, for the ML solutions, it is worth noticing a performance gap when moving from one dataset (WV3) to the other (WV2). Such variability is not unusual for data-driven approaches, which can suffer from generalization limits when the training dataset is not sufficiently representative. In this particular case, it is likely that the WV2 test images are too different from the images used for training the involved ML methods.
To gain further insight into the effectiveness of the proposed indexes, let us look at some sample results for a complementary subjective analysis. In Figure 11, we show some clips from a single WV3 tile of the Adelaide dataset. The leftmost column gathers the PAN and MS input components on two consecutive rows. The corresponding pansharpening results of the best-performing solutions according to R-Q2n are then shown in decreasing order from left to right, next to the PAN. Next to the MS, we also show the “reprojection” error map, i.e., the difference between the input MS and the reprojection (low-pass filtering plus decimation) of the output. The bottom lines gather the R-Q2n and D_ρ scores. The spectral quality of the results therefore decreases from left to right, which can also be appreciated through the inspection of the reprojection error maps. Moreover, the best methods from the spatial perspective (D_ρ) follow a different ordering, partially reflecting a tradeoff between spectral and spatial features. Particularly interesting is the case of the TV method, which scores first in terms of R-Q2n (actually, it is the best according to R-SAM and R-ERGAS as well) but shows the worst D_ρ value, 0.215, corresponding to a 78.5% average correlation between the PAN and the pansharpened bands. The impact of this low correlation level is clearly visible in the pansharpening results of TV, which show underlying noisy patterns. Such patterns disappear with the reprojection, explaining why the reprojection indexes do not worsen. The other methods achieve much lower values of D_ρ, ranging from 0.084 to 0.115 (around 90% correlation), which corresponds to a higher coherence between the spatial features of the results and the PAN, easily appreciable through visual inspection. For completeness, Figure 12 shows analogous results for a WV2 tile, which basically confirm the considerations made above for WV3.
5. Conclusions
To cope with the limitations of the reference-based reduced-resolution procedures for the quality assessment of pansharpening methods, in this work, we have proposed a new full-resolution no-reference evaluation framework. Following Wald’s protocol [65], the full-resolution assessment must be carried out by checking the “consistency” of the fused products with the MS and PAN input components, rather than their “synthesis” capacity, which would require the availability of GTs. In particular, inspired by Khan’s index [73], we have proposed reprojection-based indexes with embedded alignment, to handle misregistered datasets, for the assessment of the spectral consistency between the pansharpened image and the input MS. Moreover, the spatial consistency between the fused image and the PAN is quantified by averaging, spatially and spectrally, the fine-scale local correlation of the individual super-resolved bands with the high-resolution PAN.
A key qualifying aspect of the proposed indexes is the absence of any resolution downgrading of the input data, which frees the assessment from the effect of scale-dependent phenomena. Experiments on reduced-resolution datasets show that the reprojection indexes are reliable predictors of image quality as quantified by reference-based indexes, supporting their use in the full-resolution domain. On the other hand, experiments on full-resolution data make clear that the local correlation-based index provides indications on image quality that largely agree with the judgement of human experts. The proposed approach can be readily generalized to fusion tasks other than pansharpening, such as, for example, the combination of low-resolution hyperspectral and high-resolution multispectral images.
That said, the user must also be aware of some limitations. In particular, the reprojection requires an accurate estimate of the sensor MTF for correct low-pass filtering; a wrong estimate would lead to an inaccurate spectral consistency assessment, a problem shared with preexisting solutions. Moreover, the proposed correlation-based distortion index relies on the assumption that the PAN correlates well with all spectral bands. This hypothesis is globally acceptable, but there can be rare cases, spatially and spectrally well localized, in which it is too strong.