Full-resolution quality assessment for pansharpening

A reliable quality assessment procedure for pansharpening methods is of critical importance for the development of the related solutions. Unfortunately, the lack of ground-truths to be used as guidance for an objective evaluation has pushed the community to resort to two approaches which can also be jointly applied. Hence, two kinds of indexes can be found in the literature: i) reference-based reduced-resolution indexes aimed to assess the synthesis ability; ii) no-reference subjective quality indexes for full-resolution datasets aimed to assess spectral and spatial consistency. Both reference-based and no-reference indexes present critical shortcomings which motivate the community to explore new solutions. In this work, we propose an alternative no-reference full-resolution assessment framework. On one side we introduce a protocol, namely the reprojection protocol, to take care of the spectral consistency issue. On the other side, a new index of the spatial consistency between the pansharpened image and the panchromatic band at full resolution is also proposed. Experimental results carried out on different datasets/sensors demonstrate the effectiveness of the proposed approach.


Introduction
Image pansharpening is the process of merging two observations of the same scene, a low resolution multispectral (MS) component and a high resolution panchromatic (PAN) component, to generate a new multispectral image that displays both the rich spectral content of the MS and the high resolution of the PAN. By following the taxonomy proposed in [79], pansharpening methods can be roughly grouped in four main categories, component substitution (CS) [70], multiresolution analysis (MRA) [60], variational optimization (VO) [82,57], and machine/deep learning (ML) [45,91].
In the CS approach, the multispectral image is transformed in a suitable domain where one of its components is replaced with the PAN. In the particular case of three spectral bands, the Intensity-Hue-Saturation (IHS) transform is an option where the intensity component can be replaced with the PAN band [74]. This method has been generalized in [73] (GIHS) to handle a larger number of bands. Other useful transforms to implement a CS solution include the principal component analysis [15], the Brovey transform [27] and the Gram-Schmidt (GS) decomposition [37]. More recently, adaptive CS methods have also been proposed, such as the advanced versions of GIHS and GS [5], the partial replacement CS method (PRACS) [17], or the band-dependent spatial detail (BDSD) injection method and its variants [26,24,77].
In MRA approaches [60], the pansharpening task is addressed from the perspective of a pyramidal decomposition to separate low-frequency content from detail components. The high frequency spatial details are extracted by means of a multi-resolution decomposition, such as decimated or undecimated wavelet transforms [53,60,54,32], Laplacian pyramids [2,3,4,40,63], or other nonseparable transforms, e.g., contourlet [67]. Extracted details are then properly injected into the upscaled MS component.
A further set of methods address the pansharpening problem through the variational optimization of suitable models of the fused image. In [82] the optimization target involves the degradation filters mapping high-resolution to low-arXiv:2108.06144v3 [cs.CV] 4 Apr 2022 Full-resolution quality assessment for pansharpening resolution images, while [75] leverage on sparse representations for detail injection. Palsson et al. proposed several methods of this class. A total variation regularized least square formulation is provided in [56]. The same research team has framed the pansharpening as a maximum a posteriori problem in [55] and, more recently, explored the use of low-rank representations of the joint PAN-MS [57]. Other methods do not fit the above categories and can be roughly classified as statistical [23,94,48,69,97], dictionary-based [42,41,101,16,100,29], or matrix factorization approaches [92,38,30]. The reader is referred to [78,79] for a more comprehensive review.
In recent years, a paradigm shift from model-based to data-driven approaches has revolutionized all fields of image processing, from computer vision [34,20,28,39,96] to remote sensing [91,65,13,47]. In pansharpening, the first method based on convolutional neural networks (CNN) was proposed by Masi et al. in 2016 [45], after which many more followed in a few years' span [91,88,89,62,46,11,93,43,68,76,95,22,21]. It seems safe to say that deep learning is currently the most popular approach for pansharpening. Nonetheless, it suffers from a major problem: the lack of ground truth data for supervised training. In fact, multi-resolution sensors can only provide the original PAN-MS data, downgraded in space or spectrum, never their high-resolution versions, which remain to be estimated.
Based on this brief and certainly not exhaustive overview, it appears that pansharpening is a very active research field, with many new methods proposed every year. Reliable quality assessment procedures are of critical importance for correctly advancing the state of the art, and a wrong evaluation paradigm may negatively impact the design or tuning of any new solution. Unfortunately, by the very nature of pansharpening, no Ground Truth (GT) data are available to perform a reference-based assessment. As a consequence two kinds of quality assessments are usually employed: i) reference-based reduced-resolution assessment (synthesis check) ii) no-reference full-resolution assessment (consistency check).
Lacking GTs, the synthesis capabilities of any pansharpening method can only be assessed on "synthetic" data. In particular, Wald's protocol [86] suggests taking the real PAN-MS data and applying a proper resolution downgrade process for scale reduction. The scaled PAN-MS pair is the synthesized image whose GT is now given by the original MS component. The way the resolution downgrade has to be performed has been object of intense research in the last two decades and has a non negligible impact on the reliability of the consequent quality assessment. However, no matter how a GT is obtained, there exist plenty of reference-based image quality indicators. The spectral angle mapper was introduced in [35] and assesses the balance among the spectral bands. On the contrary, the spatial correlation coefficient [98] computes the correlation coefficient across the high-pass filtered bands of the pansharpened image and of the GT. It is therefore oriented to the assessment of the spatial quality. The Erreur Relative Globale Adimensionnelle de Synthése [85] generalizes the root mean squared error, introducing band-wise correction weights. The universal image quality index [87] compares image statistics to take into account local correlation, intensity and contrast. Based on this general purpose image quality index, domain specific variants suitably adapted to the pansharpening have been proposed in [7,25]. In addition to these indexes, other popular general purpose options are sometimes considered for pansharpening, e.g., the peak signal-to-noise ratio or the structural similarity (SSIM) index.
Besides, spectral or spatial consistency indexes suited for full-resolution images which do not require any GT have also been developed. A first spectral distortion index proposed by Zhou et al. [98] compares a rescaled version of the pansharpened image with the MS image. Other spectral distortion indexes follow a similar procedure [6,51]. In [58] the Quality for Low Resolution (QLR) index based on SSIM was proposed and later updated exchanging SSIM with the composite image quality measure CMSC based on means, standard deviations and correlation coefficient [59]. Basically, most of the known spectral distortion indexes follow this protocol: degradation of the pansharpened image and comparison with the MS image. The different behavior stems from the degradation model and the error measure. For example [58] is based on the structural similarity indexes while [59] leverages on a composite image quality measure based on statistics such as mean, variance and correlation coefficient. On the other hand, the spatial consistency can be checked through the cross-scale invariance of some statistics [6]. A similar approach is proposed in [51] where only high-frequency components are concerned. In [58] it is proposed a Quality for High Resolution (QHR) index which assesses the SSIM index between the panchromatic band and a projection of the pansharpened image in the PAN domain. A variant of this approach with a different error term was proposed in [59]. Another similar approach is proposed in [10], where the coefficient of determination is used to compare the PAN image with its projection from the pansharpened image. It is also worth to mention other no-reference quality indexes, such as the Natural Image Quality Evaluator (NIQE) [50,36] that seeks to assess the image quality by-itself rather than checking consistency.
Finally, often the spectral and the spatial consistency indexes are somehow combined, e.g. through geometric means, to provide a single hybrid consistency index representing the overall quality no-reference assessment of the fused images [6,51,58,59,1,36,49,14,84,83].
Both reference-based and no-reference indexes present inherent limitations which are analyzed in the next Section.
In this work, we propose new full-resolution quality indexes that overcome some of these problems and provide a more reliable guidance for the development of ever more accurate pansharpening methods. To this purpose, the developed indexes have been made available to the community through a web repository at https://github.com/matciotola/ fr-pansh-eval-tool/.
The remainder of the paper is organized as follows. Section 2 provides a brief critical survey on pansharpening assessment. Section 3 introduces the proposed approach. Section 4 discusses the experimental results and, finally, Section 5 draws conclusions.

A review of panshapening indexes
Due to the lack of ground truths, visual inspection by human experts is the ultimate benchmark for pansharpening methods. However, this is a lengthy and tedious work, and numerical indexes are essential for a viable development process. Reduced-Resolution (RR) indexes are computed, according to Wald's protocol [86], by reducing the scale of both MS and PAN components. Then, pansharpening is performed on these rescaled data, and the original MS is used as GT to compute reference-based indexes. Instead, Full-Resolution (FR) indexes are computed on the original data and try to assess their spectral and/or spatial consistency with the pansharpened output. In a comprehensive evaluation, following Wald's protocol [86], usually, both types of indexes are considered together in addition to visual inspection [79]. In order to highlight the critical points of the current approaches to the quality assessment, we provide a brief overview of some of the most representative indexes in the following.

Reduced-resolution assessment
Let M and P be the original MS and PAN components, respectively, andM be the pansharpened image. The synthesis properties can not be directly verified on the full-resolution pair (M, P ) due to the lack of the GT with which to compareM . Leveraging on a scale invariance hypothesis, it is however possible to synthesize datasets with GT by properly downgrading the available (M, P ) pairs by a factor R (PAN-MS resolution ratio). The resulting reduced resolution pair (M ↓ , P ↓ ) can therefore be used as the source input for the pansharpening task, whereas the original MS image M plays as GT. Once moved to the reduced resolution space, many different quality indexes can be computed to assess the mismatch between the pansharpened image and its reference GT, such as root mean square error, structural similarity, and so on. In particular, referring to some of the most popular ones for pansharpening, the most common are the following. a) SAM (Spectral Angle Mapper) [35]. It determines the spectral similarity in terms of pixel-wise average angle between spectral signatures. Said v andv two corresponding pixel spectral responses to be compared, SAM is obtained by averaging over all image locations the following "angle" among vectors: b) ERGAS (Erreur Relative Globale Adimensionnelle de Synthése) [85]. It is one of the most popular indexes to assess both spectral and structural fidelity between a synthesized image and a target GT. Such an index presents interesting invariance properties. Indeed, it is insensitive to radiometric range, number of bands and resolution ratio. Said B the number of spectral bands, it is defined as where RMSE b is the root mean square error over the b-th spectral band and µ GT b is the average intensity of the b-th band of the GT image. c) Q2 n [25]. It is a multiband extension of the universal image quality index (UIQI) [87]. Each pixel of an image with B spectral bands is accommodated into a hypercomplex (HC) number with one real part and B-1 imaginary parts. Let z andẑ denote the HC representations of a generic GT pixel and its prediction, respectively, then Q2 n can be written as the product of three terms: The first factor provides the modulus of the HC correlation coefficient between z andẑ. The second and the third terms measure contrast changes and mean bias, respectively, on all bands simultaneously. Statistics are typically computed on 32×32 pixels blocks, then and an average over the blocks of the whole image provides the global Q2 n score, which takes values in the [0, 1] interval, being 1 the optimal value achieved if and only if z =ẑ in each location.
Behind its appealing simplicity, regardless of the specific error measurement employed, the reduced resolution assessment hides some major subtle pitfalls that undermine its usefulness. On one hand, its accuracy depends critically on the method used for scaling, which may not correspond to the actual sensing conditions. More fundamentally, it relies on the arbitrary assumption that a method optimized for the RR scale will keep its good behavior at the FR scale. Experimental evidence shows this not to be the case. Furthermore, optimizing parameters for a given scale may lead to a sort of "scale overfitting", with the perverse effect of degrading performance on a different scale. For these reasons, many recent studies focus on no-reference full-resolution quality indexes [36,10,84,49]. It is also worth to recall that the assessment of spectral and spatial consistencies is not necessarily linked to a fusion task and is of a broader interest, with applications in the medical [71], agricultural [33] and food [99] sciences, to mention a few.

Full-resolution no-reference assessment
As a consequence of the above limitations, many studies have moved toward a no-reference approach to the pansharpening quality assessment. In particular, the same Wald's protocol [86] recognizes the problem invoking a complementary consistency quality check in addition to the synthesis capacity assessment. In practice, for consistency-based assessment, the input data are pansharpened in their original (full) resolution space. The outcome is then compared with both input components, MS and PAN, for spectral and consistency assessment, respectively. The spectral consistency check requires a spatial resolution degradation of the pansharpened image, which is typically done by means of a Low-Pass Filtering (LPF) followed by decimation, using LPFs that approximate the sensor Modulation Transfer Function (MTF) [9]. Other approaches have also been explored, such as wavelet transforms [19,61], averaging operators [52], pyramid decomposition [2]. The spatial (or structural) consistency check is indeed more controversial lacking a clear relationship that allows to project the pansharpened MS in the PAN domain [72]. No-reference indexes are also often referred to as FR indexes, as they do not require any resolution degradation of the input data, contrarily to the reference-based ones which do need it because of the lack of GTs. For this reason, the latter are also referred to as RR indexes.
Among the several FR quality assessment approaches proposed in the last years [6, 98, 51, 67, 10], we will recall a few of them which are the most frequently used. In particular, Alparone et al. [6] proposed a spectral distortion index D λ defined as , with subscripts indicating selected bands and being Q(·, ·) the UIQI [87]. Such an index was also enclosed in the pansharpening benchmark toolbox [78]. As it can be observed, D λ compares the inter-band relationship through the UIQI, separately for the input MS M and the pansharpened imagê M , and then quantifies their divergence over each coupling of spectral bands. As it can be figured, lacking a direct measurement of the distance between M andM , and leveraging on an assumption of cross-scale inter-band correlation invariance only, there can be important spectral aberrations that can remain unseen by D λ . The recently published revisited version [79] of the same toolbox [78], indeed, makes use of an approximated version of Khan's index proposed in [51] to monitor the spectral consistency. In particular, Khan's index is defined as follows whereM lp ↓ indicates the MTF-based low-pass filtered and decimated version of the pansharpened imageM , while M is the original input MS. Its approximation used in the toolbox [79], name it D (K) λ , avoids the decimation ofM by using an upscaled version M of M : beingM lp the low-pass version ofM . These indexes range between 0 (optimal value) and 1, and clearly relate to the spectral consistency, since the resolution downgrade process applied toM gets rid of the high-pass content retrieved from the PAN. However, two critical points should be taken into account: a) dependence on the accuracy of the estimated MTF; b) sensitivity to the PAN-MS alignment.
In particular, as a consequence of b), methods such as several CS approaches [44,37,5,63,17] that intrinsically address the coregistration problem are penalized by both D on misaligned datasets. Hence, in order to fairly use Khan's index it is usually recommended to keep separate the registration and pansharpening problems by means of a prior coregistration of the datasets to be used for testing [78,79].
Several indexes have also been proposed for FR spatial consistency check [6,51,58,10]. In particular, the spatial distortion index proposed in [6] is used in the most recent pansharpening assessment toolbox [79] and is defined as: where P ↓ is the original PAN component resized to the MS scale and Q(·, ·) is the UIQI quality index [87]. This index is somehow similar to D λ as it checks the cross-scale invariance of the PAN-MS spectral correlation. Therefore it inherits similar limitations, which are: i. no direct comparison between the pansharpened imageM and the PAN P ; ii. a cross-scale invariance assumption for which there are no guarantees.

Proposed full-resolution indexes
We now propose some new indexes for assessing the quality of pansharpening, aiming to overcome some limitations of traditional ones. Given the weakness of the scale-invariance assumption of reduced-resolution quality assessment indexes, we focus on full-resolution ones, moving our perspective from synthesis to consistency. In the following, we will discuss spectral and spatial assessment separately.

Reprojection protocol for spectral accuracy assessment
The assessment of the spectral consistency is not as controversial as the assessment of the spatial one. In particular, as far as we know, Khan's approach is the most trusted one for this purpose. Nonetheless, as remarked above, it has some limitations and, in particular, in order to work properly, the datasets must be well aligned.
To cope with this issue without the need to operate any coregistration on the available datasets, let us now consider a variation of Khan's index where the decimation ofM lp is performed using a band-wise shift that maximizes the correlation coefficient between the low-pass filtered PAN P lp and each ideally interpolated MS band M b . In case of aligned datasets, it returns the usual Khan's index because no shifts are applied. Otherwise, before decimation, we apply toM the same shift that would align M and P lp . Let us indicate as D λ,align such an index which is formally given by where ↓a indicates decimation with band-wise alignment.
Let M indicate a generic reference-based distortion index, such as Q2 n , SAM or ERGAS. We define the reprojection index R-M as that in the particular case of M = Q2 n reduces to the complement of the D In other words, the pansharpened imageM is projected in the original MS space prior to be compared with the MS image M (reference) using the error index M. The projection accounts for both eventual misalignment and sensor MTF using matched low-pass filters. Of course, because of the low-pass filtering, the high frequency image content (spatial details) is discarded and some further indexes will be necessary to assess spatial quality. Note also that, under the hypothesis that the sensor MTF is correctly simulated, the reprojection indexes ensure that an ideal pansharpening (GT) would achieve the best score as expected. This will be experimentally analyzed in Sec. 4.3.
A pictorial representation of the proposed reprojection indexes which relates them to the synthesis assessment protocol proposed by Wald [86] is given in Fig.1. The common starting point (top-left) is the available (P, M ) pair. Indexes based on Wald's protocol follow the counterclockwise path: decimation-pansharpening-measure. The proposed reprojection protocol, instead, consists in following the clockwise path: pansharpening-decimation-measure, where the actual pansharpened image is first computed, and then reprojected on the low-resolution domain for quality assessment. Eventually, in both cases, the original MS is used as a reference for any error measurement. In our proposal, however, pansharpening precedes the resolution downgrading, thereby exploiting the most valuable source of spatial information, the original PAN. Furthermore, the overall evaluation no longer depends on how the PAN is downscaled. Since the the reprojection protocol involves only the low-resolution component of the synthesized image, it has to be considered a spectral consistency rather than a synthesis check. A pseudo-code of the reprojection indexes is given in Algorithm 1.

Correlation-based spatial consistency index
In order to complement the reprojection-based assessment, we also propose a new index to evaluate the spatial consistency with the aim of obtaining indications better correlated to human judgment than those provided by D S . In particular, to assess the preservation of fine-scale spatial structures, we compute the average local correlation between the pansharpened image and the PAN.
Let X σ ij indicate a σ × σ patch of X centered on location (i, j). We compute the correlation field ρ σ PM given by the local correlation coefficients between P and each band b ofM : Then we reduce it to its average value ρ σ PM over space and spectral bands. The final index is then defined as such that D ρ = 0 corresponds to perfect correlation. From a different point of view, we are studying to what extent a σ×σ patch of any band ofM can be linearly predicted from the corresponding PAN patch. Therefore, D ρ is strictly related to the matching between the spatial layouts ofM and P at a fine scale.
The choice of the scale parameter σ is of critical importance, as we are interested to the exploitation of the complementary information discarded by the reprojection indexes, i.e. details appearing only at the PAN scale but not at the MS scale. Therefore, if R is the resolution ratio, by choosing σ = R (equal to 4 for our datasets) we can put the focus just on that spatial content which is not visible at the MS scale. By doing so, we reduce the dependence between the proposed spatial (D ρ ) and spectral (R-M) distortion indexes, assigning to them clear but well separated evaluation objectives.
As for D S and D (K) λ , or its variants, D ρ computed forM =GT does not reach the zero value, given the slightly different spatial layout of different bands in natural images and the complex relationship between the high resolution image and its panchromatic projection [72]. Nonetheless, we expect good quality pansharpened images to present relatively small values of D ρ . A detailed discussion in this regard supported by our experimental results will be provided in the following.

Experimental results and discussion
In this Section we present several experimental analyses. Preliminarily, a summary of the datasets employed and on the involved methods is provided. The first experiment focuses on the dependence of the spectral consistency indexes on the alignments of the MS bands with the PAN. Next, we move to the reduced-resolution domain to cross-check reference-based and no-reference indexes thanks to the availability of the ground-truth. Then, it follows an experimental analysis focused on the assessment of the spatial consistency, before closing with an overall comparison.

Datasets and methods
The experimental validation relies on 25 methods provided in the benchmark toolbox [79] belonging to the four main categories recalled in Section 1, CS (8), MRA (9), VO (3) and ML (4), plus an ideal interpolator (EXP). The dataset is composed by two WorldView-2 (WV2) and two WorldView-3 (WV3) large images, courtesy sample products of DigitalGlobe © . 20 512×512 tiles were extracted from the WV2 images (Washington and Stockholm) and 20 from the WV3 images (Adelaide and Mexico City) for the experiments at full resolution. Likewise, 20+20 2048×2048 tiles were extracted and downscaled to size 512×512 for the experiments at reduced resolution. Tab. 1 summarizes the main spectral and spatial characteristic of the WV-2 and WV-3 sensors.  λ,align . Detailed information about the methods can be found in [79].

Spectral distortion dependence on PAN-MS misalignment
The impact of bands misregistration on the quality of the fused products has been already recognized in the past [12,31,90]. Here we propose an ad hoc experimental analysis to show the robustness of the proposed reprojection indexes to the data mis-registration. The starting point is a WV3 dataset composed of ten 2048×2048 tiles extracted from a larger image of Adelaide (DigitalGlobe © sample product). This dataset presents bands misalignment that we have corrected to produce a companion aligned dataset. In Tab. 2 we summarize the average spectral distortion scores for each method for both datasets. For a handy reading of these numbers, in Fig. 2 we show the impact of data misregistration on D (K) λ and D (K) λ in differential terms. Each bar indicates the difference between the values of the given indicator computed on aligned and misaligned datasets, respectively. As it can be clearly noticed, traditional CS methods such as BT-H, GS, GSA, C-GSA, PRACS, that by construction provide fused images that are strongly anchored to the PAN geometry, pay a considerable spectral loss according to D λ . These results are not aligned with our expectations, as these CS methods are expected to be more robust to misregistration. The Reader can refer to [12] for a theoretical comparison between CS and MRA methods in presence of misregistration, with the former category being superior to the latter.
In Fig. 3, instead, it is compared Khan's index D λ,align , except for some CS methods, such as BT-H, GS, GSA, C-GSA and PRACS, which tend to preserve the PAN geometry operating an intrinsic alignment. On the contrary, other methods, notably those belonging to the MRA, VO and ML categories, that are oriented to the spectral preservation but do not operate alignment, register an increase of the spectral distortion when the metric takes into account the misalignment. λ . Bar levels indicate the variation of the indexes due to data misragistration.  λ,align = 1−R-Q2 n vs Q2 n on the WV2 RR dataset. Both the "aligned" (top) and "mis-aligned" (bottom) dataset cases are shown. Each marker is associated to a single pansharpening result on a 512 × 512 tile. Black markers correspond to the ground-truth. Light gray ones (EXP) correspond to a simple interpolation. The remaining clusters correspond to the different algorithms grouped by categories.
Concluding, the proposed experiment suggests that Khan's index with alignment correction, that is the complement of the proposed reprojection index R-Q2 n , is more robust to misalignment or, at least, more consistent with the theoretical expectation [12], than the original Khan's index D

Reference vs no-reference indexes cross-checking in the reduced resolution space
In order to assess the consistency between no-reference and reference-based indicators, we have designed a set of experiments in the reduced resolution space. Despite no-reference indexes are conceived to work in the full-resolution framework, their use on (simulated) reduced resolution datasets allows us to carry out studies on the correlation between them and objective error measurements (reference-based indexes) thanks to the availability of GTs. In particular, for this experiment we have resorted to our WV2 dataset 1 composed of twenty 2048×2048 images at full-resolution. These images come from two larger images of Washington (13) and Stockholm (7), respectively, courtesy samples of DigitalGlobe © . Each such tiles was resized to 512×512 pixels using the usual Wald's downgrading protocol. This dataset was already well coregistered, therefore we have proceeded to create a misregistered counterpart operating a simple modification of the downgrading process: a 1-pixel shift (in both directions) has been introduced in the decimation (after LPF) of the MS bands.
Then, 25 pansharpening algorithms provided by the toolbox [79] were run on each RR sample image generating 500 results, for each of the two datasets (registered and not), for which all the indexes of interest have been computed. Eventually, we have obtained hundreds of points in a multi-dimensional evaluation space, which enable plenty of analyses, with some dimensions corresponding to the reference-based indexes (SAM, ERGAS and Q2 n , SSIM, CMSC), and the remaining ones associated to the proposed indexes (R-SAM, R-ERGAS, R-Q2 n = 1 − D (K) λ,align , D ρ ) and to other state-of-the-art no-reference indexes (D λ , D λ , D S , QLR 1 , QLR 2 ). In particular, by construction we expect to observe a good correlation between Khan's index variants and Q2 n . Therefore we start from the scatter plots shown in Fig. 4 with such variants, in turn, vs Q2 n . Both the "aligned" (top) and the "mis-aligned" (bottom) datasets are considered. For the aligned case, by visual inspection we can appreciate that D   Figure 5: From left to right QLR 1 vs SSIM and QLR 2 vs CMSC on the WorldView-2 RR dataset. Both the "aligned" (top) and "mis-aligned" (bottom) dataset cases are shown. Each marker is associated to a single pansharpening result on a 512 × 512 tile. Black markers correspond to the ground-truth. Light gray ones (EXP) correspond to a simple interpolation. The remaining clusters correspond to the different algorithms grouped by categories.
the GTs (black dots), for which Q2 n is always 1, get the ideal value (0) of spectral distortion. 2 Besides, D (K) λ is not minimized by the GT, coherently with its definition (Eq. 6) based on the comparison between the smoothed GT and the upscaled MS. It is also worth to notice that the correlation degree between Q2 n and the different variants of spectral distortion grows when assessed by category (by colors), supporting the idea that no-reference indexes are more reliable for intra-class assessment [79]. Moving to mis-aligned dataset (bottom scatters), it can be clearly recognized a bias for both D register a degradation, with CS methods (blue) much more penalized than others. This last observation gives a further support the interpretation provided above about the results of Fig. 2 and Fig. 3. A similar behavior is also registered other state-of-the-art indexes. For example, in Fig.5 the score scatters in the QLR 1 -SSIM (left) and QLR 2 -CMSC (right) planes, with registered (top) and misaligned (bottom) datasets.
Let us now leave the registration problem out considering aligned datasets only and looking at the relationship between the spectral consistency indexes and the three most used reference-based indexes, i.e., SAM, ERGAS and Q2 n , with the help of Fig. 6. From top to bottom we can see how different compared spectral consistency indexes agrees with the three (column-wise distributed) reference-based indexes. From the top row scatters it clearly appears that D (K) λ agrees with SAM and ERGAS much lesser than with Q2 n . This because D (K) λ is based on the same Q2 n index but also because SAM and ERGAS encode a different concept of quality. Similar considerations apply for QLR 1 and QLR 2 who, as well as Q2 n , are both based on the comparison among local statistics. For these reasons, we believe that in addition to R-Q2 n it makes sense to also provide R-SAM and R-ERGAS for a more comprehensive evaluation of pansharpening methods. On the bottom row, we show all three (M, R-M) scatter plots, which speak in favor of a good level of agreement between each objective index M and the reprojected counterpart R-M.
Beside the qualitative interpretation of the score scatters, we can quantify the level of agreement among the referencebased indexes and the compared no-reference indexes in terms of correlation coefficients. These are shown in Tab. 3 in the usual matrix form, for both the aligned (a) and misaligned (b) dateset. As expected the reprojected indexes show a relatively high correlation with the non-reprojected counterpart, both for aligned and misaligned datasets. D (K) λ and D (K) λ also correlate well with Q2 n , but only in the aligned case. Likewise, QLR 1 and QLR 2 also correlate well with different reference-based indexes in ideal conditions but register a drop on misaligned data. It is also worth to remark that GT scores have been discarded to not penalize excessively the competitors on the misaligned dataset (see GT score distribution in Fig. 4-5)

A qualitative assessment of the proposed spatial distortion index
While objective synthesis quality indexes such as Q2 n and ERGAS account for both spectral and spatial inaccuracies, their reprojected versions clearly limit their scope to the spectral component as they are computed on low-pass versions of the fused products ignoring image "details". It is therefore necessary to complement these indexes with some spatial quality indicator.
The assessment of the spatial quality of a pansharpened image in the full-resolution framework is very difficult and somewhat controversial, lacking a degradation model to describe the relationship between the full-resolution spatialspectral datacube (fused image) and the panchromatic image. In fact, while the spectral degradation can be reasonably modeled through the sensor MTF, allowing for a spectral consistency check, the spatial degradation cannot be modeled through a simple global weighted band average [72]. Indeed, in addition to the obvious spatial dependency of the spectral mixing process that provides the PAN image, there is also a mismatch between the PAN spectral coverage and the MS coverage [72]. This means that there could be details "seen" by the PAN but not by the virtual full-resolution MS counterpart, and vice versa. This is the origin of an ill-posedness of the pansharpening problem, mostly residing in the spatial reconstruction, which makes the spatial consistency assessment subjective to some extent. On the basis of this observation, we provide here an interpretation of some sample results through visual inspection, comparing the proposed index D ρ with the state-of-the-art spatial distortion index D S . Of course, given the subjective nature of this kind of analysis, we will focus on clearly visible phenomena, leaving to the Reader the final say based on his/her visual perception of more subtle patterns, if any. In particular, we propose two experiments: an "horizontal" comparison among the pansharpening toolbox methods and a "vertical" comparison where a single machine learning method is optimized using the proposed D ρ (varying the related scale parameter σ) jointly with a term controlling the spectral consistency.
Let us start with the first experiment. Fig. 7 shows some FR pansharpening results for crops extracted from a single tile of the WV3 Adelaide image. 3 The PAN component, used as a reference, is shown in the middle of different groups of results. In the top half part of the figure, the top-4 solutions according to D S (left) and the top-4 according to D ρ (right) are shown with D S and D ρ computed on the whole tile. Similarly, in the bottom half part, the top-4 results according to QHR and NIQE are shown with related scores. It appears clearly that images selected according to the D ρ index ensure a better agreement with the reference in terms of spatial layout and, more in general, a better quality, with sharp contours, accurate textures, and the lack of annoying patterns such as those present in some top-D S and top-NIQE images. It is interesting to observe that, among the top-4 D S results, the most convincing one from visual inspection corresponds also to the relative lowest D ρ (AWLP). It is also worth to observe a certain degree of agreement between D ρ and QHR. Similar phenomena are observed on all other tiles.
In the previous experiment we have set the scale parameter σ = R = 4 according to the above discussed theoretical motivations (Section 3.2). To gain insight into the effectiveness of this choice we have designed an additional ad hoc experiment to provide pansharpening results with controlled spectral and spatial qualities, with the latter quantified through D ρ , configurable using different scale settings. This is achieved by leveraging on a CNN model working in target-adaptive modality [18] and using a combination of the wanted consistency measures as a loss. In this context, the choice of the specific CNN model is not critical as we work in adaptive modality, by running unsupervised tuning iterations on each test image until the loss terms reach the desired level (overfitting). In particular, the optimized loss is defined as where L * λ and D * ρ are two prefixed target levels for the spectral ( M lp ↓a − M 1 ) and spatial (D (σ) ρ ) quality indicators, respectively. In practice, the network parameters are pushed to overfit the test image so that the corresponding loss reaches the target quality (L * λ , D * ρ ). These two threshold values could be ideally set to zero, but this would lead to extremely long tuning and may also generate instability because of a conflicting interaction between the two loss terms. In particular, we have set L * λ = 0.015, quite a low value considering the dynamic range of the pixels values, and D * ρ = 0.1, which is also quite low according to our experiments. By doing so, each test requires a different number of tuning iterations. We have therefore oversized the number (experimentally set to 5000) of iterations to ensure convergence for all. This process is repeated for several choices of the scale parameter σ ranging from 2 to 64. In Fig. 8, we show some crops from the WV3 dataset. These sample results reflect a general behavior that we have observed in a wide range of experiments, that is a relatively good response in the range of σ ∈ [4,8]. For smaller (σ = 2) or larger (σ > 8) scale values noisy patterns arise. These are particularly noticeable on roofs and roads for the selected samples. The above observations provide an experimental validation of the choice σ = R = 4 proposed in this work on the basis of the theoretical motivations discussed in Section 3.2.
The present experiment gives us the opportunity to also make some considerations about the computational load related to the proposed indexes. In general, in the context of quality assessment, the computation time is not a critical issue as it is carried out offline. However, with the advent of deep learning, researchers started to use such quality indexes as loss functions for training purposes. In this regard, it is worth to distinguish pixel-based indexes (e.g., SAM, ERGAS, α -norms), that are relatively light to compute, from other indexes that involve local statistics computed at a certain scale for each location (e.g., Q2 n , D ρ , D (K) λ ). When using the latter as loss the training may slowdown. In the particular case of D ρ , that has been included in the loss function to finetune the sample CNN employed in the experiment, we have actually registered a moderate impact on the training time. In fact, using a training batch composed of a single 2048×2048 image, the time consumption per iteration, without or with the additional D ρ loss component (second term in Eq. 13), shifts from 1.27s to 1.6s on a NVIDIA P6000 GPU.

Comparative results
To conclude, we present here an overall comparison among the available pansharpening methods provided by the toolbox [79], using the proposed quality indicators and supported by the visual inspection of some sample results. In Fig. 9 and Fig. 10 we gather the average numerical results obtained on our WV3 and WV2 datasets, respectively. As expected, the numerical results clearly show a good level of agreement among the three reprojection error indexes, all   essentially linked to the spectral consistency. However, some exceptions can be observed, particularly for the WV2 dataset (Fig. 10) where some CS methods register a performance loss in terms of D (K) λ,align not observed on R-SAM and R-ERGAS. We also recognize a good agreement with some literature findings, particularly on WV3 (Fig. 9), such as the spectral accuracy gap between MRA and CS methods [78] or the competitiveness of some MRA, VO and ML solutions. Moreover, for the ML solutions, it is worth to notice a performance gap moving from one dataset (WV3) to another (WV2). Indeed, such variability is not unusual for data-driven approaches that can suffer generalization limits when the training dataset is not sufficiently representative. In this particular situation, it could likely be the case that the WV2 test images used are too different from the images used for the training of the involved ML methods.
To gain insight into the effectiveness of the proposed indexes let us give a look to some sample results for a complementary subjective analysis. In Fig. 11 we show some clips from a single WV3 tile of the Adelaide dataset. On the leftmost column are gathered the PAN and MS input components on two consecutive rows. Then, the corresponding pansharpening results of the best performing solutions according to R-Q2 n are shown in decreasing order from left to right, next to the PAN. Next to the MS, we also show the "reprojection" error map, that is the difference between the input MS and the reprojection (LPF plus decimation) of the output. On the bottom lines are gathered the R-Q2 n and D ρ scores. Therefore, the spectral quality of the results decreases left to right. This can also be appreciated through the inspection of the reprojection error maps. Besides, the best methods from the spatial perspective (D ρ ) follow a different ordering. This partially reflects a tradeoff between spectral and spatial features. Particularly interesting is the case of the method TV which scores first in terms of R-Q2 n (actually it is the best according to R-SAM and R-ERGAS, as well) but shows the worst D ρ value, 0.215, corresponding to a 78.5% average correlation between the PAN and the pansharpened bands. The impact of this low correlation level is clearly visible in the pansharpening results of TV which show underlying noisy patterns. Such patterns disappear with the reprojection explaining why the reprojection indexes do not worsen. The other methods achieve much lower values of D ρ , ranging from 0.084 to 0.115 (around 90% of correlation), to which correspond to a higher coherence between the spatial features of the results and the PAN, easily appreciable through visual inspection. For completeness, in Fig. 12 we show analogous results for a WV2 tile, which basically confirms the same considerations made above for WV3.

Conclusions
To cope with the limitations of the reference-based reduced-resolution procedures for the quality assessment of pansharpening methods, in this work we propose a new full-resolution no-reference evaluation framework. By following Wald's protocol [86], the full-resolution assessment must be carried out checking the "consistency" of the fused products with the MS and PAN input components rather than the "synthesis" capacity which would require the availability     of GTs. In particular, inspired by Khan's index [51], we have proposed the use of reprojection-based indexes with embedded alignment to handle mis-registered datasets for the assessment of the spectral consistency between the pansharpened image and the input MS. Besides, the spatial consistency between the fused image and the PAN is quantified by averaging, spatially and spectrally, the fine-scale local correlation of individual super-resolved bands with the high-resolution PAN.
A key qualifying aspect of the proposed indexes is the absence of any resolution downgrading of the input data, which frees the assessment from the effect of scale-dependent phenomena. Experiments on reduced resolution datasets show that the reprojection indexes are reliable predictors of image quality as quantified by reference-based indexes supporting their use in the full-resolution domain. On the other hand, experiments on full-resolution data make clear that the local correlation-based index provides indications on image quality that largely agree with the judgement of human experts. The proposed approach can be readily generalized to fusion tasks other than pansharpening such as, for example, the combination of low-resolution hyperspectral and high-resolution multispectral images.
On the other hand, the User must also be aware of some limitations. In particular, the reprojection requires the knowledge of an accurate estimation of the sensor MTF for a correct low-pass filtering. A wrong estimation would lead to inaccurate spectral consistency assessment, a problem shared with preexisting solutions. Moreover, the proposed correlation distortion index is based on the assumption that the PAN correlates well with all spectral bands. Such a hypothesis is globally acceptable but there can be rare cases, spatially and spectrally well localized, for which it results too strong.
In order to ensure reproducible research, the code for the proposed indexes can be found at https://github.com/ matciotola/fr-pansh-eval-tool/.