1. Scenario and Motivations
Among remote sensing image fusion techniques, panchromatic (Pan) sharpening, or pansharpening, of multispectral (MS) images has recently received considerable attention [1,2]. Pansharpening techniques take advantage of the complementary spatial and spectral resolutions of MS and Pan data, which originate from constraints on the signal-to-noise ratio (SNR) of broad and narrow bands, to synthesize a unique product that exhibits as many spectral bands as the original MS image, each with the same spatial resolution as the Pan image.
After the MS bands have been interpolated and co-registered to the Pan image [3], spatial details are extracted from Pan and added to the MS bands according to the injection model. The detail extraction step may follow the spectral approach, originally known as component substitution (CS), or the spatial approach, which may rely on multiresolution analysis (MRA), either separable [4] or not [5]. The dual classes of spectral and spatial methods exhibit complementary features in terms of tolerance to spatial and spectral impairments, respectively [6,7].
The Pan image is preliminarily histogram-matched, that is, radiometrically transformed by a constant gain and offset, in such a way that its lowpass version exhibits the same mean and variance as the component that shall be replaced [8]. The injection model rules the combination of the lowpass MS image with the spatial detail of Pan. Such a model is stated between each of the resampled MS bands and a lowpass version of the Pan image having the same spatial frequency content as the MS bands; contextual adaptivity is generally beneficial [9]. The multiplicative injection model with haze correction [10,11,12,13] is the key to improving fusion performance by exploiting the mechanism of imaging through the atmosphere. The injection model is crucial for multimodal fusion, where the enhancing and enhanced datasets are produced by different physical imaging mechanisms, as in thermal sharpening [14] and optical and SAR data fusion [15]. In the latter case, since measures of spectral reflectance and radar reflectivity of the surface cannot be directly merged, the optical views are enhanced by means of noise-robust texture features extracted from the SAR image [16].
Remote sensing image data are generally available in packed fixed-point formats, together with floating-point gains and offsets, a pair for each band of each scene, that allow floating-point calibrated values to be recovered [17]. While the maximum value of each band of the scene is mapped onto the largest digital number (DN) of the fixed-point representation, offsets are generally set equal to the minimum value, such that the active range of floating-point values of the scene is exactly mapped onto the dynamic range of the DN representation. If the offsets are all taken equal to zero, the DN and the floating-point representations differ only by a scaling factor, which, however, may differ from one band to another in the same scene, thereby originating an alteration in the spectral content of the data.
A problem seldom investigated in the literature is whether it makes a difference if fusion is accomplished in the packed DN format or in the original floating-point format. In this work, we are concerned with how the performance of pansharpening methods depends on their input data format. It is theoretically proven and experimentally demonstrated that MRA methods are unaffected by the data format, which instead is crucial for CS methods, unless their intensity component is calculated by means of a multivariate linear regression between the upsampled bands and the lowpass-filtered Pan, as is accomplished by the most advanced CS methods, e.g., [18].
Quality assessment of the pansharpened images is another debated problem. Notwithstanding the achievements of recent years [19,20,21,22,23,24,25], the problem is still open, being inherently ill-posed. A further source of uncertainty, which has been explicitly addressed very seldom [26], is that the measured quality may also depend on the data format. The quality check often entails the shortcoming of performing fusion with both MS and Pan datasets degraded to spatial resolutions lower than those of the originals, in order to use the non-degraded MS originals as quality references [27]. In this study, several widespread with-reference (dis)similarity indexes are reviewed and discussed in terms of the reproducibility of their output values with respect to the data format. We wish to stress that in the present context the term quality represents the fidelity to a hypothetically available reference, and has no relationship with the intrinsic quality of the data produced by the instrument, which mainly depends on the modulation transfer function (MTF) of the multiband system and on the SNR, due to a mixed noise model, both photon and electronic [28,29].
In an experimental setup, GeoEye-1 and WorldView-2 data are either fused in their packed 11-bit DN format or converted to spectral radiance before fusion is accomplished, by applying the gain and offset metadata. In the former case, fusion results are converted to spectral radiance before quality is measured; in the latter case, fusion results are preliminarily converted to DNs. The results exactly match the theoretical investigations. For the majority of CS fusion methods, which do not feature a regression-based intensity calculation, results are better whenever they are obtained from floating-point data. Furthermore, the assessment of nine pansharpened products from as many algorithms, carried out both on floating-point and on packed fixed-point data, reveals that quality evaluations may be misleading whenever they are performed on fixed-point DN formats.
2. Data Formats and Products
Remote sensing optical data, specifically MS and Pan, are available in spectral radiance format, that is, radiance normalized to the width of the spectral interval of the imaging instrument, and in reflectance formats. The reflectance, which is implicitly spectral, ranges in [0,1] and can be defined either as top-of-atmosphere (TOA) reflectance or as surface reflectance, also called bottom-of-atmosphere (BOA) reflectance, if it is measured at sea level. The former is the reflectance as viewed by the satellite and is given by the TOA spectral radiance rescaled by the TOA spectral irradiance of the Sun. The latter represents the spectral signature of the imaged surface; its determination requires the estimation, through parametric modeling and/or measurements, of the upward and downward transmittances of the atmosphere and of the upward scattered radiance at TOA, also known as path radiance [30].
Besides the TOA spectral radiance, which is a level-one (L1) product, the TOA reflectance, another L1 product, is generally available for systems featuring nadiral acquisition and global Earth coverage, like ASTER, Landsat 7 ETM+, Landsat 8 OLI, and Sentinel-2. On the contrary, extremely high resolution (EHR) systems (IKONOS, QuickBird, GeoEye-1, WorldView-2/3/4, Pléiades 1A/1B) perform sparse acquisitions with maneuverable acquisition angles and may not have the TOA reflectance format available. The surface reflectance is a level-two (L2) product and is distributed for global-coverage systems (OLI, Sentinel-2) only where an instrument network is available for atmospheric measurements [31], though it has recently been shown that robust estimates of surface reflectance can be achieved through the same data that shall be atmospherically corrected [32].
In order to store and distribute fixed-point data (typically 8 to 16 bits per pixel per band), more compact and practical than floating-point data, the spectral radiance/reflectance values are rescaled to fill the 256 to 65,536 digital number (DN) counts of the binary representation. A negative bias may be introduced to force the minimum radiance/reflectance value into the zero DN. If $p$ denotes the wordlength of the fixed-point representation and $L_{\max}$ ($L_{\min}$) the maximum (minimum) spectral radiance value, the conversion rule to DN is:

$$\mathrm{DN} = \operatorname{round}\left[(2^p - 1)\,\frac{L - L_{\min}}{L_{\max} - L_{\min}}\right] \qquad (1)$$

For each band, the reciprocal of the scaling factor, i.e., $g = (L_{\max} - L_{\min})/(2^p - 1)$, and the bias changed in sign, i.e., $o = L_{\min}$, which are generally different for each band, are placed in the file header as gain and offset metadata and are used to restore calibrated floating-point values from the DNs; the DNs are identical for the three formats, spectral radiance and the two reflectances, and only gains and offsets change. In some cases, the offsets are set equal to zero for all bands, including Pan, regardless of the actual minimum, $L_{\min}$, which implies that the minimum DN may be greater than zero. This strategy is generally pursued when the wordlength of the packed DN is 11 bits or more. In the following of this study, we will show that such a choice is highly beneficial for the reproducibility of pansharpening methods and quality indexes.
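As an illustration, the packing of Equation (1) and the restoration of calibrated values can be sketched in Python; the function names and the use of NumPy are our own illustrative choices, not part of any product specification:

```python
import numpy as np

def pack_to_dn(radiance, p=11, zero_offset=False):
    """Pack floating-point spectral radiance into p-bit DNs (Equation (1)).

    Returns the DN array plus the gain/offset metadata needed to restore
    calibrated values: radiance ~= gain * DN + offset (Equation (2a)).
    """
    lmax = float(radiance.max())
    lmin = 0.0 if zero_offset else float(radiance.min())
    scale = (2 ** p - 1) / (lmax - lmin)   # DN counts per unit of radiance
    dn = np.round((radiance - lmin) * scale).astype(np.uint16)
    return dn, 1.0 / scale, lmin           # gain = 1/scale, offset = lmin

def unpack_from_dn(dn, gain, offset):
    """Restore floating-point radiance from packed DNs (Equation (2a))."""
    return gain * dn.astype(np.float64) + offset
```

With `zero_offset=True` the minimum DN may be greater than zero, which is the zero-offset strategy discussed above for wordlengths of 11 bits or more.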
We wish to remark that the packaging of floating-point data into DNs does not penalize the original precision of the calibrated data, which are obtained starting from the integer samples produced by the on-board analog-to-digital converter (ADC). Thus, the calibrated samples exhibit a finite number of floating-point values, which can be accommodated in a DN of suitable wordlength, comparable to that of the ADC; generally one bit less, because the ADC span is designed to encompass the dark signal, which is removed before calibration, and to leave an allowance to prevent saturation.
Eventually, for the kth spectral channel, the following relationships hold between the floating-point calibrated data (the TOA spectral radiance, $L_k$, the TOA reflectance, $\rho_k$, and the surface reflectance, $\rho_k^{\mathrm{BOA}}$) and the packed fixed-point DN format that is distributed:

$$L_k = g_k \cdot \mathrm{DN}_k + o_k \qquad (2a)$$
$$\rho_k = g_k^{\rho} \cdot \mathrm{DN}_k + o_k^{\rho} \qquad (2b)$$
$$\rho_k^{\mathrm{BOA}} = g_k^{\mathrm{BOA}} \cdot \mathrm{DN}_k + o_k^{\mathrm{BOA}} \qquad (2c)$$

in which the $\mathrm{DN}_k$s are the same for the three formats, while the gains, $g_k$, and the offsets, $o_k$, are constant over the scene and variable from one band to another, including Pan. In the case of TOA reflectance, $g_k^{\rho}$ and $o_k^{\rho}$ are equal to $g_k$ and $o_k$ divided by the solar irradiance, which cannot be assumed spatially constant if the scene is large. Therefore, $g_k^{\rho}$ and $o_k^{\rho}$ are constant not over the whole scene, but over the sub-scenes of a partition, e.g., in square blocks. Analogous considerations hold for $g_k^{\mathrm{BOA}}$ and $o_k^{\mathrm{BOA}}$. Without loss of generality, hereafter we will consider only the TOA spectral radiance in Equation (2a), which will simply be referred to as spectral radiance (SR).
Figure 1a,b show the meaning of $g_k$ and $o_k$: $g_k$ is the slope and $o_k$ the intercept of the inverse trans-characteristic that maps DN counts onto SR values. For the direct conversion in Figure 1a, the slope is $1/g_k$ and the intercept is $-o_k/g_k$. So, if gains and offsets are different for the bands of an MS image, including Pan, Figure 1c shows that the same DN value is mapped onto different values of SR depending on the band. Thus, there is a spectral alteration, a sort of miscalibration, occurring if the DN counts are not converted back to physical units before their use. As clearly appears from Figure 1c, such a miscalibration vanishes if gains and offsets are identical for each band, including Pan.
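The spectral alteration of Figure 1c can be verified numerically; the radiance, gain, and offset values below are hypothetical:

```python
# Two bands of one pixel, in spectral radiance units (hypothetical values).
l1, l2 = 60.0, 30.0                 # true spectral ratio l1/l2 = 2.0

# Different per-band gains/offsets, as in Figure 1c.
g1, o1 = 0.05, 10.0
g2, o2 = 0.02, 5.0
dn1 = (l1 - o1) / g1                # 1000 counts
dn2 = (l2 - o2) / g2                # 1250 counts
ratio_dn = dn1 / dn2                # 0.8: the spectral content is altered

# Identical gains and zero offsets preserve the ratio exactly.
g = 0.05
ratio_equal = (l1 / g) / (l2 / g)   # 2.0
```

The band ratio computed on DNs (0.8) bears no resemblance to the true spectral ratio (2.0), whereas a single gain with zero offsets leaves it untouched.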
3. Basics of CS and MRA Pansharpening
Classical pansharpening methods can be divided into CS, MRA, and hybrid methods. The unique difference between CS and MRA lies in the way the Pan details are extracted, either by processing the stack of bands along the spectral direction or in the spatial domain. Hybrid methods, e.g., [33,34], are the cascade of CS and MRA, either CS followed by MRA or, more seldom, MRA followed by CS. In the former case, they are equivalent to MRA that inherits the injection model from the CS; in the latter case, they behave like CS, with the injection model borrowed from MRA [8]. Therefore, at least for what concerns the present analysis, hybrid methods are not a third class with specific properties. The notation used in this paper is presented first; a brief review of CS and MRA follows.
3.1. Notation
The mathematical notation used is detailed in the following. Vectors are indicated in bold lowercase (e.g., $\mathbf{v}$), with the ith element indicated as $v_i$. 2-D and 3-D arrays are denoted in bold uppercase (e.g., $\mathbf{M}$). An MS image, $\mathbf{M} = \{\mathbf{M}_k\}_{k=1,\dots,N}$, is a 3-D array composed of $N$ spectral bands indexed by the subscript $k$; accordingly, $\mathbf{M}_k$ indicates the kth spectral band of $\mathbf{M}$. The Pan image is a 2-D matrix indicated as $\mathbf{P}$. The interpolated and pansharpened MS bands are denoted as $\widetilde{\mathbf{M}}_k$ and $\widehat{\mathbf{M}}_k$, respectively. Unlike the conventional matrix product and ratio, such operations are intended element-wise, that is, between terms at the same positions within the arrays.
3.2. CS
The class of CS, or spectral, methods is based on the projection of the MS image into another vector space, under the assumption that the forward transformation splits the spatial structure and the spectral diversity into separate components. The problem may be stated as finding the color space most suitable for fusion [35].
Under the hypothesis of substitution of a single component that is a linear combination of the input bands, the fusion process can be obtained without the explicit calculation of the forward and backward transformations, but through a proper injection scheme, thereby leading to the fast implementations of CS methods, whose general formulation is [1]:

$$\widehat{\mathbf{M}}_k = \widetilde{\mathbf{M}}_k + \mathbf{G}_k \cdot (\mathbf{P} - \mathbf{I}_{\mathrm{L}}), \quad k = 1, \dots, N \qquad (3)$$

in which $k$ is the band index, $\mathbf{G} = \{\mathbf{G}_k\}_{k=1,\dots,N}$ the 3-D array of injection gains, which in principle may be different for each pixel and each band, while the intensity, $\mathbf{I}_{\mathrm{L}}$, is defined as

$$\mathbf{I}_{\mathrm{L}} = \sum_{i=1}^{N} w_i \,\widetilde{\mathbf{M}}_i \qquad (4)$$

in which the weight vector $\mathbf{w} = [w_1, \dots, w_N]$ is the 1-D array of spectral weights, corresponding to the first row of the forward transformation matrix [1]. The term $\mathbf{P}$ is histogram-matched to $\mathbf{I}_{\mathrm{L}}$:

$$\mathbf{P} \leftarrow \left(\mathbf{P} - \mu_{\mathbf{P}_{\mathrm{L}}}\right)\frac{\sigma_{\mathbf{I}_{\mathrm{L}}}}{\sigma_{\mathbf{P}_{\mathrm{L}}}} + \mu_{\mathbf{I}_{\mathrm{L}}} \qquad (5)$$

in which $\mu$ and $\sigma$ denote the mean and the square root of the variance, respectively, and $\mathbf{P}_{\mathrm{L}}$ is a lowpass version of $\mathbf{P}$ having the same spatial frequency content as $\mathbf{I}_{\mathrm{L}}$ [8].
Figure 2 shows the typical flowchart employed by CS methods.
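The CS flow of Equations (3), (4), and (5) can be sketched as follows; this is a minimal sketch in which the box filter is a crude stand-in for a filter matched to the MS spatial frequency content, and all function names are illustrative:

```python
import numpy as np

def box_lowpass(img, size=5):
    """Crude separable box filter with reflected borders (illustrative)."""
    r = size // 2
    k = np.ones(size) / size
    pad = np.pad(img, r, mode='reflect')
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode='valid'), 1, pad)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode='valid'), 0, rows)

def cs_fusion(ms_up, pan, weights, gains):
    """General CS pansharpening, Equations (3)-(5) -- a minimal sketch.

    ms_up   : (N, H, W) interpolated MS bands
    pan     : (H, W) Pan image
    weights : (N,) spectral weights defining the intensity (Equation (4))
    gains   : (N,) per-band injection gains
    """
    intensity = np.tensordot(weights, ms_up, axes=1)          # Equation (4)
    pan_low = box_lowpass(pan)
    # Equation (5): match Pan so its lowpass version mimics the intensity.
    pan_hm = (pan - pan_low.mean()) * intensity.std() / pan_low.std() \
             + intensity.mean()
    # Equation (3): inject the extracted details with per-band gains.
    return ms_up + gains[:, None, None] * (pan_hm - intensity)
```

Setting all gains to one yields the GIHS fast implementation mentioned below.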
The simplest CS fusion method is the intensity-hue-saturation (IHS) [1], or rather its generalization to an arbitrary number of bands, GIHS, which allows a fast implementation, given by Equation (3) with unitary injection gains, $\mathbf{G}_k = \mathbf{1}$. The multiplicative or contrast-based injection model is a special case of Equation (3), in which the space-varying injection gains, $\mathbf{G}_k$, are defined such that

$$\mathbf{G}_k = \frac{\widetilde{\mathbf{M}}_k}{\mathbf{I}_{\mathrm{L}}}, \quad k = 1, \dots, N. \qquad (6)$$

The resulting pansharpening method is described by

$$\widehat{\mathbf{M}}_k = \widetilde{\mathbf{M}}_k \cdot \frac{\mathbf{P}}{\mathbf{I}_{\mathrm{L}}}, \quad k = 1, \dots, N \qquad (7)$$

which, in the case of spectral weights all equal to $1/N$, is the widely known Brovey transform (BT) pansharpening method [1,36].
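A minimal sketch of the BT method of Equation (7), assuming `pan` has already been histogram-matched and with `eps` (an illustrative guard of ours) added only to avoid division by zero:

```python
import numpy as np

def brovey(ms_up, pan, eps=1e-12):
    """Brovey transform pansharpening (Equation (7)) -- a minimal sketch.

    ms_up : (N, H, W) MS bands interpolated to the Pan grid
    pan   : (H, W) Pan image, assumed histogram-matched to the intensity
    """
    intensity = ms_up.mean(axis=0)            # Equation (4) with w_i = 1/N
    return ms_up * (pan / (intensity + eps))  # multiplicative injection
```

Since every band is multiplied by the same per-pixel factor $\mathbf{P}/\mathbf{I}_{\mathrm{L}}$, the per-pixel band ratios, and hence the spectral angle, are preserved.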
In Gram–Schmidt (GS) [37] spectral sharpening, the fusion process is described by Equation (3), where the injection gains are spatially uniform for each band, and thus denoted as $g_k$. They are given by [8]:

$$g_k = \frac{\operatorname{cov}(\widetilde{\mathbf{M}}_k, \mathbf{I}_{\mathrm{L}})}{\operatorname{var}(\mathbf{I}_{\mathrm{L}})}, \quad k = 1, \dots, N \qquad (8)$$

in which $\operatorname{cov}(\widetilde{\mathbf{M}}_k, \mathbf{I}_{\mathrm{L}})$ indicates the covariance between $\widetilde{\mathbf{M}}_k$ and $\mathbf{I}_{\mathrm{L}}$, and $\operatorname{var}(\mathbf{I}_{\mathrm{L}})$ is the variance of $\mathbf{I}_{\mathrm{L}}$. A multivariate linear regression has been exploited to model the relationship between the lowpass-filtered Pan, $\mathbf{P}_{\mathrm{L}}$, and the interpolated MS bands [1,18]:

$$\mathbf{P}_{\mathrm{L}} = \hat{w}_0 + \sum_{i=1}^{N} \hat{w}_i \,\widetilde{\mathbf{M}}_i + \boldsymbol{\epsilon} \triangleq \hat{\mathbf{I}}_{\mathrm{L}} + \boldsymbol{\epsilon} \qquad (9)$$

in which $\hat{\mathbf{I}}_{\mathrm{L}}$ is the optimal intensity component and $\boldsymbol{\epsilon}$ the least squares (LS) space-varying residue. The set of space-constant optimal weights, $\{\hat{w}_i\}_{i=1,\dots,N}$, and $\hat{w}_0$, is calculated as the minimum MSE (MMSE) solution of Equation (9). A figure of merit of the matching achieved by the MMSE solution is given by the coefficient of determination (CD), namely $R^2$, defined as

$$R^2 = 1 - \frac{\sigma_{\boldsymbol{\epsilon}}^2}{\sigma_{\mathbf{P}_{\mathrm{L}}}^2} \qquad (10)$$

in which $\sigma_{\boldsymbol{\epsilon}}^2$ and $\sigma_{\mathbf{P}_{\mathrm{L}}}^2$ denote the variances of the (zero-mean) LS residue, $\boldsymbol{\epsilon}$, and of the lowpass-filtered Pan image, respectively. Histogram matching of Pan to the MMSE intensity component, $\hat{\mathbf{I}}_{\mathrm{L}}$, should take into account that $\sigma_{\hat{\mathbf{I}}_{\mathrm{L}}}^2 = \sigma_{\mathbf{P}_{\mathrm{L}}}^2 - \sigma_{\boldsymbol{\epsilon}}^2$, from Equation (9). Thus, from the definition of CD in Equation (10),

$$\mathbf{P} \leftarrow \left(\mathbf{P} - \mu_{\mathbf{P}_{\mathrm{L}}}\right)\sqrt{R^2} + \mu_{\hat{\mathbf{I}}_{\mathrm{L}}}. \qquad (11)$$
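The regression of Equation (9) and the CD of Equation (10) reduce to an ordinary least-squares fit; a minimal sketch, with helper name and NumPy usage being illustrative choices:

```python
import numpy as np

def mmse_intensity(ms_up, pan_low):
    """Regression-based intensity (Equation (9)) and CD (Equation (10)).

    Solves pan_low ~= w0 + sum_i w_i * ms_up[i] in the least-squares sense.
    ms_up   : (N, H, W) interpolated MS bands
    pan_low : (H, W) lowpass-filtered Pan
    """
    n, h, w = ms_up.shape
    # Design matrix: a constant column for w0, then one column per band.
    A = np.column_stack([np.ones(h * w)] + [b.ravel() for b in ms_up])
    coeffs, *_ = np.linalg.lstsq(A, pan_low.ravel(), rcond=None)
    intensity = (A @ coeffs).reshape(h, w)     # optimal intensity I_hat
    resid = pan_low - intensity                # zero-mean LS residue
    cd = 1.0 - resid.var() / pan_low.var()     # coefficient of determination
    return intensity, coeffs, cd
```

Because the weights are refit to whatever values the bands take, a per-band affine change of the input (such as the gains and offsets of the DN packing) is absorbed by the regression, which is the reason why CS methods built on this intensity are format-invariant, as discussed in this work.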
This brief review does not include the popular CS fusion method employing principal component analysis (PCA). The reason is that PCA is a particular case of the more general GS transformation, in which $\mathbf{I}_{\mathrm{L}}$ is equal to the maximum-variance first principal component, PC1, and the injection gains are those of GS, as in Equation (8) [1].
3.3. MRA
The spatial approach relies on the injection of the highpass spatial details of Pan into the resampled MS bands. The most general MRA-based fusion may be stated as:

$$\widehat{\mathbf{M}}_k = \widetilde{\mathbf{M}}_k + \mathbf{G}_k \cdot (\mathbf{P} - \mathbf{P}_{\mathrm{L}}), \quad k = 1, \dots, N \qquad (12)$$

in which the Pan image is preliminarily histogram-matched to the interpolated kth MS band [8]:

$$\mathbf{P} \leftarrow \left(\mathbf{P} - \mu_{\mathbf{P}_{\mathrm{L}}}\right)\frac{\sigma_{\widetilde{\mathbf{M}}_k}}{\sigma_{\mathbf{P}_{\mathrm{L}}}} + \mu_{\widetilde{\mathbf{M}}_k} \qquad (13)$$

and $\mathbf{P}_{\mathrm{L}}$ is the lowpass-filtered version of $\mathbf{P}$. It is noteworthy that, according to either of Equations (5) and (13), histogram matching of $\mathbf{P}$ always implies the calculation of its lowpass version $\mathbf{P}_{\mathrm{L}}$.
According to Equation (12), the different approaches and methods belonging to this class are uniquely characterized by the lowpass filter employed for obtaining the image $\mathbf{P}_{\mathrm{L}}$, by the presence or absence of a decimator/interpolator pair [38], and by the set of injection gains, either spatially uniform, $g_k$, or space-varying, $\mathbf{G}_k$. The contrast-based version of MRA pansharpening is

$$\widehat{\mathbf{M}}_k = \widetilde{\mathbf{M}}_k \cdot \frac{\mathbf{P}}{\mathbf{P}_{\mathrm{L}}}, \quad k = 1, \dots, N. \qquad (14)$$

It is noteworthy that, unlike what happens for Equation (7), Equation (14) does not preserve the spectral angle of $\widetilde{\mathbf{M}}_k$, because the multiplicative sharpening term depends on $k$ through Equation (13). Eventually, the projective injection gains derived from GS (Equation (8)) can be extended to MRA fusion as

$$g_k = \frac{\operatorname{cov}(\widetilde{\mathbf{M}}_k, \mathbf{P}_{\mathrm{L}})}{\operatorname{var}(\mathbf{P}_{\mathrm{L}})}, \quad k = 1, \dots, N \qquad (15)$$

whose space-varying version, $\mathbf{G}_k$, with statistics calculated locally on a sliding window, coupled with a pyramid MRA, constitutes a popular pansharpening method known as GLP-CBD [1,39].
Figure 3 shows the general flowchart of spatial, a.k.a. MRA, pansharpening methods.
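The MRA scheme of Equations (12) and (13) can be sketched as follows; a crude box filter stands in for the MTF-matched filter and the injection gain is a single scalar, both assumptions of this sketch:

```python
import numpy as np

def box_lowpass(img, size=5):
    """Crude box filter with reflected borders, standing in for the MTF filter."""
    r = size // 2
    k = np.ones(size) / size
    pad = np.pad(img, r, mode='reflect')
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode='valid'), 1, pad)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode='valid'), 0, rows)

def mra_fusion(ms_up, pan, gain=1.0):
    """Additive MRA pansharpening (Equation (12)) -- a minimal sketch."""
    pan_low = box_lowpass(pan)
    fused = np.empty_like(ms_up)
    for k, band in enumerate(ms_up):
        # Equation (13): histogram-match Pan to the kth interpolated band.
        p_k = (pan - pan_low.mean()) * band.std() / pan_low.std() + band.mean()
        # Inject only the highpass details of the matched Pan.
        fused[k] = band + gain * (p_k - box_lowpass(p_k))
    return fused
```

Since both the histogram matching and the lowpass filtering commute with affine rescalings, rescaling the inputs by gains and offsets rescales the output in exactly the same way: this is the format invariance of MRA methods discussed in this work.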
7. Concluding Remarks
In this work, we have investigated how the performance of a pansharpening method depends on the input data format, either the packed DNs of Equation (1) or SR, as well as any other floating-point calibrated format. It is theoretically proven and experimentally demonstrated that MRA methods are unaffected by the data format, which instead is crucial for CS methods, unless their intensity component is calculated by means of a multivariate linear regression between the interpolated bands and the lowpass-filtered Pan, as is accomplished by the most advanced CS methods. For CS fusion methods that do not feature a regression-based intensity calculation, results are better whenever they are obtained from floating-point data. For this reason, whenever the data that are merged come from different platforms [44] and/or are related to different intervals of the electromagnetic spectrum [45], the use of floating-point formats is highly recommended. This study has also demonstrated that normalized spectral similarity indexes for multiband images are necessarily non-reproducible across data formats: if an index yields the same values when calculated from data in packed fixed-point and floating-point formats, it cannot truly be a spectral index. Conversely, normalized spatial/radiometric indexes are exactly those that do not depend on the format of the data that are compared, at least if the band offsets are all zero. Thus, spectral similarity should be measured on data represented in physical units, e.g., SR. The use of packed DNs may lead to a misestimation of quality, because DN data are spectrally altered, and hence miscalibrated, if the gains are not equal to one another and the offsets are nonzero.
A viable escape to retain the advantages of fixed-point processing, mandatory for most dedicated hardware implementations, and to avoid the drawbacks of the spectral distortion originated by the packaging of floating-point calibrated data in Equation (1), could be:
1. convert the available packed DNs to floating-point SR, or any other physical format, by using Equation (2a);
2. convert the SR data obtained at the previous step back to DNs by using Equation (1), in which $L_{\max}$ is not the maximum of the individual band of the scene, but of the whole scene, and $L_{\min} = 0$.
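The second step above can be sketched as follows; the function name and the use of NumPy are illustrative:

```python
import numpy as np

def repack_scene(bands_sr, p=11):
    """Repack SR bands into p-bit DNs with one scene-wide gain and zero offsets.

    bands_sr : sequence of per-band SR images, already restored via Equation (2a)
    """
    scene_max = max(float(b.max()) for b in bands_sr)  # L_max of the WHOLE scene
    gain = scene_max / (2 ** p - 1)                    # identical for every band
    dns = [np.round(b / gain).astype(np.uint16) for b in bands_sr]
    return dns, gain                                   # offsets are all zero
```

Only the band containing the scene maximum reaches the top DN; the others use a smaller portion of the dynamic range, which is the price paid for identical gains and zero offsets.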
Thus, it is easily verified that the gains $g_k$ are identical to one another and the offsets $o_k$ are all zero. Hence, it turns out that the distortion factor SIF%, Equation (32), is identically zero. As a final remark, notwithstanding that remote sensing data fusion by means of extremely sophisticated machine learning (ML) tools, e.g., [46], has nowadays reached the state of the art [2], the behavior of such methods can hardly be modeled, and hence predicted, because it is based on the outcome of a complex learning process and not on simple constitutive relations. Consequently, it is not easy to prove or foresee whether an ML fusion method trained on packed DN data gives the same results as the same method trained on floating-point spectral radiance data. Only an experimental analysis, as carried out in Section 6, would be feasible for ML-based pansharpening methods, but in this case the analysis and its outcome would not be reproducible, thereby invalidating its motivations. Therefore, it is recommended that all training data be used in floating-point format and that fusion be accomplished and assessed in the same format.