1. Introduction
The availability of image data with spectral diversity (visible, near infrared, short wave infrared, thermal infrared, X- and C-band microwaves with related polarizations) and complementary spectral-spatial resolution, together with the peculiar characteristics of each image set, has fostered the development of fusion techniques specifically tailored to remotely sensed images of the Earth. Fusion aims at producing added value with respect to that separately available from the individual datasets. Though the results of fusion are most often analyzed by human experts to solve specific tasks (detection of landslides, flooded and burned areas, just to mention a few examples), partially supervised and also fully automated systems, most notably thematic classifiers, have started benefiting from fused images instead of separate datasets.
Extensive research on remote sensing image fusion for Earth observation has been carried out over the last decade, and a remarkable number of algorithms have been developed [1]. Image fusion techniques can be classified according to different criteria. One of the most common ways to differentiate fusion algorithms is based on sensor homogeneity. The term homogeneous image fusion refers to the case in which the images to be merged are produced by sensors exploiting the same imaging mechanism. This category is also called unimodal image fusion. In remote sensing for Earth observation, the fusion of panchromatic and multispectral (MS) images, also known as pansharpening, is a typical example of homogeneous image fusion. The images subject to fusion are the outcome of measurements of the reflected solar radiation of the scene, even though they refer to different wavelengths and are characterized by different information contents, also in terms of spatial resolution. On the other hand, the fusion of heterogeneous data, or multimodal image fusion, refers to those cases in which the data to be merged come from sensors that do not share the same imaging mechanism.
An additional way to discriminate among fusion techniques is based on the content level at which fusion is performed, i.e., pixel level, feature level, and decision level [1]. Pixel-level image fusion directly combines the pixels of the involved images in order to produce a new image, whereas feature-level fusion aims to combine specific features or descriptors extracted from the images to be merged. The extraction of the features can be performed either simultaneously on all the images or separately on each image. As an example, for the fusion of optical and Synthetic Aperture Radar (SAR) images [2], a direct merge of the two datasets is not recommended, to prevent contamination of the fusion product with the low signal-to-noise ratio (SNR) of SAR data. In this case, features extracted from the SAR image, either texture and spatial heterogeneity [3] or the temporal coherence of the scene derived from geocoded multilooked products [4], can be transplanted into the optical image, thereby alleviating the stringent requirement of co-registration between the two datasets typical of pixel-level fusion. Decision-level fusion is the combination of the classification results achieved either from each dataset separately or from multiple algorithms on the same dataset. In this case, the fusion output is a classification map [5].
Among pixel-based remote-sensing image-fusion techniques, panchromatic (Pan) sharpening, or pansharpening, of multispectral (MS) images is receiving ever increasing attention [1,6]. Pansharpening takes advantage of the complementary characteristics of the spatial and spectral resolutions of MS and Pan data, which originate from physical constraints on the SNR of broad and narrow bands [7]. The goal is the synthesis of a unique product that exhibits as many spectral bands as the original MS image, each with the same spatial resolution as the Pan image. After the MS bands have been interpolated and co-registered to the Pan image [8], spatial details are extracted from Pan and added to the MS bands according to a predefined injection model. The detail extraction step may follow the spectral approach, originally known as component substitution (CS), or the spatial approach, which may rely on multiresolution analysis (MRA), either separable or not [9]. In the spectral approach, the detail is the difference between the sharp Pan image and a smooth intensity component generated as a combination of the interpolated MS bands. In the spatial approach, the detail is the difference between the original Pan image and a version of it smoothed by a proper lowpass filter, so as to retain the same spatial frequency content as the MS bands. The dual classes of spectral and spatial methods exhibit complementary features in terms of tolerance to spatial and spectral impairments, respectively [10,11].
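To make the distinction between the two approaches concrete, the following sketch outlines both forms of detail extraction in Python/NumPy. It is only an illustration, not the implementation adopted in this work: the array names are hypothetical, equal weights are assumed for the CS intensity, and a simple box filter stands in for a filter matched to the MS sensor response.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def cs_detail(pan, ms_interp, weights):
    """Spectral (CS) approach: detail = Pan minus an intensity component
    obtained as a weighted combination of the interpolated MS bands."""
    intensity = np.tensordot(weights, ms_interp, axes=1)   # (H, W)
    return pan - intensity

def mra_detail(pan, ratio=4):
    """Spatial (MRA) approach: detail = Pan minus its lowpass version;
    the box filter is a crude stand-in for an MTF-matched lowpass filter."""
    pan_low = uniform_filter(pan, size=ratio)
    return pan - pan_low

# Toy data: four interpolated MS bands and a Pan image of the same size.
rng = np.random.default_rng(0)
ms_interp = rng.random((4, 64, 64))
pan = rng.random((64, 64))
detail_cs = cs_detail(pan, ms_interp, weights=np.full(4, 0.25))
detail_mra = mra_detail(pan, ratio=4)
```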
The Pan image is preliminarily histogram-matched, that is, radiometrically transformed by a constant gain and offset, in such a way that its lowpass version exhibits the same mean and variance as the component that will be replaced [12]. The injection model rules the combination of the lowpass MS image with the spatial detail of Pan. Such a model is stated between each of the resampled MS bands and a lowpass version of the Pan image having the same spatial frequency content as the MS bands; a contextual adaptivity is generally beneficial [13]. The multiplicative, or contrast-based, injection model with haze correction [14,15] is the key to improving the fusion performance by exploiting the imaging mechanism through the atmosphere [16]. The injection model, which can rely on the most disparate criteria [17], is crucial for multimodal fusion, where the enhancing and enhanced datasets are produced by different physical imaging mechanisms, as in thermal sharpening [18]. The basic classification into CS and MRA has been progressively extended by considering several other methods that have been developed recently [6], such as those based on Bayesian inference [19], total variation (TV) regularization [20] and sparse representations [21]. More recently, machine learning paradigms have been introduced, ranging from the pioneering study on pansharpening based on convolutional neural networks (CNN) [22] up to extremely sophisticated architectures, such as generative adversarial networks (GAN) [23]. It is noteworthy that, at least for methods based on learning concepts, histogram matching and detail-injection modeling are learned from the training data and implicitly performed by the network, without any control from the user. GAN architectures, however, are able to control one another, and thus they are invaluable, e.g., for multimodal fusion [24].
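As an illustration of the histogram matching recalled at the beginning of this paragraph, a minimal sketch is given below; the array names are hypothetical, with pan_low denoting the Pan image lowpass-filtered to the MS spatial frequency content and intensity the component to be replaced.

```python
import numpy as np

def histogram_match(pan, pan_low, intensity):
    """Gain/offset transformation of Pan such that its lowpass version
    exhibits the same mean and variance as the component to be replaced."""
    gain = intensity.std() / pan_low.std()
    offset = intensity.mean() - gain * pan_low.mean()
    return gain * pan + offset
```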
This work deals with the use of nonlinear intensity components in spectral MS pansharpening methods. In fact, while CS methods, whose intensity components are linear combinations of the input bands, have been extensively investigated [1], nonlinear intensities have seldom been considered in the literature [25,26,27]. The hyperspherical color space (HCS) fusion technique [25] is perhaps the most widely known example. Analogously to the linear case, we propose a multivariate linear regression between the interpolated MS bands and the lowpass-filtered Pan. This time, however, the MS and Pan values are squared before the regression is calculated. Hence, the intensity component no longer lies on a hyperplane in the vector space of the MS samples, as in the linear case, but on a hyper-ellipsoid. The proposed nonlinear intensity component is used in conjunction with the multiplicative injection model. Hence, the de-hazing procedure is extended to the nonlinear intensity.
In the experimental setup, GeoEye-1 and WorldView-3 data in spectral radiance format are pansharpened by several state-of-the-art and up-to-date methods, whose implementations are available in the Pansharpening Toolbox, originally conceived in [28]. The proposed method outperforms all the benchmarks on both datasets, for all quality indexes at reduced resolution; in particular, it outperforms its counterpart with linear intensity, referred to as the Brovey transform with haze correction (BT-H) [15].
The remainder of this article is organized as follows. Section 2 provides the essential basics of pansharpening. Section 3 introduces the nonlinear intensity and describes the novel method. Section 4 is devoted to haze estimation of the MS bands. Section 5 summarizes the adopted quality criteria and the related distortion indexes. Section 6 describes the two datasets and reports simulations and comparisons. Concluding remarks are presented in Section 7.
3. Pansharpening Based on Nonlinear Intensity Components
The critical review of the baseline HCS [25] is based on the subsequent study by Tu et al. [26], who highlighted the advantages and limitations of the HCS approach. The first idea was to use the HCS transformation as an alternative to the intensity-hue-saturation (IHS) transformation, which had already been generalized to an arbitrary number of bands [1,26], as GIHS. Unfortunately, IHS features a unitary detail-injection model, which is generally poorer than the projection model of GS and the multiplicative model of the Brovey transform (BT) [1,34]. Therefore, in the subsequent publication [26], a fast multiplicative version of HCS was proposed. It is fast because it is no longer necessary to calculate the direct and inverse hyperspherical transforms, but only the radius, which is used as the intensity component of BT.
The (fast) HCS fusion [26] is given by Equation (5), with the MMSE intensity replaced by the HCS intensity $I_{\mathrm{HCS}}$, the radius of the $N$-dimensional hypersphere spanned by the interpolated MS bands $\widetilde{M}_k$, $k = 1, \dots, N$:

$I_{\mathrm{HCS}} = \sqrt{\sum_{k=1}^{N} \widetilde{M}_k^{2}}$
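An illustrative sketch of the fast HCS fusion follows, under the assumption that pan has already been histogram-matched to the intensity; the array names are hypothetical and the code is only meant to show how the radius plays the role of the intensity in a multiplicative (BT-like) combination.

```python
import numpy as np

def fast_hcs_fusion(ms_interp, pan, eps=1e-12):
    """ms_interp: (N, H, W) interpolated MS bands; pan: (H, W) Pan image
    histogram-matched to the intensity."""
    intensity = np.sqrt(np.sum(ms_interp ** 2, axis=0))   # hyperspherical radius
    gain = pan / (intensity + eps)                        # multiplicative (contrast-based) model
    return ms_interp * gain                               # detail modulation broadcast over bands
```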
The original contribution of the present study is to generalize the multivariate regression of Equation (7) to the case of Euclidean distance, as in Equation (14). The result is a new nonlinear intensity component, given by the weighted RMS value of the interpolated MS bands:

$I_{\mathrm{HECS}} = \sqrt{\sum_{k=1}^{N} \hat{w}_k \widetilde{M}_k^{2} + \hat{b}}$

in which the set of $N$ spectral weights, $\{\hat{w}_k\}_{k=1,\dots,N}$, and the bias $\hat{b}$ are found as the LS solution of the linear regression between squared MS and squared lowpass-filtered Pan:

$P_L^{2} = \sum_{k=1}^{N} w_k \widetilde{M}_k^{2} + b + \varepsilon$
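A possible way to compute the nonlinear intensity is sketched below (illustrative only, with the regression on the squared values solved by ordinary least squares); ms_interp and pan_low are hypothetical arrays holding the interpolated MS bands and the Pan image lowpass-filtered to the MS scale.

```python
import numpy as np

def hecs_intensity(ms_interp, pan_low):
    """LS fit of squared lowpass Pan against squared MS bands, followed by
    the weighted RMS intensity of the hyper-ellipsoidal color space."""
    n_bands = ms_interp.shape[0]
    X = ms_interp.reshape(n_bands, -1).T ** 2            # (pixels, N) squared MS samples
    X = np.column_stack([X, np.ones(X.shape[0])])        # extra column for the bias
    y = pan_low.reshape(-1) ** 2                          # squared lowpass-filtered Pan
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    weights, bias = coeffs[:n_bands], coeffs[-1]
    i2 = np.tensordot(weights, ms_interp ** 2, axes=1) + bias
    return np.sqrt(np.clip(i2, 0.0, None)), weights, bias
```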
In the case of three bands, the color spaces of contrast-based fusion methods with linear and nonlinear intensities, calculated with and without regression, are displayed in Figure 1a–d. Notice that the linear intensity with prefixed equal weights of BT defines an equilateral triangle, as the intersection of a plane with the first octant of the Euclidean space; the linear intensity with LS weights of BT-H [15] generally yields a scalene triangle. Conversely, whereas the color space of HCS is the section of a spherical surface lying in the first octant, the proposed method yields an ellipsoidal section. In fact, Equation (15) defines a hyper-ellipsoid, a generalization of the hypersphere in Figure 1c, when the weights may no longer be equal to one another; hence the name hyper-ellipsoidal color space (HECS) fusion.
The proposed scheme includes de-hazing, which is highly beneficial for fusion methods with a multiplicative detail-injection model [15,35]. Hence, the formulation of the proposed HECS pansharpening fusion is

$\hat{M}_k = \left(\widetilde{M}_k - h_k\right) \dfrac{P - h_I}{I_{\mathrm{HECS}} - h_I} + h_k, \qquad k = 1, \dots, N$

in which the path radiance, or haze, of the synthetic intensity, $h_I$, is given by the weighted RMS value of the individual path radiances $h_k$, $k = 1, \dots, N$, of the MS bands:

$h_I = \sqrt{\sum_{k=1}^{N} \hat{w}_k h_k^{2}}$

Equation (17) shows that the spectral pixel vector is translated by the haze vector before the multiplicative fusion is accomplished and the fused pixel vector is translated back by the same haze vector.
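Putting the pieces together, the haze-corrected multiplicative fusion could be sketched as follows; haze is a hypothetical vector of per-band path radiances (their estimation is addressed in the next section), while intensity and weights follow the previous sketch and pan is assumed to be histogram-matched.

```python
import numpy as np

def hecs_fusion(ms_interp, pan, intensity, weights, haze, eps=1e-12):
    """Contrast-based fusion with de-hazing: translate by the haze vector,
    apply the multiplicative injection, then translate back."""
    h_i = np.sqrt(np.sum(weights * haze ** 2))        # haze of the synthetic intensity
    gain = (pan - h_i) / (intensity - h_i + eps)      # (H, W) modulation of the details
    haze_cube = haze[:, None, None]                   # per-band haze, broadcast over pixels
    return (ms_interp - haze_cube) * gain + haze_cube
```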
Figure 2 shows a flowchart describing the HECS fusion process. The estimation of the atmospheric path radiances of the individual MS bands will be tackled in the next section.
5. Quality Assessment
Quality evaluation of image fusion products has been, and still is, the object of extensive research. The problem is complicated by the fact that it may not be easy to formalize what “quality” means in the fusion process. In this regard, a protocol of assessment should have very clear objectives and possibly require a reference on which the comparison relies. Image fusion assessment is traditionally performed in two ways: (1) by means of visual inspection by a panel of investigators; and (2) through mathematical functions capable of objectively measuring or inferring the similarity of the fusion product to a reference target, which is always unavailable and often also undefined. Whereas the former relies on subjective human evaluations, which can be supported by some statistical indexes, such as entropy, contrast, gradient, and so on [24], the objective evaluation involves stringent and quantitative measures that involve both the original and the fused images and are possibly consistent with human visual perception. In the fusion of medical images, for instance, it is crucial to preserve the diagnostic characteristics of the original images within the fused image; thus, it is necessary to evaluate the result of the fusion process using objective parameters.
In remote sensing image fusion, especially in multispectral pansharpening applications, quality assessment is performed following Wald’s protocol [47], which essentially requires the fused image to satisfy three main properties:
- Consistency: the fused image, once spatially degraded to the original resolution, should be as close as possible to the original image;
- Synthesis: any low-resolution (LR) image fused by means of a high-resolution (HR) image should be as identical as possible to the ideal image that the corresponding sensor, if existent, would observe at the resolution of the HR image;
- Vector synthesis: the set of multispectral images fused by means of the HR image should be as identical as possible to the set of ideal images that the corresponding sensor, if existent, would observe at the spatial resolution of the HR image.
The consistency property is usually easier to assess, since the original LR image can be used as a reference; only the procedure of spatial degradation and the matching function need to be standardized. On the contrary, the synthesis property is harder to verify, since a reference is required. A viable shortcut stems from the assumption of scale invariance of the scene, that is, quality measures do not vary with the resolution at which the scene is imaged. This allows the quality to be measured at a resolution lower than the original one, for which a reference image is available.
More specifically, the process consists of spatially degrading both the enhancing and the enhanced datasets by a factor equal to the scale ratio between them and using the original LR image as the reference. Obviously, such an assumption is not always valid, especially when the degradation process does not mimic the actual sensor acquisition process. In the case of multimodal image fusion, the applicability of the synthesis properties of Wald’s protocol is questionable, since a multimodal fusion method aims at producing images in which the features coming from the different sensors should in principle both be present. If the imaging sensors exploit different physical mechanisms, e.g., reflectivity and emissivity in the case of fusion of optical and thermal data, the assumption that an “ideal” sensor producing the fused image could exist is unlikely, since such a sensor should be able to measure and integrate different physical phenomena at the same time.
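Under the scale-invariance assumption, the reduced-resolution synthesis check can be sketched as follows; the Gaussian filter is only a crude surrogate of the sensor MTF, and pansharpen denotes any fusion method under test (both are assumptions of this example, not prescriptions of the protocol).

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def reduced_resolution_inputs(ms, pan, ratio=4, sigma=1.0):
    """Degrade both datasets by the scale ratio, so that the original MS
    bands can serve as the reference for the fused product."""
    ms_low = np.stack([zoom(gaussian_filter(b, sigma), 1.0 / ratio, order=3) for b in ms])
    pan_low = zoom(gaussian_filter(pan, sigma), 1.0 / ratio, order=3)
    return ms_low, pan_low

# fused = pansharpen(ms_low, pan_low)   # any method under test
# scores = compare(fused, ms)           # e.g., SAM, ERGAS, Q2^n against the original MS
```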
In conclusion, notwithstanding the achievements over the years [48,49,50,51,52,53,54], quality assessment of pansharpened images is still an open problem, being inherently ill-posed. A further source of uncertainty, which has seldom been explicitly addressed [55,56], is that the measured quality may also depend on the data format.
5.1. Reduced-Resolution Assessment
The quality check often entails the shortcut of performing fusion with both MS and Pan datasets degraded to spatial resolutions lower than those of the originals, in order to use the non-degraded MS originals as quality references [57]. Here, some popular statistical similarity/dissimilarity indexes used in this study will be briefly reviewed.
5.1.1. SAM
The spectral angle mapper (SAM) was originally introduced for the discrimination of materials starting from their reflectance spectra [58]. Given two spectral vectors, $\mathbf{v}$ and $\hat{\mathbf{v}}$, both having $N$ components, in which $\mathbf{v} = \{v_1, v_2, \dots, v_N\}$ is the reference spectral pixel vector and $\hat{\mathbf{v}} = \{\hat{v}_1, \hat{v}_2, \dots, \hat{v}_N\}$ is the test spectral pixel vector, SAM denotes the absolute value of the spectral angle between the two vectors:

$\mathrm{SAM}(\mathbf{v}, \hat{\mathbf{v}}) = \arccos\left(\dfrac{\langle \mathbf{v}, \hat{\mathbf{v}} \rangle}{\lVert \mathbf{v} \rVert_2 \cdot \lVert \hat{\mathbf{v}} \rVert_2}\right)$

SAM is usually expressed in degrees and is equal to zero if the test vector is spectrally identical to the reference vector, i.e., the two vectors are parallel and may differ only by their moduli. A global spectral dissimilarity, or distortion, index is obtained by averaging Equation (19) over the scene.
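A straightforward implementation of the global SAM index might look as follows (Python/NumPy; the cube layout with bands along the first axis is an assumption of the example):

```python
import numpy as np

def sam_degrees(reference, test, eps=1e-12):
    """reference, test: (N, H, W) spectral cubes. Returns the per-pixel
    spectral angle in degrees and its average over the scene."""
    dot = np.sum(reference * test, axis=0)
    norms = np.linalg.norm(reference, axis=0) * np.linalg.norm(test, axis=0)
    angle = np.degrees(np.arccos(np.clip(dot / (norms + eps), -1.0, 1.0)))
    return angle, float(angle.mean())
```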
5.1.2. ERGAS
ERGAS, the French acronym for relative dimensionless global error in synthesis [59], is the cumulative normalized root mean square error (NRMSE) between the test and reference bands, multiplied by the Pan-to-MS scale ratio and expressed as a percentage:

$\mathrm{ERGAS} = 100\,\dfrac{d_h}{d_l}\sqrt{\dfrac{1}{N}\sum_{k=1}^{N}\left(\dfrac{\mathrm{RMSE}_k}{\mu_k}\right)^{2}}$

where $d_h/d_l$ is the ratio between the pixel sizes of Pan and MS, e.g., 1/4, $\mathrm{RMSE}_k$ is the root mean square error between the $k$th test and reference bands, $\mu_k$ is the mean (average) of the $k$th band of the reference, and $N$ is the number of bands. Low values of ERGAS indicate high similarity between fused and reference MS data.
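ERGAS lends itself to an equally compact sketch (same assumed cube layout as above; ratio is the Pan-to-MS pixel-size ratio, e.g., 1/4):

```python
import numpy as np

def ergas(reference, fused, ratio=0.25):
    """reference, fused: (N, H, W) cubes at the same scale."""
    rmse = np.sqrt(np.mean((reference - fused) ** 2, axis=(1, 2)))   # per-band RMSE
    mu = reference.mean(axis=(1, 2))                                 # per-band reference mean
    return 100.0 * ratio * np.sqrt(np.mean((rmse / mu) ** 2))
```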
5.1.3. Multivariate UIQI
$Q2^n$ is the multiband extension of the universal image quality index (UIQI) [60] and was introduced for the quality assessment of pansharpened MS images [61]. Each pixel of an image with $N$ spectral bands is accommodated into a hypercomplex (HC) number with one real part and $N-1$ imaginary parts. Let $\mathbf{z}$ and $\hat{\mathbf{z}}$ denote the HC representations of the reference and test spectral vectors at a given pixel. Analogously to UIQI, namely $Q2^0 = Q$, $Q2^n$ may be written as the product of three terms:

$Q2^n = \dfrac{\lvert \sigma_{\mathbf{z}\hat{\mathbf{z}}} \rvert}{\sigma_{\mathbf{z}}\,\sigma_{\hat{\mathbf{z}}}} \cdot \dfrac{2\,\sigma_{\mathbf{z}}\,\sigma_{\hat{\mathbf{z}}}}{\sigma_{\mathbf{z}}^{2} + \sigma_{\hat{\mathbf{z}}}^{2}} \cdot \dfrac{2\,\lvert \bar{\mathbf{z}} \rvert\,\lvert \bar{\hat{\mathbf{z}}} \rvert}{\lvert \bar{\mathbf{z}} \rvert^{2} + \lvert \bar{\hat{\mathbf{z}}} \rvert^{2}}$

the first of which is the modulus of the HC correlation coefficient (HCCC) between $\mathbf{z}$ and $\hat{\mathbf{z}}$. The second and third terms, respectively, measure contrast changes and mean bias on all bands simultaneously. Statistics are calculated on square blocks, typically 32 × 32, and $Q2^n$ is averaged over the blocks of the whole image to yield the global score index. $Q2^n$ takes values in [0, 1] and is equal to 1 iff $\hat{\mathbf{z}} = \mathbf{z}$ for all pixels.
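For the scalar case $Q2^0 = Q$, a block-wise sketch of the index is given below; it is a simplified stand-in, since the full $Q2^n$ operates on hypercomplex pixel vectors rather than on single bands.

```python
import numpy as np

def uiqi_block(x, y, eps=1e-12):
    """Universal image quality index on one block: correlation, contrast
    and mean-bias factors multiplied together."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return (cov / (np.sqrt(vx * vy) + eps)) \
         * (2 * np.sqrt(vx * vy) / (vx + vy + eps)) \
         * (2 * mx * my / (mx ** 2 + my ** 2 + eps))

def uiqi(x, y, block=32):
    """Average of the block-wise index over the whole image."""
    scores = [uiqi_block(x[i:i + block, j:j + block], y[i:i + block, j:j + block])
              for i in range(0, x.shape[0] - block + 1, block)
              for j in range(0, x.shape[1] - block + 1, block)]
    return float(np.mean(scores))
```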
5.2. Full-Resolution Assessment
Quality can be evaluated at the original panchromatic scale, according to a full-resolution (FR) approach [62]. In this case, the spectral and spatial distortions are separately evaluated starting from the fused image and either the original low-resolution MS bands or the high-resolution panchromatic image, as first proposed by Zhu et al. [63].
5.2.1. QNR
A widely adopted FR assessment is based on the quality with no reference (QNR) protocol [51] and the related distortion indexes. QNR combines into a unique overall quality index a spectral distortion measure between the original and pansharpened MS bands and a spatial distortion measure between each MS band and Pan. The QNR protocol is based on the following assumptions:
- The fusion process should not change the intra-relationships between couples of MS bands; in other words, any change of the intra-relationships between couples of MS bands across resolution scales is considered as an indicator of spectral distortion;
- The fusion process should not change the inter-relationships between each MS band and the Pan image; in other words, any change of the inter-relationships between each MS band and the Pan across resolution scales is modeled as spatial distortion.
The QNR protocol employs the UIQI as a similarity measure and the absolute difference as the change operator. The spectral distortion index, $D_\lambda$, is obtained by computing two sets of UIQI values, each between couples of MS bands, before and after fusion; afterward, their absolute differences are taken and averaged:

$D_\lambda = \dfrac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j=1,\, j \neq i}^{N} \left| Q\!\left(\hat{M}_i, \hat{M}_j\right) - Q\!\left(M_i, M_j\right) \right|$

in which $\hat{M}_k$ denotes the $k$th fused band and $M_k$ the $k$th original MS band. The spatial distortion index, $D_S$, is computed by means of the average absolute UIQI difference, band by band, between MS and Pan, both at FR and at the original MS resolution:

$D_S = \dfrac{1}{N} \sum_{k=1}^{N} \left| Q\!\left(\hat{M}_k, P\right) - Q\!\left(M_k, P_L\right) \right|$

where $P$ is the full-resolution Pan image and $P_L$ its version degraded to the MS scale. Finally, a unique quality index is obtained by combining the complements of the spatial and spectral distortion indexes:

$\mathrm{QNR} = \left(1 - D_\lambda\right)^{\alpha} \left(1 - D_S\right)^{\beta}$

The exponents $\alpha$ and $\beta$ rule the balance of the spectral and spatial quality components. They can be normalized in such a way that $\alpha + \beta = 1$; in this case, if $\alpha = \beta = 1/2$, Equation (24) yields the geometric mean of the spectral and spatial qualities, though the normalization of the exponents compresses the variability of the cumulative index. Typical values for the exponents are $\alpha = \beta = 1$.
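Reusing the block-wise uiqi of the sketch in Section 5.1.3, the two QNR distortions and their combination could be coded as follows (with $\alpha = \beta = 1$; ms_orig, fused, pan and pan_low are hypothetical arrays for the original MS bands, the pansharpened bands, the full-resolution Pan and its degraded version at the MS scale).

```python
import numpy as np
from itertools import combinations

def d_lambda(ms_orig, fused):
    """Spectral distortion: change of inter-band UIQI across scales."""
    diffs = [abs(uiqi(fused[i], fused[j]) - uiqi(ms_orig[i], ms_orig[j]))
             for i, j in combinations(range(len(ms_orig)), 2)]
    return float(np.mean(diffs))

def d_s(ms_orig, fused, pan, pan_low):
    """Spatial distortion: change of band-to-Pan UIQI across scales."""
    diffs = [abs(uiqi(fused[k], pan) - uiqi(ms_orig[k], pan_low))
             for k in range(len(ms_orig))]
    return float(np.mean(diffs))

def qnr(ms_orig, fused, pan, pan_low, alpha=1.0, beta=1.0):
    """Cumulative index as the product of the complements of the distortions."""
    return ((1 - d_lambda(ms_orig, fused)) ** alpha
            * (1 - d_s(ms_orig, fused, pan, pan_low)) ** beta)
```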
5.2.2. Khan’s QNR
A totally different approach was later proposed by Khan et al. [52]. Analogously to QNR, Khan’s QNR (KQNR) defines and combines spectral and spatial consistency factors. The innovation introduced by the KQNR protocol is to make use of the consistency property of Wald’s protocol to calculate the spectral consistency of the pansharpened product. Since the evaluation of the consistency property requires a spatial degradation stage, including a decimation operation, the KQNR protocol proposes to use MTF-matched filters to perform the spatial degradation of the fused MS bands. Thus, the spectral distortion index, $D_\lambda^{(K)}$, is computed according to the following procedure:
1. Each fused MS band is spatially degraded (filtered and decimated) with its specific MTF-matched filter;
2. The $Q2^n$ index between the set of spatially degraded fused MS images and the original MS dataset is computed;
3. The one’s complement is taken to obtain a distortion measure:

$D_\lambda^{(K)} = 1 - Q2^n\!\left(\hat{\mathbf{M}}_{\downarrow}, \mathbf{M}\right)$

in which $\hat{\mathbf{M}}_{\downarrow}$ denotes the fused MS product after MTF-matched filtering and decimation and $\mathbf{M}$ the original MS dataset.
The spatial consistency of Khan’s protocol is given by the average change in interscale similarities between the highpass components of each fused band and of Pan:

$D_S^{(K)} = \dfrac{1}{N} \sum_{k=1}^{N} \left| Q\!\left(\hat{M}_k^{H}, P^{H}\right) - Q\!\left(M_k^{H}, P_L^{H}\right) \right|$

in which the superscript $H$ denotes the highpass spatial-frequency component, extracted at full resolution for the fused bands and Pan and at the original MS resolution for the original MS bands and the degraded Pan.
Again, a cumulative quality index is obtained by combining the complements of the spatial and spectral distortion indexes:

$\mathrm{KQNR} = \left(1 - D_\lambda^{(K)}\right)^{\alpha} \left(1 - D_S^{(K)}\right)^{\beta}$

with typical values for the exponents $\alpha = \beta = 1$.
It is noteworthy that, unlike QNR, the KQNR protocol states that the spectral and spatial consistencies are calculated on the lowpass and highpass spatial-frequency channels of the fused images: in the former case through a comparison with the original MS; in the latter case through a comparison with the highpass components of the original Pan and of the spatially degraded Pan.
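A rough sketch of the two Khan consistencies is reported below; a Gaussian filter is used as a crude surrogate of the MTF-matched filters, simple differencing provides the highpass channel, and the band-wise uiqi of the earlier sketch (Section 5.1.3) replaces $Q2^n$, so the code is only indicative of the structure of the protocol.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def d_lambda_khan(ms_orig, fused, ratio=4, sigma=1.0):
    """Spectral consistency: degrade each fused band (filter + decimate)
    and compare it with the corresponding original MS band."""
    fused_low = np.stack([gaussian_filter(b, sigma)[::ratio, ::ratio] for b in fused])
    q = np.mean([uiqi(fused_low[k], ms_orig[k]) for k in range(len(ms_orig))])
    return 1.0 - float(q)

def d_s_khan(ms_orig, fused, pan, pan_low, sigma=1.0):
    """Spatial consistency: change of similarity between the highpass
    components of each band and of Pan, across the two scales."""
    hp = lambda img: img - gaussian_filter(img, sigma)
    diffs = [abs(uiqi(hp(fused[k]), hp(pan)) - uiqi(hp(ms_orig[k]), hp(pan_low)))
             for k in range(len(ms_orig))]
    return float(np.mean(diffs))
```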
5.2.3. Hybrid QNR
The hybrid QNR (HQNR) was presented in [53] as the combination of the spectral distortion index of KQNR in Equation (25) with the spatial distortion index of QNR in Equation (23). Analogously to QNR and KQNR, a unique quality index is obtained by combining the one’s complements of the spectral and spatial distortions:

$\mathrm{HQNR} = \left(1 - D_\lambda^{(K)}\right)^{\alpha} \left(1 - D_S\right)^{\beta}$
For the sake of completeness, we could also consider the dual of HQNR (DQNR), in which the spectral distortion in Equation (22) is coupled with the spatial distortion in Equation (26), defined as:

$\mathrm{DQNR} = \left(1 - D_\lambda\right)^{\alpha} \left(1 - D_S^{(K)}\right)^{\beta}$