1. Introduction
Satellite systems face inherent limitations in imaging, storage, and data transmission, leading to a trade-off between spectral and spatial resolution in remote sensing images [1,2]. Despite technological progress, spaceborne sensors typically capture high spatial resolution (HR) panchromatic (PAN) images alongside low spatial resolution (LR) multispectral (MS) images, rather than direct HR MS data.
Pansharpening, a fundamental image fusion technique in remote sensing, addresses this by combining the high spatial resolution of PAN images with the rich spectral information of MS images to produce high spatial resolution multispectral (HRMS) products [3]. This process enhances the interpretability and utility of satellite imagery, enabling applications such as land cover classification, urban planning, environmental monitoring, disaster management, precision agriculture, visual analysis, change detection, and mapping. The pansharpening field has evolved through key phases since the 1980s [4]: early techniques like Intensity-Hue-Saturation (IHS) and High-Pass Filtering (HPF) [5]; 1990s advancements in Multiresolution Analysis (MRA) [6,7]; and post-2000 innovations including variational optimization (VO) [8,9], deep learning (DL) [10], and unified frameworks [11,12]. Recent work has further refined these deep learning approaches by exploring interactions between the spatial and frequency domains to improve feature extraction and fusion accuracy [13].
However, the fusion process frequently introduces artifacts, with spectral distortion being one of the most critical and pervasive issues. Spectral distortion refers to the alteration of the original spectral properties in the fused image, manifesting as color shifts, radiometric inconsistencies, or loss of spectral fidelity. Visually, this may appear as unnatural hues or brightness variations, but its impact extends far beyond aesthetics. In quantitative remote sensing tasks, such as vegetation index calculation (e.g., NDVI), mineral mapping, or change detection, precise spectral signatures are paramount. Even subtle distortions can lead to erroneous interpretations, compromising the reliability of downstream analyses. For instance, altered spectral bands may misrepresent vegetation health or soil composition, leading to inaccurate decision-making in fields like agriculture or ecology.
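To make this sensitivity concrete, the standard NDVI formula, NDVI = (NIR − R)/(NIR + R), shows how even a modest radiometric bias in one band propagates into the derived index. The sketch below uses illustrative reflectance values (not data from this study):

```python
# Illustrative only: how a small radiometric gain error in the red band shifts NDVI.
def ndvi(nir, red):
    # Normalized Difference Vegetation Index
    return (nir - red) / (nir + red)

nir, red = 0.50, 0.10            # illustrative reflectances for healthy vegetation
clean = ndvi(nir, red)           # "true" NDVI
biased = ndvi(nir, red * 1.2)    # a 20% gain error in the red band
print(clean, biased)             # the distorted NDVI is noticeably lower
```

Even this small spectral distortion moves NDVI by several hundredths, enough to misrank vegetation condition in downstream analyses.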
Despite advances in pansharpening algorithms, categorized broadly into component substitution (CS), MRA, VO, and machine learning (ML) approaches, spectral distortion remains a persistent challenge, exacerbated by factors such as sensor misalignment, atmospheric effects, or algorithmic assumptions about spectral–spatial relationships. CS and MRA represent traditional methods, differing in spatial detail extraction: CS uses spectral transforms to separate and substitute intensity components with histogram-matched PAN data; MRA employs multiscale decompositions (e.g., à trous wavelet or generalized Laplacian pyramid) to inject high-frequency PAN details. VO recasts fusion as an optimization problem with fidelity and regularization terms, drawing from super-resolution and restoration techniques. ML, especially convolutional neural networks (CNNs), excels at managing non-linearities, balancing spatial and spectral quality in satellite imagery.
Quality assessment (QA) in pansharpening is vital for ensuring image reliability and accuracy [14]. It guides fusion algorithm selection, informs new method development, sets standards for downstream applications (e.g., classification or detection), and boosts the commercial appeal of fused products. Effective QA mitigates issues from spectral–spatial trade-offs, preventing distortions that could compromise results. Assessment approaches fall into three main types: qualitative (visual inspection) [15], application-based (task performance, e.g., classification) [16], and quantitative. The latter is considered the most objective, targeting spectral distortion (color alterations) and spatial distortion (artifact introduction or detail mismanagement).
These are subdivided into full-reference (FR) and no-reference (NR) categories. FR QA usually relies on Wald’s protocol [14], emphasizing consistency (the degraded fused image matches the original LR MS) and synthesis (the fused image mimics a hypothetical HR MS capture). Reduced-resolution (RR) assessment applies this by downscaling inputs, using the original LR MS as a reference, with filters matching the sensor modulation transfer function (MTF) [17]. FR metrics, such as the Spectral Angle Mapper (SAM), Spectral Information Divergence (SID), and Cross-Correlation (CC), require a high-resolution ground truth image for comparison, which is often unavailable in real-world scenarios. Furthermore, RR assessment suffers from scale-invariance failures at high ratios, filter biases, and MTF instability from sensor aging.
In contrast, NR methods evaluate the HR MS image directly, avoiding such assumptions, though protocols like Quality with No Reference (QNR) [18] and its variants (e.g., FQNR and MQNR) face criticism for spectral–spatial coupling and a lack of standardization [19]. Traditional no-reference image quality assessment (NR-IQA), or blind IQA, includes opinion-aware methods (e.g., BIQI [20], DIIVINE [21], BRISQUE [22], BLIINDS-II [23]) trained on distorted natural images with subjective scores, limiting generalization for pansharpening. Opinion-unaware approaches, like NIQE [24] and IL-NIQE [25], fit features to Multivariate Gaussian (MVG) models and measure distances from pristine benchmarks, an approach promising for remote sensing adaptation. For pansharpening-specific NR QA, evaluations occur at the PAN scale without HR MS references, using innovative distortion measures. QNR-like indices [26] assess band relationships via the Universal Image Quality Index (UIQI), with variants like hybrid QNR (HQNR) [27] and regression-based QNR (RQNR) [28] incorporating MTF and consistency for spectral fidelity. Quality Estimation by Fitting (QEF) [15] extrapolates RR metrics, enhanced by Kalman Filter-based (KQEF) [29] and Combiner-based (CQEF) [19] versions, but these depend on accurate down-sampling, suffer scale-invariance issues, and show uneven performance across scenes and distortions. DL-based NR-QA advances include CNN architectures such as the Deep Feature Similarity Measure Network (DFSM-net) [30] and the Three-Branch Neural Network (TBN-PSI) [31], which learn distortions without hand-crafted features, improving correlations but requiring large datasets and computational resources and reducing interpretability. MVG-based NR methods extract features (e.g., NDVI, NDWI, ASM, CON) from pristine MS images to train benchmark models, then compute distances for test images [32,33]. While effective, these often produce global scores, conflating spatial and spectral distortions.
The existing literature highlights several limitations in current NR spectral QA methods. For example, QNR-based indices assume that spectral relationships remain consistent across resolutions, ignoring the non-stationary nature of remote sensing imagery, which includes diverse land covers like vegetation, water bodies, and urban areas. Other MVG-based models, while effective for general image quality assessment, do not incorporate features tailored to spectral artifacts, such as deviations from natural statistical distributions or Color Moments. Our previous studies on datasets like the Ningbo University (NBU) pansharpening database [34], comprising imagery from sensors such as IKONOS (IK), WorldView-2 (WV-2), WorldView-3 (WV-3), and WorldView-4 (WV-4), reveal that pristine MS images adhere to statistical laws like Benford’s Law in their First Digit Distributions (FDDs) within hyperspherical color domains (HCDs), whereas fused images deviate markedly due to spectral alterations. Simulated distortions (e.g., hue shifts, saturation changes, non-linear intensity mismatches) further amplify these deviations, underscoring the need for metrics sensitive to such changes. Despite these insights, no dedicated NR metric exists that exclusively isolates spectral distortion while leveraging a comprehensive statistical model to capture both local and global spectral characteristics.
Recently, a spectral quality assessment method based on Benford’s Law was proposed, showing a strong correlation with visual perception [35]. Expanding on this concept, a subsequent study developed an NR metric specifically for fused hyperspectral imagery [36]. The findings from this later work indicate that the technique offers superior stability and robustness compared to alternative NR metrics, yielding results that align more closely with full-reference benchmarks.
Various other NR techniques employ an MVG model trained on features extracted from pristine, undistorted images [31,32]. However, these MVG-driven approaches typically generate a single aggregate quality score, failing to distinguish between spatial and spectral distortions.
Despite these strides, methods based on MVG, DL, and sparse coding frequently struggle to generalize across diverse sensors and scenes. Traditional FR protocols rest on flawed assumptions, existing NR metrics like QNR tend to couple distinct distortion types, and DL approaches demand prohibitively large datasets. These limitations highlight a critical need for advanced NR frameworks capable of isolating spectral distortion through specialized statistical features, which serves as the primary motivation for this paper.
To address this gap, this paper proposes a novel No-Reference Multivariate Gaussian-based Spectral Distortion Index (MVG-SDI) specifically designed for pansharpened images. Building on the MVG framework, the method extracts a hybrid feature set from non-overlapping image patches: 9-dimensional FDD features derived from Benford’s Law in the hyperspherical color space (HCS) to detect statistical deviations in angular components, and 12-dimensional Color Moment (CM) features (mean, standard deviation, and skewness across RGB-NIR channels) to quantify perceptual color shifts. These features are concatenated into a 21-dimensional vector per patch, forming a spectral feature matrix. Separate MVG models are fitted to the original MS (reference) and fused (test) images, with spectral distortion quantified via the Mahalanobis distance between their parameters. This approach ensures sensitivity to localized distortions while accounting for feature interdependencies, outperforming existing NR metrics in isolating spectral errors without confounding them with spatial artifacts.
The contributions of this work are threefold: (1) it introduces the first NR index dedicated solely to spectral distortion in pansharpened images using an MVG model, decoupling it from spatial quality assessment; (2) it integrates FDD and CM features within an MVG model, providing a robust statistical representation validated on diverse sensor data; and (3) extensive experiments on the NBU dataset demonstrate superior correlation with FR benchmarks (SAM, SID, CC) compared to QNR variants, highlighting its practical utility for algorithm optimization. This work provides targeted feature engineering and experimental protocols that allow future research and practitioners to extend and adapt the MVG framework to new satellites, sensors, or domain requirements.
The remainder of this paper is organized as follows. Section 2 details the proposed MVG-SDI methodology, including patching, feature extraction, model fitting, and score computation. Section 3 describes the experiments, including the dataset description, fusion algorithms, evaluation protocols, and analysis. Section 4 presents the discussion, followed by conclusions in Section 5.
2. Proposed Method
This section details the No-Reference Multivariate Gaussian-based Spectral Distortion Index (MVG-SDI). The proposed approach operates on the hypothesis that spectral distortions in pansharpened images manifest as statistical deviations from the natural spectral characteristics inherent in the original MS data.
The MVG model is a statistical method that describes an image’s pixel distribution using a mean vector and a covariance matrix, which capture the average pixel values and the correlation between spectral bands, respectively. This model, first introduced for blind assessment of natural images by Mittal et al. [24], has proven effective at capturing the statistical regularities of natural scenes. In pansharpening assessment, the MVG model is fitted to both the original and the fused images. By comparing the statistical parameters of the two distributions, the model can quantify how well the pansharpened image preserves the original spectral information. For a d-dimensional feature vector \(\mathbf{x}\), the probability density function (PDF) of the MVG distribution is given by

\[ f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{T} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right), \]

where \(\mathbf{x}\) is the feature vector, \(\boldsymbol{\mu}\) is the mean vector, \(\Sigma\) is the covariance matrix, both estimated from the feature matrix \(X\) containing the feature vectors of the K image patches, d is the dimensionality of the distribution, and the superscript T denotes the transpose.
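As an illustrative sketch (not the authors' implementation), the MVG density for a given mean vector and covariance matrix can be evaluated directly in NumPy:

```python
import numpy as np

def mvg_pdf(x, mu, cov):
    """Multivariate Gaussian density for a d-dimensional feature vector x."""
    d = mu.size
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

mu = np.zeros(2)
cov = np.eye(2)
# At the mean of a standard 2-D Gaussian, the density equals 1 / (2 * pi)
density = mvg_pdf(np.zeros(2), mu, cov)
print(density)
```

In practice only the fitted parameters (mean and covariance) are needed for the distance-based comparison described later; the density itself is shown here for completeness.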
This work adapts the MVG framework specifically for spectral distortion by focusing on features sensitive to color and radiometric changes, while excluding spatial-oriented ones. As illustrated in Figure 1, the method compares the statistical characteristics of the fused image against an ideal reference derived from the original MS data. The process uses the MS image as the training reference and the fused image as the testing sample. Both undergo identical processing: patch division, spectral feature extraction (combining FDD from Benford’s Law and CMs), and aggregation into a Spectral Features Matrix. Two separate MVG models are then fitted: a training MVG model from the MS data, representing undistorted spectral properties, and a testing MVG model from the fused image. The Mahalanobis distance between these models quantifies distortion, with smaller values indicating better spectral preservation. This adaptation enhances robustness by accounting for feature covariances, enabling detection of subtle, inter-dependent spectral artifacts that simpler distances overlook.
2.1. Patch-Based Decomposition
The fused and MS images are decomposed into a uniform grid of non-overlapping patches. Numerous patch sizes were tested, and the size reported in the implementation details (Table 3) was determined empirically to be the optimal choice. This patch-based approach directly addresses the spectral variability common in HR remote sensing imagery, where adjacent land-cover types such as vegetation, water, and urban surfaces display unique spectral characteristics. In contrast, a global statistical model would blend these diverse responses, potentially concealing specific distortions caused by pansharpening. The selected dimension strikes a practical balance: it is expansive enough to ensure reliable statistical calculations for FDD and CMs, yet compact enough to maintain local uniformity. As a result, the MVG-SDI is capable of detecting fine, location-specific color variations or radiometric discrepancies that broader metrics would typically miss.
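A minimal sketch of the non-overlapping patch decomposition in NumPy (the 64-pixel patch size below is illustrative only; the actual size used in this study is given in the implementation details):

```python
import numpy as np

def extract_patches(img, size):
    """Split an H x W x B image into non-overlapping size x size patches,
    discarding any incomplete border patches."""
    h, w = img.shape[:2]
    patches = []
    for r in range(0, h - size + 1, size):
        for c in range(0, w - size + 1, size):
            patches.append(img[r:r + size, c:c + size])
    return patches

img = np.random.rand(256, 256, 4)   # a 4-band image (e.g., RGB + NIR)
patches = extract_patches(img, 64)
print(len(patches))                  # 16 patches, each 64 x 64 x 4
```

Each patch is subsequently reduced to a single feature vector, so the grid granularity directly controls how locally the index can detect spectral artifacts.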
2.2. Spectral Feature Extraction
The performance of the proposed index depends on features that are highly sensitive to the spectral distortions typically introduced during the pansharpening process. To achieve this, a hybrid feature set was employed that combines FDD features derived from Benford’s Law to capture deviations in spectral statistics with CM features, which characterize the global color distribution through the mean, standard deviation, and skewness of each channel. This combination yields a comprehensive 21-dimensional feature vector for each image patch, effectively representing both local spectral consistency and global chromatic variation.
2.2.1. First Digit Distribution Features
Distortion causes an image’s statistical characteristics to stray from their expected norms. By extracting these statistics as features and measuring their divergence, image quality can be assessed without any reference images. One widely used statistical regularity is Benford’s Law, which has been employed in natural image quality assessment [37,38]. Unlike natural images, remote sensing data often include multiple bands and rich spectral information, such as MS and hyperspectral images, making spectral distortion measurement in pansharpened MS images essential for evaluating sharpening performance.
To ensure the reproducibility of the proposed metric, it is important to note that no radiometric normalization, clipping, or dynamic range rescaling is applied to the input images prior to feature extraction. The HCS transform is applied directly to the raw pixel values of the fused and reference images. The only normalization performed is the scaling of the angular components by the constant \(2/\pi\), as defined in Equation (3), to map them to the \([0, 1]\) interval required for consistent First Digit Distribution analysis.
The experiments on the NBU database show that the FDD of the angular components of pristine MS images in the HCD adheres to standard Benford’s Law. As shown in Figure 2, for high-quality (undistorted, unprocessed) images, the FDD features of four different MS bands align almost perfectly with the theoretical Benford distribution. Figure 3 illustrates that once the same images are processed by fusion algorithms known to introduce spectral errors (such as the BT-H, TV, and PWMBF methods), their first-digit frequencies deviate markedly. Similarly, Figure 4 shows that simulated spectral degradations like hue, saturation, and non-linear intensity mismatch break the Benford pattern even more severely, causing the distribution to skew away from the theoretical curve. These distortions confirm that only pristine MS images in the HCD truly follow Benford’s Law; any filtering or spectrally altering processing disrupts the natural first-digit statistics. Building on this, a nine-dimensional feature vector was extracted based on Benford’s Law to quantify spectral distortion.
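The comparison against the theoretical law can be sketched as follows. Benford's Law predicts first-digit probabilities \(P(a) = \log_{10}(1 + 1/a)\); the code below (an illustrative sketch, with synthetic Benford-like data rather than MS imagery) measures how far an empirical first-digit distribution strays from it:

```python
import numpy as np

# Theoretical Benford probabilities for first digits 1..9
benford = np.log10(1 + 1 / np.arange(1, 10))

def fdd(values):
    """Empirical first-significant-digit distribution of positive values."""
    v = np.asarray(values, dtype=float)
    v = v[v > 0]
    # Shift each value into [1, 10) and truncate to get its first digit
    digits = (v / 10.0 ** np.floor(np.log10(v))).astype(int)
    return np.bincount(digits, minlength=10)[1:10] / digits.size

# Data whose magnitudes are spread over several orders follow Benford closely
rng = np.random.default_rng(0)
sample = 10 ** rng.uniform(-3, 3, 100_000)
deviation = np.abs(fdd(sample) - benford).sum()
print(deviation)   # small total deviation for Benford-like data
```

For a distorted fused image, the analogous deviation computed on its angular components would grow, which is exactly the signal the FDD features exploit.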
First, the hyperspherical color space (HCS) transform is employed to map the N-band pansharpened image from its original space to the hyperspherical color space. This process separates the intensity component from the angular components, yielding one intensity component and N − 1 angular components. Specifically, the intensity component characterizes the spatial information of the pansharpened image, while the angular components represent its spectral information.
Let the intensity component of the N-band pansharpened image in the HCD be denoted as \(I\), and the angular components as \(\varphi_1, \ldots, \varphi_{N-1}\); each angular component has the same spatial dimensions as the pansharpened image, and each of its pixels takes values in \([0, \pi/2]\). The hyperspherical color transform is calculated as follows:

\[ I = \sqrt{x_1^2 + x_2^2 + \cdots + x_N^2}, \qquad \varphi_i = \arctan\!\left( \frac{\sqrt{x_{i+1}^2 + \cdots + x_N^2}}{x_i} \right), \quad i = 1, \ldots, N-1, \]

where \(x_1, \ldots, x_N\) denote the N band values of a pixel. The raw angular components \(\varphi_i\) fall within the range \([0, \pi/2]\). Before feature extraction, these are normalized to the range \([0, 1]\) to obtain \(\hat{\varphi}_i\):

\[ \hat{\varphi}_i = \frac{2}{\pi}\, \varphi_i. \]
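A sketch of the angular decomposition, assuming the standard hyperspherical coordinate definition (the helper name and the sample pixel values are ours, for illustration):

```python
import numpy as np

def hcs_angles(bands):
    """Hyperspherical transform of a (pixels x N) band stack: returns the
    intensity and the N-1 angular components normalized to [0, 1].
    Assumes the standard hyperspherical coordinate definition."""
    x = np.asarray(bands, dtype=float)
    intensity = np.sqrt((x ** 2).sum(axis=1))
    n = x.shape[1]
    angles = []
    for i in range(n - 1):
        tail = np.sqrt((x[:, i + 1:] ** 2).sum(axis=1))
        angles.append(np.arctan2(tail, x[:, i]))   # in [0, pi/2] for x >= 0
    angles = np.stack(angles, axis=1)
    return intensity, 2.0 * angles / np.pi          # scale [0, pi/2] -> [0, 1]

pix = np.array([[0.2, 0.4, 0.3, 0.1]])             # one 4-band pixel
I, ang = hcs_angles(pix)
print(I.shape, ang.shape)                           # 1 intensity, 3 angles
```

Since the band values are non-negative, the angles land in \([0, \pi/2]\) and the final scaling maps them into the unit interval required by the first-digit analysis.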
Figure 5 displays the heatmaps of the HCD normalized angular components for the MS image and the IHS pansharpened image, which exhibits severe spectral distortion. It can be observed that the normalized angular component heatmaps of the IHS pansharpened image differ significantly from those of the MS image. This demonstrates that the normalized angular components can effectively reflect spectral distortion.
Next, the FDD features are determined. For every pixel in the normalized component \(\hat{\varphi}_i\), the first non-zero digit is extracted. The probability of each digit \(a \in \{1, \ldots, 9\}\) is calculated as

\[ P(a) = \frac{N_a}{N_p}, \]

where \(N_a\) is the count of digit a, and \(N_p\) is the total number of pixels in the patch. This creates a 9-dimensional FDD feature vector for each angular component. These are then averaged across the N − 1 angles to produce a single 9-dimensional FDD feature vector for the patch:

\[ \mathbf{f}_{FDD} = \big[ P(1), P(2), \ldots, P(9) \big]^{T}, \]

where the superscript T represents the transpose of the vector.
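The per-patch FDD extraction described above can be sketched as follows (an illustrative sketch with synthetic angles; the helper names are ours):

```python
import numpy as np

def first_digit_probs(channel):
    """9-bin first-non-zero-digit probabilities of a component in (0, 1]."""
    v = np.asarray(channel, dtype=float).ravel()
    v = v[v > 0]
    # Shift each value into [1, 10) and truncate to get its first digit
    digits = (v / 10.0 ** np.floor(np.log10(v))).astype(int)
    return np.bincount(digits, minlength=10)[1:10] / digits.size

def patch_fdd(angles):
    """angles: (pixels x (N-1)) normalized angular components of one patch.
    Returns the 9-D FDD vector averaged over the angular components."""
    probs = [first_digit_probs(angles[:, j]) for j in range(angles.shape[1])]
    return np.mean(probs, axis=0)

rng = np.random.default_rng(1)
f_fdd = patch_fdd(rng.uniform(0, 1, (4096, 3)))   # e.g., 64x64 patch, 3 angles
print(f_fdd.shape)                                 # a 9-D probability vector
```

The averaging over the N − 1 angular components keeps the descriptor length fixed regardless of the number of spectral bands.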
2.2.2. Color Moment Features
While the FDD features capture the underlying statistical distribution, they are complemented by CM features to provide a more direct and perceptually relevant measure of the image’s color profile. This feature set is designed to capture the global color distribution within each patch and is highly sensitive to the primary artifacts of spectral distortion, such as changes in brightness, chromaticity, and illumination. Pansharpening algorithms can often introduce radiometric shifts (biasing brightness), alter the gain between bands (changing color balance), or cause non-linear saturation, all of which are effectively quantified by this feature set.
For each patch, the first three statistical moments are calculated for each of the red (R), green (G), blue (B), and Near-Infrared (NIR) channels. This process forms a compact and highly descriptive 12-dimensional feature vector (4 channels × 3 moments/channel):
Mean (\(\mu\)): The first-order moment. This represents the average color intensity of a channel, directly reflecting the image’s overall brightness or any radiometric bias introduced during fusion.
Standard deviation (\(\sigma\)): The second-order moment. This measures the contrast or dynamic range within a channel. A higher \(\sigma\) indicates more variation in pixel intensities, a property often compressed or unnaturally expanded by fusion.
Skewness (\(s\)): The third-order moment. This captures the asymmetry of the pixel distribution. It is highly sensitive to non-linear distortions, indicating whether the channel is biased toward darker or brighter tones, which often results from pixel value clipping or saturation.
Thus, each image block yields the concatenated 12-dimensional feature vector:

\[ \mathbf{f}_{CM} = \big[ \mu_R, \sigma_R, s_R,\; \mu_G, \sigma_G, s_G,\; \mu_B, \sigma_B, s_B,\; \mu_{NIR}, \sigma_{NIR}, s_{NIR} \big]^{T}. \]
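The three moments per band can be computed directly from each patch, as in this sketch (illustrative, not the authors' code):

```python
import numpy as np

def color_moments(patch):
    """patch: h x w x 4 array (R, G, B, NIR). Returns the 12-D vector
    [mean, std, skewness] for each of the four bands, in band order."""
    feats = []
    for b in range(patch.shape[2]):
        v = patch[:, :, b].ravel().astype(float)
        mu, sigma = v.mean(), v.std()
        # Third standardized moment; guarded against constant patches
        skew = ((v - mu) ** 3).mean() / sigma ** 3 if sigma > 0 else 0.0
        feats.extend([mu, sigma, skew])
    return np.array(feats)

patch = np.random.default_rng(2).uniform(0, 1, (64, 64, 4))
f_cm = color_moments(patch)
print(f_cm.shape)   # (12,)
```

Because the moments are computed per band, a gain change, a radiometric bias, or a clipping-induced asymmetry in any single channel shows up in exactly one of the twelve entries.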
2.3. Spectral Feature Matrix Construction
Following the extraction of the two distinct feature sets from each patch, the next step is to combine them into a single, powerful descriptor. For each individual image patch, the FDD features (a 9-dimensional vector) and the CM features (a 12-dimensional vector) are concatenated end-to-end. This fusion of features is crucial, as it creates a single, comprehensive vector that simultaneously describes the patch’s underlying statistical “naturalness” (from FDD) and its direct perceptual color profile (from CMs).
This process results in a final 21-dimensional spectral feature vector for each patch, defined as

\[ \mathbf{f} = \big[ \mathbf{f}_{FDD}^{T},\; \mathbf{f}_{CM}^{T} \big]^{T} \in \mathbb{R}^{21}. \]
This procedure is repeated for all N non-overlapping patches extracted from the image. The resulting N feature vectors are then aggregated to form the comprehensive spectral feature matrix. This final matrix statistically represents the complete spectral properties of the entire image.
2.4. Model Fitting
The core of the quality assessment lies in comparing the statistical characteristics of the fused image against those of the original MS image using the MVG framework. To achieve this, two distinct models are constructed:
Training MVG model (\(\boldsymbol{\mu}_r, \Sigma_r\)): This model is constructed from the spectral feature matrix (containing both FDD and CM features) extracted from the original, LR distortion-free MS image. Its parameters, the mean vector \(\boldsymbol{\mu}_r\) and covariance matrix \(\Sigma_r\), serve as the “ground truth” statistical benchmark, representing the ideal spectral properties that the fused image should replicate.
Testing MVG model (\(\boldsymbol{\mu}_t, \Sigma_t\)): In parallel, a second model is built for the fused image under evaluation. Its mean vector \(\boldsymbol{\mu}_t\) and covariance matrix \(\Sigma_t\) are computed from its own spectral feature matrix, which also contains the FDD and Color Moment features. This model represents the actual spectral statistics of the final fused product, including any distortions.
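Fitting each model amounts to estimating a mean vector and covariance matrix from the K × 21 feature matrix, as in this sketch (the random matrices stand in for real feature matrices):

```python
import numpy as np

def fit_mvg(feature_matrix):
    """feature_matrix: K patches x 21 features.
    Returns the MVG parameters (mean vector, covariance matrix)."""
    mu = feature_matrix.mean(axis=0)
    cov = np.cov(feature_matrix, rowvar=False)   # rows are observations
    return mu, cov

rng = np.random.default_rng(3)
X_ref = rng.normal(size=(200, 21))   # stand-in for the MS feature matrix
X_tst = rng.normal(size=(200, 21))   # stand-in for the fused-image features
mu_r, cov_r = fit_mvg(X_ref)
mu_t, cov_t = fit_mvg(X_tst)
print(mu_r.shape, cov_r.shape)       # (21,) (21, 21)
```

Note that reliable covariance estimation requires substantially more patches than feature dimensions, one practical reason the patch grid must be dense enough.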
The main steps of the proposed method can be summarized in Algorithm 1.
Algorithm 1 Pseudocode of the proposed MVG-SDI method

Require: Multispectral image MS, fused image F
Require: Block size B, feature dimension d = 21
Ensure: Spectral Distortion Index Q
 1: Step 1: Feature extraction
 2: Function ExtractFeatures(Image):
 3:   Divide Image into K non-overlapping patches of size B × B
 4:   for k = 1 to K do
 5:     // Extract FDD features (9 dimensions)
 6:     Convert patch to HCS to obtain the angular components \(\varphi_i\)
 7:     Normalize angles: \(\hat{\varphi}_i = 2\varphi_i/\pi\)
 8:     Compute digit probabilities P(1), …, P(9) based on Benford’s Law
 9:     // Extract CM features (12 dimensions)
10:     Select first 4 bands (RGB + NIR)
11:     Compute mean (\(\mu\)), standard deviation (\(\sigma\)), and skewness (s) for each band
12:     Construct vector \(\mathbf{f}_{CM}\)
13:     // Concatenate feature vector
14:     \(\mathbf{f}_k = [\mathbf{f}_{FDD};\, \mathbf{f}_{CM}]\)
15:   end for
16:   return Feature matrix \(X = [\mathbf{f}_1, \ldots, \mathbf{f}_K]\)
17: Step 2: Model construction
18: \(X_r\) = ExtractFeatures(MS)
19: \(X_t\) = ExtractFeatures(F)
20: Compute training MVG model parameters: \(\boldsymbol{\mu}_r, \Sigma_r\)
21: Compute testing MVG model parameters: \(\boldsymbol{\mu}_t, \Sigma_t\)
22: Step 3: Distance calculation
23: Compute pooled covariance: \(\Sigma_p = (\Sigma_r + \Sigma_t)/2\)
24: Calculate difference vector: \(\boldsymbol{\delta} = \boldsymbol{\mu}_t - \boldsymbol{\mu}_r\)
25: Compute Mahalanobis distance: \(Q = \sqrt{\boldsymbol{\delta}^{T} \Sigma_p^{-1} \boldsymbol{\delta}}\)
26: return Q
2.5. Quality Score Computation
The spectral distortion is formally quantified by calculating the Mahalanobis distance (D) between the statistical parameters of the two fitted MVG models, utilizing the formulation provided in Equation (8). This distance measures the dissimilarity between the mean feature vector of the test model (\(\boldsymbol{\mu}_t\)) and the mean of the ideal reference model (\(\boldsymbol{\mu}_r\)):

\[ D = \sqrt{ (\boldsymbol{\mu}_t - \boldsymbol{\mu}_r)^{T} \left( \frac{\Sigma_r + \Sigma_t}{2} \right)^{-1} (\boldsymbol{\mu}_t - \boldsymbol{\mu}_r) }. \]
By incorporating the pooled covariance matrix, this metric offers a more robust assessment than simple Euclidean distance. It explicitly normalizes for statistical variance and accounts for the complex inter-dependencies between distinct spectral features, such as the correlation between FDD variations and Color Moments.
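The pooled-covariance Mahalanobis distance can be sketched as follows (illustrative 3-D parameters stand in for the 21-D models; the function name is ours):

```python
import numpy as np

def mvg_sdi(mu_r, cov_r, mu_t, cov_t):
    """Mahalanobis distance between two fitted MVG models,
    using the pooled covariance of reference and test."""
    pooled = 0.5 * (cov_r + cov_t)
    diff = mu_t - mu_r
    return float(np.sqrt(diff @ np.linalg.inv(pooled) @ diff))

mu, cov = np.zeros(3), np.eye(3)
print(mvg_sdi(mu, cov, mu, cov))         # identical models -> zero distortion
print(mvg_sdi(mu, cov, mu + 0.5, cov))   # grows with the mean shift
```

Unlike a plain Euclidean distance between the mean vectors, the inverse pooled covariance down-weights directions where the features naturally vary and amplifies shifts along tightly correlated feature combinations.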
3. Experiments
3.1. Datasets
Experimental validation was performed using the publicly available large-scale NBU dataset [34], which comprises 1200 diverse image pairs acquired by four distinct satellite sensors: IK, WV-2, WV-3, and WV-4. Each sample pair contains a high-resolution PAN image and a corresponding low-resolution MS image. Table 1 provides a detailed breakdown of the image pairs and spectral bands, while Figure 6 displays representative examples from each sensor subset.
3.2. Fusion Algorithms
The evaluation employed 19 distinct pansharpening algorithms, sourced from two public MATLAB toolkits [39] to ensure standardized implementation. These methods were selected to represent a diverse cross-section of established techniques, which is crucial for assessing the proposed index’s performance across various types of spectral artifacts. The set includes five CS, nine MRA, four VO, and four ML methods, covering the most prominent categories in the field. Table 2 provides a summary of the algorithms utilized in this study.
3.3. Implementation Details
To ensure the reliability of the proposed MVG-SDI metric and facilitate future comparisons, all key implementation parameters have been standardized; these settings include the patch size, normalization constants, and feature dimensionality. The specific values used in this study are detailed in
Table 3 below.
3.4. Spectral Quality Assessment Metrics
To validate the performance of the proposed MVG-SDI, its results were compared against a comprehensive suite of established evaluation metrics. This suite included both competing NR methods and benchmark FR measures.
For the NR comparison, the spectral distortion components of several prominent QNR-like indices were employed. These indices are designed to assess how well spectral characteristics are preserved in pansharpened images without requiring a ground truth reference.
In addition, three widely used FR evaluation measures, the Spectral Angle Mapper (SAM), Spectral Information Divergence (SID), and the correlation coefficient (CC), were incorporated as benchmarks. These metrics quantify spectral angular consistency, information-theoretic spectral divergence, and statistical correlation, respectively.
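As a concrete illustration of one of these FR benchmarks, SAM averages the angle between corresponding pixel spectra of the reference and test images; a sketch (not the toolbox implementation) follows:

```python
import numpy as np

def sam_degrees(ref, test):
    """Mean Spectral Angle Mapper between two H x W x B images, in degrees."""
    r = ref.reshape(-1, ref.shape[-1]).astype(float)
    t = test.reshape(-1, test.shape[-1]).astype(float)
    num = (r * t).sum(axis=1)
    den = np.linalg.norm(r, axis=1) * np.linalg.norm(t, axis=1)
    # Clip guards against floating-point values marginally outside [-1, 1]
    ang = np.arccos(np.clip(num / den, -1.0, 1.0))
    return float(np.degrees(ang.mean()))

ref = np.random.default_rng(4).uniform(0.1, 1.0, (32, 32, 4))
print(sam_degrees(ref, ref))          # ~0 for identical images
print(sam_degrees(ref, ref * 0.8))    # still ~0: SAM ignores uniform gain
```

The second call highlights a known SAM property: being angle-based, it is insensitive to a uniform gain across all bands, which is one reason it is paired with SID and CC rather than used alone.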
Moreover, to properly evaluate these metrics, this study employs a dual-scale experimental protocol consisting of NR and FR validation. This approach is essential for comparing the proposed index against other NR metrics at the original scale, while validating its performance against FR benchmarks at a reduced scale. The two protocols are summarized as follows.
3.4.1. No-Reference Assessment
This protocol compares the proposed spectral index against the existing NR spectral indices. The fusion algorithms are applied at the original full resolution, and all NR metrics are computed directly on the fused outputs to evaluate performance under realistic conditions.
3.4.2. Reduced-Resolution Validation (Wald’s Protocol)
This protocol validates the proposed NR index against reference-based scores using Wald’s protocol. Following standard degradation and upsampling procedures, fusion is performed at a reduced scale. The FR metrics (SAM, SID, and CC) are computed by comparing the fused results against the original MS image, which serves as the ground truth. The proposed NR spectral index is simultaneously calculated on these reduced-resolution images to analyze its correlation with the FR benchmarks.
3.5. Numerical Evaluation
A comprehensive analysis was conducted to benchmark the proposed MVG-SDI against state-of-the-art metrics across diverse fusion categories (CS, MRA, VO, and ML). In the reported results, the optimal performance within each category is highlighted in red, while the poorest performance is marked in blue.
A critical observation from this comparison is the strong alignment between the proposed method and one of the advanced QNR-based indices. As evidenced in Table 4, Table 5, Table 6 and Table 7, the proposed index exhibits a high degree of correlation with that index in identifying extreme performers. Specifically, both metrics converge on the same best-performing algorithm in the WV-2 dataset, identify identical best and worst algorithms in the WV-3 dataset, and consistently flag the same poorest performer in the WV-4 dataset.
Furthermore, the numerical results validate the proposed method’s capability to assess spectral distortion effectively. It demonstrates performance comparable, and in several cases superior, to established NR benchmarks, while maintaining logical consistency with FR metrics such as the CC.
The numerical results for the IK dataset are presented in Table 4. PNN achieved the best scores for CC, SAM, and SID, indicating superior spectral fidelity with respect to the reference. Conversely, GS yielded the lowest CC value, while PWMBF recorded the poorest performance for both SAM and SID. Among the QNR-based indices, the optimal selection varied: one index favored A-PNN, another selected SR-D, and a third identified GS as the top-performing algorithm. Notably, the proposed method identified PRACS as the best fusion model, aligning more closely with the NR indices that prefer CS-based methods. Regarding the poorest results, SR-D exhibited the highest distortion levels according to the proposed metric, whereas one of the competing indices flagged MTF-GLP-HPM-H as the worst performer.
Quantitative results for the WV-2 dataset are presented in Table 5. PNN-IDX achieved the best scores for both SAM and SID, indicating superior spectral fidelity, while MTF-GLP-HPM-H recorded the highest CC value. Conversely, MF yielded the lowest correlation coefficient, while PRACS and AWLP exhibited the highest spectral distortion across both SAM and SID measures. Among the QNR-based indices, the optimal selection varied: one index favored BT-H, whereas the other two identified GS as the top-performing algorithm. Notably, the proposed method aligned with these latter indices, consistently identifying GS as the best fusion model. Regarding the poorest results, the proposed metric flagged SR-D as the worst performer, while one competing index identified PNN-IDX, the algorithm with the best ground-truth spectral fidelity, as the poorest model, further highlighting the divergence between FR and NR assessments.
The numerical assessment results for the WV-3 dataset are summarized in Table 6. MTF-GLP achieved the highest CC, while BT-H and A-PNN-FT recorded the best performances for SID and SAM, respectively, indicating superior spectral preservation. Conversely, PNN-IDX consistently yielded the poorest results across all full-reference metrics, exhibiting the lowest correlation and highest spectral distortion. Among the QNR-based indices, the optimal selection varied significantly: two indices favored BT-H, aligning with the SID results, whereas the third selected GS as the top-performing algorithm. Notably, the proposed method also identified GS as the best fusion model. Regarding the poorest results, a distinct contradiction was observed: the proposed metric flagged BT-H as the worst performer (0.2899) despite it achieving the best SID score, while one competing index aligned with the FR benchmarks by identifying PNN-IDX as the poorest model.
The quantitative evaluation for the WV-4 dataset is detailed in
Table 7. FE-HPM achieved the best scores for CC, SAM, and SID, consistently demonstrating superior spectral fidelity across all full-reference benchmarks. Conversely, BT-H recorded the lowest correlation coefficient, while SR-D and PWMBF exhibited the highest spectral distortion in terms of SID and SAM, respectively. Among the QNR-based indices, the optimal selection varied:
favored BDSD,
selected PRACS, and
identified SR-D as the top-performing algorithm. Notably, the proposed method identified PWMBF as the best fusion model; however, this presents a significant contradiction, as PWMBF yielded the poorest SAM value in the reference-based assessment. Regarding the poorest results, a rare consensus was observed: all NR metrics, including the proposed method, flagged BT-H as the worst performer, aligning with its lowest ranking in the CC evaluation.
3.6. Visual Evaluation
The visual performance of the proposed index was evaluated against the benchmark metrics (, , and ) using the GS fusion method, selected for its tendency to induce noticeable spectral distortions.
Figure 7 illustrates the distortion maps for the IK dataset. The GS fused image (f) exhibits characteristic spectral shifts relative to the MS reference (e). The maps for
(a) and
(c) display identical patterns concentrated almost exclusively along high-frequency edges. This indicates a bias toward spatial features rather than true spectral deviations. Meanwhile,
(b) appears almost entirely blue, failing to register the distortion. Conversely, the proposed index (d) generates a distinct heatmap with high-intensity values (red and yellow) distributed across broad object surfaces, effectively highlighting the spectral errors that competing metrics mistake for spatial structures.
This behavior is further validated on the WV-2 dataset, as shown in
Figure 8. Here, the GS method introduces spectral deviations across varied urban materials. Consistent with the previous dataset,
(a) and
(c) remain nearly indistinguishable, focusing narrowly on high-contrast features such as bright rooftops, while
(b) significantly underestimates the error magnitude. The proposed index (d), however, captures the widely distributed inconsistencies, producing a heatmap that correlates robustly with the global spectral degradation inherent to the component substitution process.
The analysis of the WV-3 dataset shown in
Figure 9 demonstrates the robustness of the proposed index in scenes with high dynamic range. Visually, the GS image (f) shows spectral shifts in both bright blue industrial rooftops and deep shadowed regions. The competing metrics exhibit a strong radiance bias:
and
detect artifacts only on the bright rooftops, leaving the background and shadows unassessed. In contrast, the proposed index (d) identifies spectral degradation across the entire dynamic range, showing significant responsiveness even in the low-radiance shadowed areas that other metrics fail to register.
Finally, the WV-4 dataset, as shown in
Figure 10, offers the most striking validation. The GS fusion (f) suffers from severe global spectral distortion, appearing as an unnatural reddish-brown hue shift compared to the reference (e). Despite this obvious degradation,
(b) remains unresponsive, and
(a) and
(c) display only scattered, low-intensity noise. The proposed index (d) is the only metric to produce a high-intensity response with prominent hotspots aligning precisely with the most distorted regions, proving its superior capability in quantifying severe spectral artifacts.
3.7. Visual Analysis of Fusion Results
A comprehensive visual inspection of the fusion outcomes across the different datasets, spanning
Figure 11,
Figure 12,
Figure 13 and
Figure 14, reveals distinct variations in algorithmic performance. This qualitative analysis highlights the critical trade-off between spatial enhancement and spectral preservation, demonstrating how certain methods generalize more robustly across diverse sensor platforms than others.
For the IK dataset illustrated in
Figure 11, the BT-H and MTF-GLP methods distinguish themselves with superior visual performance. These algorithms effectively inject high-frequency spatial details while maintaining rigorous spectral fidelity, resulting in images that are sharp yet natural. The PNN method also delivers competent results, striking a commendable balance between detail enhancement and artifact suppression. In contrast, the BDSD algorithm performs only moderately, failing to achieve the spatial crispness delivered by the top performers. Meanwhile, the GS and AWLP methods occupy a middle ground; while their outputs are acceptable, they lack the refined clarity and spectral accuracy observed in the leading models.
In the case of the WV-2 dataset presented in
Figure 12, a sharp disparity in spectral preservation capabilities is evident. The BT-H and GS algorithms, alongside the deep learning-based family (PNN, PNN-IDX, and A-PNN), produce the most visually convincing results. These methods are characterized by the precise rendering of spatial features and the maintenance of natural color distributions. Conversely, both BDSD and AWLP suffer from severe spectral degradation that compromises image utility. Specifically, BDSD introduces a pervasive, unnatural green hue, whereas AWLP manifests a distinct brownish cast, indicating a significant failure to preserve the original spectral distribution of the scene.
The visual assessment of the WV-3 dataset in
Figure 13 highlights the robustness of BT-H, GS, and PNN-IDX, which consistently provide very good visual results with high spatial clarity. AWLP also performs well in this scenario. However, spectral distortions remain a challenge for other methods: BDSD again exhibits a strong green bias, while PNN produces an unclear, brownish output. Furthermore, A-PNN fails to retain visual quality, resulting in a generally poor fusion product.
Finally, for the WV-4 dataset presented in
Figure 14, the traditional methods largely outperform the learning-based approaches. BT-H, BDSD, GS, MTF-GLP, and MTF-GLP-HPM-FS all demonstrate good fusion capabilities, balancing spatial enhancement with spectral accuracy. AWLP, however, results in noticeably darker imagery, suggesting a loss of luminance. Notably, the deep learning models (PNN, PNN-IDX, and A-PNN) struggle significantly with this dataset, collectively exhibiting severe spectral shifts towards yellow and green tones, rendering them less suitable for this specific sensor data.
3.8. Quality Assessment with Spectral Degradations
Different pansharpening methods can introduce various color-related artifacts, known as spectral degradations. Common spectral degradations such as color shifting, intensity mismatch, and oversaturation significantly reduce the spectral fidelity of the fused image. To test the proposed method, these spectral degradations were manually generated to verify its efficacy in correctly identifying and ranking them.
3.8.1. Hue and Saturation Shift
These artifacts represent one of the most direct and perceptually obvious sources of spectral distortion. They are particularly common in CS methods, most notably the IHS fusion family. The core issue arises from a fundamental spectral mismatch between the two source images. In the IHS method, the up-sampled MS image is transformed into the IHS color space, and its “intensity” (I) component is replaced by the high-resolution PAN image. The problem is that the broad spectral response of the PAN sensor is not a perfect representation of the synthetic intensity component, which is calculated from the narrower MS bands. When this spectrally inconsistent PAN image is substituted and transformed back to the original color space, it introduces brightness levels that do not align with the original hue (H) and saturation (S) information. This mismatch leads to significant and unrealistic color shifts, altering the appearance of features like vegetation or water [
5].
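To make the substitution mechanism concrete, the following Python sketch performs a naive component substitution, using HSV as a convenient stand-in for the IHS transform; `ihs_substitute`, `ms_up`, and `pan` are illustrative names, not the authors' implementation:

```python
import colorsys
import numpy as np

def ihs_substitute(ms_up, pan):
    """Naive component-substitution sketch: convert the up-sampled MS image
    to HSV, replace the value (intensity) channel with the PAN band, and
    convert back. Any PAN/intensity mismatch becomes a spectral shift."""
    def px(p):
        h, s, _ = colorsys.rgb_to_hsv(p[0], p[1], p[2])
        return colorsys.hsv_to_rgb(h, s, p[3])   # p[3] = co-located PAN value
    stacked = np.concatenate([np.asarray(ms_up, float),
                              np.asarray(pan, float)[..., None]], axis=-1)
    return np.apply_along_axis(px, -1, stacked)

ms = np.array([[[0.8, 0.4, 0.4]]])   # toy 1x1 up-sampled MS "image"
pan = np.full((1, 1), 0.6)           # PAN darker than the MS intensity
fused = ihs_substitute(ms, pan)      # hue/saturation kept, intensity swapped
```

When the PAN value exactly matches the synthetic intensity, the MS pixel is returned unchanged; any deviation alters the reconstructed RGB values, which is precisely the mismatch described above.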
To systematically simulate and test the proposed index’s sensitivity to this degradation, the image is converted to the HSV (or HSI) color space, where the chromatic (H, S) and brightness (V) components can be manipulated independently. A hue shift is simulated by adding a constant offset to the entire hue channel. This effectively “rotates” all colors on the color wheel, creating a global color cast (e.g., shifting all blues toward purple). A saturation shift is simulated by multiplying the saturation channel by a scaling factor, which modifies the “purity” or “vividness” of the colors. The severity of the distortion is precisely controlled by the magnitude of the hue offset (a larger offset produces a larger shift) and by how far the saturation scaling factor deviates from one (values above one cause oversaturation, while values below one cause desaturation or “washout”).
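A minimal Python sketch of these two chromatic degradations (illustrative only; `hue_shift`, `saturation_shift`, `delta_h`, and `beta` are hypothetical names, and the standard-library `colorsys` routines stand in for whatever color-space code the authors used):

```python
import colorsys
import numpy as np

def _per_pixel(rgb, fn):
    """Apply an HSV-space transform fn(h, s, v) -> (h, s, v) per pixel."""
    def px(p):
        h, s, v = colorsys.rgb_to_hsv(*p)
        return colorsys.hsv_to_rgb(*fn(h, s, v))
    return np.apply_along_axis(px, -1, np.asarray(rgb, float))

def hue_shift(rgb, delta_h):
    """Rotate every hue by delta_h (fraction of a full turn; 1/3 = 120 deg)."""
    return _per_pixel(rgb, lambda h, s, v: ((h + delta_h) % 1.0, s, v))

def saturation_shift(rgb, beta):
    """Scale saturation by beta (> 1 oversaturates, < 1 washes out)."""
    return _per_pixel(rgb, lambda h, s, v: (h, min(beta * s, 1.0), v))

# Toy 1x1 "image": one mid-saturation red pixel in [0, 1] RGB
pixel = np.array([[[0.8, 0.4, 0.4]]])
shifted = hue_shift(pixel, 1 / 3)      # red rotated 120 deg -> green
washed = saturation_shift(pixel, 0.5)  # half the original saturation
```

The hue channel is cyclic, so the offset is applied modulo one full turn; the saturation channel is clipped to [0, 1] after scaling.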
3.8.2. Non-Linear Intensity Mismatch
This artifact represents a complex form of spectral distortion where the brightness of the image is altered in a non-linear fashion. Unlike a simple, uniform brightening or darkening, this mismatch disproportionately affects different intensity levels, often compressing or expanding the mid-tones while leaving the darkest and brightest pixels relatively unchanged. This frequently leads to a “washed-out” or, conversely, an “overly dark” and “crushed” appearance.
This type of distortion is a common failure mode in MRA methods. These algorithms work by extracting high-frequency spatial details from the PAN image and then using an “injection model” to add them to the up-sampled MS bands. The amount of detail added is controlled by injection gains. When these gains are incorrectly calculated, either by “over-injecting” (adding too much detail) or “under-injecting” (adding too little), the resulting change in brightness is not uniform. Such non-linear intensity shifts can significantly alter the image’s radiometric values, which is particularly detrimental for quantitative analysis as it corrupts the accuracy of derived products like vegetation indices [
6]. This non-linear degradation is effectively simulated by applying a power function, commonly known as gamma correction, to the image. To isolate the brightness component from the color information, this operation is performed on the value (V) channel in the HSV color space:
$$V' = V^{\gamma}$$
The severity of this mismatch is controlled by the gamma parameter (γ). A value of γ = 1.0 results in no change. However, a value of γ < 1 brightens the image by boosting the mid-tones, simulating the washed-out effect of over-injection. Conversely, a value of γ > 1 darkens the image by compressing the mid-tones, mimicking the effect of under-injection. The further γ deviates from 1.0, the more severe the non-linear spectral distortion.
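Assuming the gamma operation takes the common form V′ = V^γ on a value channel normalized to [0, 1], the degradation can be sketched in Python as follows (illustrative only, not the authors' MATLAB implementation):

```python
import colorsys
import numpy as np

def gamma_mismatch(rgb, gamma):
    """Apply V' = V**gamma to the HSV value channel only, leaving hue and
    saturation untouched. With V in [0, 1], gamma < 1 brightens the
    mid-tones (over-injection) and gamma > 1 darkens them (under-injection)."""
    def px(p):
        h, s, v = colorsys.rgb_to_hsv(*p)
        return colorsys.hsv_to_rgb(h, s, v ** gamma)
    return np.apply_along_axis(px, -1, np.asarray(rgb, float))

gray = np.full((1, 1, 3), 0.25)         # dark gray test pixel
brightened = gamma_mismatch(gray, 0.5)  # 0.25**0.5 = 0.50
darkened = gamma_mismatch(gray, 2.0)    # 0.25**2.0 = 0.0625
```

Because the power function fixes V = 0 and V = 1 while bending the curve in between, the darkest and brightest pixels are left relatively unchanged while the mid-tones are compressed or expanded, exactly the non-linear behavior described above.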
3.9. Experimental Results with Spectral Degradations
3.9.1. Objective Evaluation
This experiment evaluates the robustness of the proposed metric against manually induced spectral degradations. The primary objective is to verify the monotonicity of the metric; that is, the quality score should exhibit a consistent increase as the severity of the spectral distortion (specifically the hue shift, saturation shift, and non-linear intensity gamma) increases. A higher score must reliably indicate a strictly worse result without suffering from premature saturation or insensitivity.
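The monotonicity requirement can be checked mechanically: for each degradation type, the sequence of scores at increasing severity levels must be strictly increasing. A minimal sketch with made-up placeholder scores (not values from the experiments):

```python
import numpy as np

def is_strictly_monotonic(scores):
    """True if each score is strictly larger than the previous one, i.e.
    the metric penalizes every increase in degradation severity."""
    scores = np.asarray(scores, dtype=float)
    return bool(np.all(np.diff(scores) > 0))

# Hypothetical scores at five increasing severity levels
proposed = [0.30, 0.42, 0.55, 0.67, 0.78]    # gradual, monotonic rise
saturating = [0.85, 0.88, 0.89, 0.89, 0.89]  # premature saturation: plateaus
print(is_strictly_monotonic(proposed))       # True
print(is_strictly_monotonic(saturating))     # False (flat tail)
```

A metric that plateaus early, like the second sequence, still ranks mild distortions correctly but cannot discriminate between moderate and severe ones.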
The assessment of the continuous trends presented in
Figure 15,
Figure 16,
Figure 17 and
Figure 18 benchmarks the performance of the proposed method against established indices:
,
, and
.
For the GS method on the IK dataset, as shown in
Figure 15, the proposed method demonstrates superior capability in quantifying spectral distortions.
In the saturation shift analysis, the curve exhibits excessive sensitivity, marked by a steep initial spike to a near-maximum score (>0.8) at the lowest degradation level (), indicating premature saturation. In contrast, the proposed method follows a nearly linear trajectory with a consistent, gradual rise in penalty. Furthermore, in the intensity gamma evaluation (), standard displays a significant lack of sensitivity, remaining flat, whereas saturates rapidly. The proposed method avoids these extremes, yielding a balanced and monotonic curve that accurately reflects the increasing magnitude of degradation.
An assessment of the WV-2 dataset, as shown in
Figure 16, reinforces these findings. The
curve again exhibits excessive sensitivity to saturation shifts, spiking early, while the proposed method maintains a consistent, gradual trajectory. Similarly, in the intensity gamma evaluation (
), standard
remains insensitive (score
at
), while
rises sharply. The proposed method provides a stable compromise, offering a balanced and monotonic response.
The results for the WV-3 dataset, shown in
Figure 17, corroborate the robust stability of the proposed index, particularly where others fail. Here,
exhibits a critical failure in the saturation shift test, plateauing immediately (>0.9) at the initial degradation level (
).
Conversely, the proposed method initiates at a moderate penalty (∼0.3) and rises monotonically. A similar trend is observed for the intensity gamma, where the proposed metric bridges the gap between the insensitivity of and the premature saturation of , ensuring a reliable assessment of radiometric consistency.
Finally, an evaluation on the WV-4 dataset, as shown in
Figure 18, demonstrates the proposed method’s balanced sensitivity. While
reacts disproportionately to initial saturation shifts (score
) and
remains largely unresponsive, the proposed method exhibits a moderate, monotonic trajectory (reaching
at
). In intensity gamma tests, it avoids the negligible response of
and the sharp spikes of
and
, providing a strictly monotonic response curve that effectively characterizes varying degrees of spectral degradation.
3.9.2. Visual Analysis with Spectral Degradations
Figure 19 provides a visual analysis of a GS-fused image from the IK dataset, showing three types of manually generated spectral distortions at different severity levels. The figure is organized in a 3 × 3 grid, with each row dedicated to one type of artifact.
The top row (a, b, c) illustrates the hue shift distortion. The degradation begins subtly in (a) with a slight, unnatural color cast (), barely perceptible in the image. In (b), the hue shift becomes more pronounced, reaching a noticeable color alteration (). By (c), the effect is most severe, resulting in a complete misrepresentation of the original scene’s colors, which now present a significant deviation from natural hues, disrupting the visual integrity of the image.
The middle row (d, e, f) is dedicated to the saturation shift effect. A slight, noticeable oversaturation is introduced in (d) (). This effect intensifies in (e) () and culminates in (f) (), where colors appear excessively saturated, creating an almost “cartoon-like” quality with noticeable color bleeding, which severely alters the image’s realism.
The bottom row (g, h, i) demonstrates the intensity gamma distortion (non-linear mismatch), focusing on the impact of over-brightening (). The row begins with (g) (), where a severe brightening effect is observed, making the image appear “washed-out,” with a significant loss of contrast. In (h) (), the image is slightly less washed out, but still lacks the depth and richness of the original. By (i) (), the image shows minimal distortion and is closest to the original contrast, though still exhibiting slight differences in brightness levels. This gamma shift highlights how small adjustments in intensity can cause perceptible distortions in image clarity.
3.10. Consistency Analysis of NR and FR Metrics
To quantitatively validate the performance of the proposed NR metric, its scores must be benchmarked against established FR metrics. In this evaluation, the FR metrics (CC, SAM, and SID) are treated as the objective “ground truth” for image quality, as determined in the reduced-resolution validation protocol.
The agreement between the proposed NR metric and these FR metrics is assessed using three standard statistical criteria: the Spearman Rank-Order Correlation Coefficient (SROCC), the Pearson Linear Correlation Coefficient (PLCC), and the Root Mean Square Error (RMSE). These metrics constitute the standard protocol for validating image quality assessment algorithms [
41,
42] and are extensively employed in evaluating recent remote sensing and multi-focus image fusion frameworks [
43,
44].
Before calculating the PLCC and RMSE, a non-linear logistic regression is applied to the raw NR metric scores to map them onto the same scale as the FR scores, yielding a set of mapped predicted scores. This step is necessary because the raw NR scores and the FR ground truth scores may not be linearly related, even if they are monotonically associated.
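The text does not reproduce the regression form, but a common choice in IQA validation is the five-parameter logistic of the VQEG protocol; the sketch below (with hypothetical helper names `logistic_5p` and `map_scores`) illustrates the fitting step with SciPy, under that assumption:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic_5p(q, b1, b2, b3, b4, b5):
    """Five-parameter logistic (VQEG-style) often used to map raw metric
    scores onto the ground-truth scale before computing PLCC/RMSE."""
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (q - b3)))) + b4 * q + b5

def map_scores(raw_nr, fr_truth):
    """Least-squares fit of the logistic to (raw NR, FR) pairs; returns
    the mapped predicted scores on the FR scale."""
    p0 = [np.max(fr_truth), 1.0, np.mean(raw_nr), 0.0, np.mean(fr_truth)]
    params, _ = curve_fit(logistic_5p, raw_nr, fr_truth, p0=p0, maxfev=20000)
    return logistic_5p(np.asarray(raw_nr, float), *params)

# Synthetic check: a monotone (here linear) NR/FR relation is recovered
raw = np.linspace(0.0, 1.0, 20)
truth = 0.2 + 0.6 * raw
mapped = map_scores(raw, truth)
```

The monotone logistic changes the scale of the scores without changing their rank order, so SROCC is unaffected by this mapping while PLCC and RMSE become meaningful.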
3.10.1. Pearson Linear Correlation Coefficient (PLCC)
The PLCC measures the prediction accuracy of the NR metric after non-linear mapping. It quantifies the linear correlation between the mapped NR metric scores $\hat{y}_i$ and the FR ground truth scores $y_i$. The PLCC is calculated as
$$\mathrm{PLCC} = \frac{\sum_{i=1}^{N} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}\,\sqrt{\sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})^2}},$$
where $\bar{y}$ and $\bar{\hat{y}}$ are the means of the ground truth scores and the mapped predicted scores, respectively.
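A direct NumPy transcription of the PLCC definition (the helper name `plcc` is illustrative):

```python
import numpy as np

def plcc(y_true, y_pred):
    """Pearson linear correlation between ground-truth FR scores and
    (already mapped) NR scores."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    dt = y_true - y_true.mean()
    dp = y_pred - y_pred.mean()
    return float(np.sum(dt * dp) / np.sqrt(np.sum(dt**2) * np.sum(dp**2)))

print(plcc([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0 (perfect linear agreement)
```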
3.10.2. Spearman Rank-Order Correlation Coefficient (SROCC)
The SROCC measures the prediction monotonicity of the NR metric. It is a non-parametric test that assesses how well the rank order of the NR scores matches the rank order of the FR scores, without assuming a linear relationship. This is crucial for quality assessment, as a good metric must at least agree on which images are better or worse than others.
The SROCC is calculated as
$$\mathrm{SROCC} = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N(N^2 - 1)}, \qquad d_i = r_i - \hat{r}_i,$$
where $N$ is the total number of fused images, $r_i$ is the rank of the ground truth FR score $y_i$, and $\hat{r}_i$ is the rank of the predicted NR score $\hat{y}_i$.
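A transcription of the rank formula (valid when there are no ties; `scipy.stats.spearmanr` handles the general case, and the helper name `srocc` is illustrative):

```python
import numpy as np
from scipy.stats import rankdata

def srocc(y_true, y_pred):
    """Spearman rank-order correlation via the classic d_i formula."""
    r = rankdata(y_true)        # ranks of the FR ground truth scores
    r_hat = rankdata(y_pred)    # ranks of the predicted NR scores
    d = r - r_hat
    n = len(r)
    return float(1.0 - 6.0 * np.sum(d**2) / (n * (n**2 - 1)))

# A perfectly monotone (but non-linear) relation still yields SROCC = 1
print(srocc([0.1, 0.4, 0.5, 0.9], [1, 16, 25, 81]))   # 1.0
```

Because only ranks enter the formula, any monotone distortion of the scores leaves SROCC unchanged, which is why it is the natural criterion for monotonicity.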
3.10.3. Root Mean Square Error (RMSE)
The RMSE measures the prediction error. After applying the non-linear logistic mapping, the RMSE quantifies the average magnitude of the error (or residuals) between the mapped NR scores $\hat{y}_i$ and the FR ground truth scores $y_i$. The RMSE is calculated as
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}.$$
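And the corresponding NumPy transcription (illustrative helper name `rmse`):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square residual between mapped NR scores and FR scores."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]))   # sqrt(1/3) ≈ 0.577
```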
For a high-performing NR method, the SROCC and PLCC values should be high (closer to one), while the RMSE value should be as low as possible.
3.11. Quantitative Validation Results
To quantitatively assess the proposed MVG-SDI, its performance was benchmarked against the competing NR metrics,
and
. This evaluation was conducted using the RR validation protocol, treating the FR metrics CC, SAM, and SID as the ground truth. The performance was measured using the SROCC for monotonicity, the PLCC for accuracy after non-linear mapping, and the RMSE for prediction error. A superior NR metric should exhibit high SROCC and PLCC values alongside low RMSE values. The results across
Table 8 (IK and WV-2 datasets) and
Table 9 (WV-3 and WV-4 datasets) are discussed below, with the best value for each comparison highlighted in red in the corresponding tables.
The proposed MVG method demonstrated consistently strong performance, particularly when benchmarked against the CC and SAM metrics. Against CC, the proposed method achieved the top SROCC, PLCC, and RMSE values on the IK, WV-2, and WV-3 datasets. Against SAM, it secured the best performance across all three correlation criteria (SROCC, PLCC, RMSE) on the IK, WV-2, and WV-3 datasets. While its performance against SID was generally strong (e.g., best SROCC and PLCC on WV-3), the metric showed slightly better correlation and lower error against SID on the IK and WV-2 datasets. On the WV-4 dataset, the proposed method performed well but was outperformed by . Overall, the proposed index proved to be a highly effective and generally consistent metric, especially for predicting CC and SAM.
In contrast, showed variable performance. Its SROCC and PLCC values were often lower than the proposed method, especially against CC on the IK (SROCC 0.7308) and WV-3 (PLCC 0.5306) datasets. It exhibited a notably high RMSE (0.3686) when compared against SAM on the WV-2 dataset, indicating significant prediction error in that scenario. However, performed exceptionally well on the WV-4 dataset, achieving the best SROCC, PLCC, and RMSE against CC, the best PLCC and RMSE against SAM (tying for SROCC), and tying for the best performance against SID. This indicates can be highly accurate under certain conditions but lacks the overall consistency of the proposed method.
exhibited highly inconsistent performance. It consistently performed extremely well when benchmarked against SID, achieving the top SROCC, PLCC, and RMSE values on the IK and WV-2 datasets. However, its performance against CC and SAM was often poor. Against CC, it yielded very low SROCC and PLCC values on the WV-2 and WV-3 datasets. Against SAM, it produced a high RMSE on the WV-3 (0.4130) and WV-2 (0.3345) datasets, indicating significant prediction errors. Despite these weaknesses, performed very strongly on the WV-4 dataset, tying for the best SROCC against CC and SAM. This confirms that is highly effective for predicting SID but is unreliable for predicting CC and SAM across different datasets.
3.12. Computational Complexity Analysis
To evaluate the practical efficiency of the proposed MVG-SDI, a runtime comparison against three widely used NR metrics, , , and , was conducted. The computational complexity of the proposed method is primarily determined by the feature extraction and statistical fitting processes.
Theoretically, the complexity of extracting the FDD and CM features is linear with respect to the number of pixels N, i.e., O(N). The subsequent MVG modeling involves calculating the covariance matrix of the feature vectors. With a fixed feature dimension and a number of patches M proportional to the image size, the fitting process is efficient, scaling as O(M). Consequently, the overall computational complexity of the algorithm remains O(N), ensuring it scales linearly with image resolution.
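This scaling argument can be illustrated with a toy sketch: with a fixed feature dimension d, fitting the MVG reduces to one mean vector and one d × d covariance matrix, at O(M·d²) cost, which is linear in the number of patches M. The random features below are placeholders for the actual FDD/CM features, not the authors' feature extractor:

```python
import numpy as np

def fit_mvg(features):
    """Fit a multivariate Gaussian (mean, covariance) to an M x d feature
    matrix. Cost is O(M * d^2); with d fixed, this is linear in M."""
    mu = features.mean(axis=0)
    centered = features - mu
    sigma = centered.T @ centered / (len(features) - 1)
    return mu, sigma

rng = np.random.default_rng(0)
M, d = 1000, 8                       # M patches, fixed feature dimension d
feats = rng.normal(size=(M, d))      # placeholder for FDD/CM feature vectors
mu, sigma = fit_mvg(feats)
print(mu.shape, sigma.shape)         # (8,) (8, 8)
```

Doubling the image resolution doubles M but leaves d unchanged, so the per-patch work, and hence the total runtime, grows linearly with the pixel count.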
To validate this empirically, the average execution time across the entire IKONOS dataset, which comprises 200 images with a spatial resolution of pixels, was measured. All experiments were performed on a computer equipped with an Intel Core i5-1035G1 CPU @ 1.19 GHz and 16 GB of RAM, running on a 64-bit operating system. The algorithms were implemented in MATLAB R2024a.
Table 10 presents the average execution times. The results indicate that the proposed MVG-SDI is computationally efficient for practical applications. While it requires slightly more processing time than the simpler
and
indices due to the statistical modeling involved, it is approximately 35% faster than the advanced
metric. This demonstrates that the proposed method offers a favorable trade-off, providing sophisticated spectral distortion detection with a runtime comparable to established benchmarks.