A Review of Image Fusion Algorithms Based on the Super-Resolution Paradigm

A critical analysis of remote sensing image fusion methods based on the super-resolution (SR) paradigm is presented in this paper. Very recent algorithms have been selected among the pioneering studies adopting a new methodology and the most promising solutions. After introducing the concept of super-resolution and modeling the approach as a constrained optimization problem, different SR solutions for spatio-temporal fusion and pan-sharpening are reviewed and critically discussed. Concerning pan-sharpening, the well-known, simple, yet effective, proportional additive wavelet in the luminance component (AWLP) is adopted as a benchmark to assess the performance of the new SR-based pan-sharpening methods. Results are finally presented in terms of the widespread quality indexes computed at degraded resolution, with the original multispectral image used as the reference, i.e., SAM (Spectral Angle Mapper) and ERGAS (Erreur Relative Globale Adimensionnelle de Synthèse). Considering these results, sparse representation and Bayesian approaches seem far from mature enough to be adopted in operational pan-sharpening scenarios.


Introduction
Recent trends in image fusion, including remote sensing applications, involve the super-resolution (SR) paradigm and, more generally, apply constrained optimization algorithms to solve the ill-posed problem of spectral-spatial (pan-sharpening) and spatio-temporal image resolution enhancement. Specifically, pan-sharpening denotes the merging of a monochrome image acquired by a broadband panchromatic (Pan) instrument with a multispectral (MS) image featuring a spectral diversity of bands and acquired over the same area, with a spatial resolution greater for the former. This can be seen as a particular problem of data fusion, in which the goal is to combine the spatial details resolved by the Pan instrument, but not by the MS scanner, and the spectral diversity of the MS image, against the single band of Pan, into a unique product. The most commonly-encountered case is when the MS and Pan datasets are acquired at the same date. However, multitemporal pan-sharpening denotes the process by which the MS and Pan datasets used to perform the data fusion task are acquired from the same platform, but at different times, or from different platforms. In the latter case, we may talk of multi-platform pan-sharpening. A typical application scenario is when each platform mounts only one of the MS and Pan instruments, for example, CartoSat-1 (Pan geocoded at 2.5 m) and RapidEye (MS geocoded at 5 m). In this case, pan-sharpening is multi-platform and is most likely to be also multitemporal [1].
The majority of pan-sharpening methods may be labeled as spectral or spatial. In spectral methods, geometric details are extracted from the Pan image by subtracting from it an intensity image obtained by a spectral transformation of the MS bands. In spatial methods, geometric details are extracted from the Pan image by subtracting from it a low-pass version of Pan obtained by means of linear shift-invariant digital filters. Finally, for both approaches, the geometric details are injected into the MS bands interpolated at the scale of the panchromatic band. Spectral methods [2][3][4][5][6][7][8][9][10] are traditionally known as component-substitution (CS), though explicit calculation of the spectral transform and its inverse may not be necessary. Spatial methods [11][12][13][14][15][16][17][18] may be contextualized within multiresolution analysis (MRA), though in most cases, a unique low-pass filter is required [19]. This hard categorization traces back to previous studies [20,21], in which it is proven that there exists a duality between the classes of spectral and spatial methods, featuring complementary properties of robustness to spatial and spectral impairments, respectively.
Super-resolution fusion methods form a new, third class of spectral-spatial (pan-sharpening) and spatio-temporal image resolution enhancement algorithms. Conventional approaches to generating an SR image normally require inputting multiple spatial/spectral/temporal low-resolution images of the same scene. The SR task is cast as the inverse problem of recovering the original high-resolution image by fusing the low-resolution images, based on reasonable assumptions or prior knowledge about the observation model that maps the high-resolution image into the low-resolution ones. The fundamental reconstruction constraint for SR is that the recovered image, after applying the same generation model, should reproduce the observed low-resolution images. However, SR image reconstruction is generally a severely ill-posed problem because of the insufficient number of low-resolution images, ill-conditioned registration and unknown blurring operators, so the solution from the reconstruction constraint is not unique. Various regularization methods have been proposed to further stabilize the inversion of this ill-posed problem [22].
A similar approach considers image fusion as a restoration problem. The aim is therefore to reconstruct the original scene from a degraded observation or, equivalently, to solve a classical deconvolution problem [23,24]. As an example of possible application fields, these methods may solve the classical strip-line degradation problem in satellite optical imagery, e.g., Landsat 7 ETM+, MODIS, etc. [25]. Prior knowledge is required on the nature of the two-dimensional convolution that models the band-dependent point spread function of the imaging system. There is a spectral model between the Pan channel and the MS channels of the same sensor, notwithstanding that the corresponding images feature different spatial resolutions, that is, different spatial frequency contents. Such a model is well embodied by the plots of the spectral responsivities of the individual channels of the complete sensor (MS and Pan instruments mounted on the same platform) or, in the most general case, the spectral responses of different MS + Pan sensors. While individual narrowband channels (e.g., B, G, R and NIR) approximately cover the same wavelength intervals, the bandwidth of Pan may significantly vary from one instrument to another. Older instruments, like SPOT 1-3 and 5, featured a narrowband Pan (approximately spanning 500-700 nm). Modern very high resolution (VHR) and extremely high resolution (EHR) MS scanners are generally equipped with a broadband Pan instrument covering the wavelengths from 450 nm to 800 nm or even 900 nm [26].
Bayesian methods and variational methods have also been proposed in the last decade, with different possible solutions that are based on specific assumptions that make the problem mathematically tractable [27][28][29][30][31].
The paper reviews the concept of super-resolution in an image fusion framework by resorting to the theoretical interpretation of image super-resolution as a constrained optimization problem. Different SR solutions for spatio-temporal fusion and pan-sharpening are reviewed and critically discussed. The distinctive feature of the paper is the reviewed methodology, i.e., the super-resolution paradigm, which includes constrained-optimization solutions, sparse representation methods and Bayesian restoration approaches. The broad application field is remote sensing image fusion, while specific applications, i.e., spatio-temporal fusion, fusion with missing data (destriping/denoising) and pan-sharpening, have been reviewed within the common framework of the adopted methodology. Finally, pan-sharpening has been selected for the objective assessment of SR-based methods with respect to the simple, classical, widespread proportional additive wavelet in the luminance (AWLP) method, which serves as a benchmark for clear and immediate comparisons.
A different review philosophy, limited to pan-sharpening but extended to several methodological approaches, has been very recently proposed in [32]. The author deeply investigates the models adopted by each reviewed method and makes comparisons starting from the physical consistency of the adopted models for pan-sharpening. As previously stated, the main objective here is to review image fusion methods in different fusion application fields, all adopting super-resolution methodologies, and to conclude from the study of the recent literature whether sparse representations or Bayesian methods are mature enough to be extensively applied to solve operational remote sensing image fusion tasks.
Finally, focusing on pan-sharpening, the well-known and simple, yet fast and effective, proportional additive wavelet in the luminance component (AWLP) algorithm [33] is adopted as a benchmark to assess the performance of the recently-proposed SR-based pan-sharpening methods. Experimental comparisons on true and simulated images are presented in terms of computational time and of quality indexes computed at the spatial resolution of the original multispectral images, i.e., SAM (Spectral Angle Mapper) and ERGAS (Erreur Relative Globale Adimensionnelle de Synthèse).
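Both indexes follow standard closed-form definitions and are straightforward to compute. The following is a minimal numpy sketch under the conventional conventions (band-interleaved arrays, SAM averaged over pixels in degrees, ERGAS with the Pan-to-MS scale ratio); it is an illustration, not code from the reviewed papers:

```python
import numpy as np

def sam_degrees(ref, fus):
    """Mean Spectral Angle Mapper (degrees) between (rows, cols, bands) images."""
    dot = np.sum(ref * fus, axis=-1)
    norms = np.linalg.norm(ref, axis=-1) * np.linalg.norm(fus, axis=-1)
    cos = np.clip(dot / np.maximum(norms, 1e-12), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

def ergas(ref, fus, ratio):
    """ERGAS = (100 / ratio) * sqrt(mean over bands of (RMSE_k / mean_k)^2)."""
    rmse = np.sqrt(np.mean((ref - fus) ** 2, axis=(0, 1)))
    mu = np.mean(ref, axis=(0, 1))
    return float(100.0 / ratio * np.sqrt(np.mean((rmse / mu) ** 2)))

rng = np.random.default_rng(0)
ref = rng.random((32, 32, 4)) + 0.1                  # synthetic reference MS image
fus = ref + 0.01 * rng.standard_normal(ref.shape)    # slightly distorted "fused" image
sam_val, ergas_val = sam_degrees(ref, fus), ergas(ref, fus, ratio=4)
```

Note that SAM is invariant to a per-pixel global gain (it measures spectral direction only), while ERGAS penalizes band-wise radiometric errors relative to band means.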

Restoration-Based Approaches
A class of recently-developed image fusion methods considers pan-sharpening as a restoration problem. The aim of pan-sharpening is therefore to reconstruct the original scene from a degraded observation or, equivalently, to solve a classical deconvolution problem [23]. Following this approach, each band of a multispectral image, neglecting additive noise, can be modeled as the two-dimensional convolution of the corresponding band at high spatial resolution with a linear shift-invariant blur, that is, the band-dependent point spread function of the imaging system.
We refer to M̃_k as the original multispectral images M_k resampled to the scale of the panchromatic band P (of size N_r × N_c pixels). A degradation model is introduced, for which the M̃_k are obtained as noisy, blurred versions of the ideal multispectral images M_k: M̃_k = M_k ∗ H_k + v_k, k = 1, ..., N_b, where N_b is the number of bands, the symbol ∗ denotes the 2D convolution operation, H_k is the point spread function (PSF) operator for the k-th band and v_k, k = 1, ..., N_b, are additive zero-mean random noise processes. The high-resolution panchromatic image is modeled as a linear combination of the ideal multispectral images plus the observation noise: P = ∑_{k=1}^{N_b} ω_k M_k + ∆ + w, where ∆ is an offset, ω_k, k = 1, ..., N_b, are the weights that satisfy the condition ∑_{k=1}^{N_b} ω_k = 1 and w is an additive zero-mean random noise [34].
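The degradation and Pan models above can be sketched numerically. In this toy example, the box-blur PSF, the weights ω_k, the offset ∆ and the noise level are all hypothetical placeholders, not values from [34]:

```python
import numpy as np

rng = np.random.default_rng(0)
Nb, Nr, Nc = 4, 32, 32
M = rng.random((Nb, Nr, Nc))                 # ideal high-resolution MS bands M_k

def box_blur(img, size=3):
    """Toy shift-invariant PSF H_k: separable normalized box filter, wrap padding."""
    ker = np.ones(size) / size
    pad = size // 2
    out = img
    for axis in (0, 1):
        out = np.apply_along_axis(
            lambda v: np.convolve(np.pad(v, pad, mode="wrap"), ker, mode="valid"),
            axis, out)
    return out

omega = np.array([0.3, 0.3, 0.2, 0.2])       # hypothetical Pan weights, sum to 1
delta = 0.05                                  # hypothetical offset
sigma = 0.01                                  # noise standard deviation

# Degradation model: blurred, noisy MS observations M~_k = M_k * H_k + v_k
M_tilde = np.stack([box_blur(M[k]) for k in range(Nb)]) \
    + sigma * rng.standard_normal((Nb, Nr, Nc))
# Pan model: P = sum_k omega_k M_k + delta + w
P = np.tensordot(omega, M, axes=1) + delta + sigma * rng.standard_normal((Nr, Nc))
```

The wrap-padded box filter preserves the image mean exactly, which makes the forward model easy to sanity-check.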
The weights ω_k can be calculated from the normalized spectral response curves of the multispectral sensor [34] or by linear regression of the degraded panchromatic image P_d and the original multispectral bands M_k [2]. The offset ∆ is approximately calculated from the degraded panchromatic image and the sensed low-resolution multispectral images, where R indicates the scale ratio between the original multispectral and panchromatic images. The rationale of Equation (3) is the assumption of the approximate scale invariance of the offset ∆ defined in the Pan model in Equation (2), at least between the Pan scale in Equation (2) and the MS scale in Equation (3). The ideal high-resolution multispectral image can be estimated by solving a constrained optimization problem. In Li and Leung [34], the restored image is obtained by applying a regularized constrained least squares (CLS) algorithm in the discrete sine transform (DST) domain to achieve sparse matrix computation. The solution is calculated row by row by applying the regularized pseudoinverse filter (Fᵀ F + λ I)⁻¹ Fᵀ to the m-th rows of the DST coefficients of M̃_k and P, where I is the identity matrix and F is an (N_b + 1)N_c × (N_b + 1)N_c sparse matrix that is computed from the weights ω_k in Equation (2), the point spread function operators H_k in Equation (1) and the DST transform matrix. Finally, λ is the regularization parameter that controls the degree of smoothness of the solution: when λ → 0, Equation (4) reduces to the unconstrained least squares solution, and when λ → ∞, Equation (4) becomes the ultra-smooth solution.
The main drawbacks of restoration-based methods are the inaccuracies of the observation models in Equations (1) and (2): the PSF operators H_k are assumed to be known, but they often differ from their nominal values. Furthermore, the optimal value of the regularization parameter λ is empirically calculated and can vary from sensor to sensor and even with the particular scenario.
The adoption of transformed coefficients in the CLS solution of Equation (4) is required to obtain sparse matrices and to reduce the computational complexity, which is O(N_c^β N_r), with 2 < β < 3. On the other hand, when working in a Fourier-related domain such as the DST, an intrinsically-smoothed solution is obtained from Equation (4), and poorly-enhanced pan-sharpened images are often produced.

Sparse Representation
A new signal representation model has recently become very popular and has attracted the attention of researchers working in the field of image fusion, as well as in several other areas. In fact, natural images satisfy a sparse model, that is, they can be seen as the linear combination of a few elements of a dictionary, or atoms. Sparse models are at the basis of compressed sensing [35], which is the representation of signals with a number of samples at a sub-Nyquist rate. In mathematical terms, the observed image is modeled as y = Ax + w, where A is the dictionary, x is a sparse vector, such that ||x||_0 ≤ K, with K ≪ M, where M is the dimension of x, and w is a noise term that does not satisfy a sparse model. In this context, fusion translates into finding the sparsest vector x under the constraint ||y − Ax||²_2 < ε, where ε accounts for the noise variance. The problem is NP-hard, but it can be relaxed into a convex optimization one by substituting the pseudo-norm ||·||_0 with ||·||_1 [35].
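Sparse recovery under this model can be sketched in a few lines of numpy. The sketch below uses a greedy orthogonal matching pursuit in place of the ℓ1 relaxation (both are standard solvers for this problem), on a synthetic Gaussian dictionary and a 3-sparse signal; dimensions and seeds are arbitrary:

```python
import numpy as np

def omp(A, y, K):
    """Greedy orthogonal matching pursuit: find an (at most) K-sparse x with y ~ Ax."""
    residual = y.astype(float).copy()
    support, x = [], np.zeros(A.shape[1])
    coef = np.zeros(0)
    for _ in range(K):
        j = int(np.argmax(np.abs(A.T @ residual)))   # atom most correlated with residual
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)  # re-fit on support
        residual = y - A[:, support] @ coef
    x[support] = coef
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((64, 128))
A /= np.linalg.norm(A, axis=0)               # unit-norm atoms (the dictionary)
x_true = np.zeros(128)
x_true[[5, 40, 99]] = [1.5, -2.0, 0.7]       # a 3-sparse signal
y = A @ x_true                                # noiseless measurements
x_hat = omp(A, y, K=3)
```

With 64 random measurements and only 3 active atoms, greedy recovery of this kind succeeds with overwhelming probability, which is exactly the regime compressed sensing theory describes.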

Sparse Image Fusion for Spatial-Spectral Fusion
The pioneering paper by Li and Yang [36] formulated the remote sensing imaging formation model as a linear transform corresponding to the measurement matrix in compressed sensing (CS) theory [35]. In this context, the high-resolution panchromatic and low-resolution multispectral images are referred to as measurements, and the high-resolution MS images can be recovered by applying sparsity regularization.
Formally, it is assumed that any lexicographically-ordered spatial patch of the observed images, namely y_MS and y_PAN, can be modeled as y = M x + ν, where y = [y_MS; y_PAN], M = [M_1; M_2], M_1 and M_2 indicate the decimation matrix and the panchromatic-model matrix, respectively, x is the unknown high-resolution MS image and ν is an additive Gaussian noise term.
The goal of image fusion is to recover x from y. If the signal is compressible by a sparsity transform, the CS theory ensures that the original signal can be accurately reconstructed from a small set of incomplete measurements. Thus, the signal recovery problem of Equation (5) can be formulated as a minimization problem with sparsity constraints: α̂ = arg min ||α||_1 subject to ||y − M D α||²_2 ≤ ε, where D is a dictionary and x = Dα, which explains x as a linear combination of columns from D. The vector α is very sparse. Finally, the estimated x can be obtained by x̂ = D α̂.
The resulting pan-sharpening scheme is illustrated in Figure 1. All of the patches of the panchromatic and multispectral images are processed in raster-scan order, from top-left to bottom-right, with steps of four pixels in the PAN image and one pixel in the MS images (a 1/4 ratio is assumed between the PAN and MS spatial scales, as in several spaceborne sensors). First, the PAN patch y_PAN is combined with the MS patch y_MS to generate the vector y. Then, the sparsity regularization of Equation (6) is resolved using the basis pursuit (BP) method [43] to get the sparse representation α of the fused MS image patch. Finally, the fused MS image patch is obtained by x = Dα. The generation of the dictionary D is the key problem of all CS-based pan-sharpening approaches. In Li and Yang [36], the dictionary was generated by randomly sampling raw patches from high-resolution MS satellite images. Since such images are not available in practice, [36] reduces to a theoretical investigation of the applicability of compressed sensing to pan-sharpening. More recent papers have proposed different solutions to this problem, in order to deal with practical remote sensing applications. In Li, Yin and Fang [37], the sparse coefficients of the PAN image and of the low-resolution MS image are obtained by the orthogonal matching pursuit algorithm. Then, the fused high-resolution MS image is calculated by combining the obtained sparse coefficients and the dictionary for the high-resolution MS image. The main assumption is that the dictionaries D_ms_h, D_pan and D_ms_l are mutually related (Equations (7) and (8)). First, D_pan and D_ms_l are computed from randomly-selected samples of the available Pan and MS data by applying the K-SVD method [44]. The dictionary D_ms_h is then estimated by applying an iterative gradient descent method to solve a minimization problem based on the MS dictionary model of Equation (8).
Obviously, the computational complexity of the method is huge, while the improvement with respect to effective classical pan-sharpening algorithms is negligible. As an example, the algorithm proposed in Li, Yin and Fang [37] requires about 15 min on a very small (64 × 64) MS image, while, considering the same hardware and programming software configurations, pan-sharpening methods based on multiresolution analysis (MRA) [11] or component substitution [9] provide pan-sharpened images of the same quality (measured by the QNR (Quality with No Reference), Q4 (a quality index specific to 4-band images) and ERGAS score indexes) in one or a few seconds.
An interesting solution to the problem of the high computational complexity has been very recently proposed in [45]. Conversely to the previous sparse approaches to pan-sharpening, in which the SR theory was employed to generate the whole pan-sharpened image, [45] proposes to use sparse representation only to reconstruct the high-resolution details. This choice reflects the observation that the key assumption of sparsity is more appropriate for image parts showing high variance (regions with high spatial frequency components). The algorithm is referred to as SR-based details injection (SR-D). In particular, the input multispectral image is tiled into M patches (overlapped by L pixels) of size NR × NR, where R is the resolution ratio between the low-resolution multispectral image and the panchromatic image (R = 4 in most cases), and N is a scalar coefficient tuned by the user. The values of N and L have been empirically set to N = 100 and L = 10. The choice of applying sparse representation to the spatial details only does reduce the computational time with respect to previous SR-based methods. However, the algorithm's performances are comparable to those of optimized classical pan-sharpening methods, as will be shown in Section 6.
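The overlapped tiling step common to these patch-based methods can be sketched as follows. The patch size and overlap here are small illustrative values, not the N = 100, L = 10 of [45]; the clamping of the last row/column so that patches end exactly at the image border is one common implementation choice:

```python
def patch_grid(n_rows, n_cols, patch, overlap):
    """Top-left corners of square patches of side `patch`, overlapping by `overlap`
    pixels, with the last row/column clamped to end exactly at the image border."""
    step = patch - overlap
    rows = list(range(0, n_rows - patch + 1, step))
    cols = list(range(0, n_cols - patch + 1, step))
    if rows[-1] != n_rows - patch:          # clamp so the image is fully covered
        rows.append(n_rows - patch)
    if cols[-1] != n_cols - patch:
        cols.append(n_cols - patch)
    return [(r, c) for r in rows for c in cols]

# 10x10 image, 4x4 patches overlapping by 2 pixels -> a 4x4 grid of corners
corners = patch_grid(10, 10, patch=4, overlap=2)
```

Each corner then indexes one patch to be sparse-coded; overlapping pixels are typically averaged when the patches are re-assembled.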

The SparseFI Family for Pan-Sharpening
Different from Li, Yin and Fang [37], the method proposed in Zhu and Bamler [38], named sparse fusion of images (SparseFI), explores the sparse representation of multispectral image patches in a dictionary trained only from the panchromatic image at hand. Furthermore, it does not assume any spectral composition model of the panchromatic image, that is, it does not adopt a composition model similar to Equation (2), which would imply a relationship between the dictionaries for PAN and MS, as in Equation (7). The method is described synthetically by the scheme reported in Figure 2.
P is a matrix that extracts the region of overlap between the current target patch and the previously-reconstructed ones, while w_k contains the pixel values of the previously-reconstructed HR multispectral image patch on the overlap region. The parameter β is a weighting factor that controls the trade-off between the goodness of fit of the LR input and the consistency of reconstructed adjacent HR patches in the overlapping area. The algorithm's performances are not outstanding [38], since it provides pan-sharpened images with a quality similar to that of adaptive Intensity-Hue-Saturation (IHS) fused products.
An improved version of [38] has been very recently proposed by the same authors [46]. It exploits the mutual correlation among multispectral channels by introducing the concept of the joint sparsity model (JSM). The new Jointly Sparse Fusion of Images, J-SparseFI, algorithm can be seen as the result of three main improvements: the adoption of an enhanced SparseFI algorithm, the definition of a JSM and the introduction of a sensor spectral response analysis followed by a channel mutual correlation analysis. The original SparseFI algorithm has been fully parallelized, with patch processing performed independently and hence distributed to multiple threads without requiring cross communication. This improvement has been introduced for all processing steps: dictionary learning, sparse coefficient estimation and HR multispectral image reconstruction. The joint sparsity model is founded on distributed compressive sensing theory to constrain the solution of an underdetermined system by considering an ensemble of signals that are jointly sparse.
The third improvement of J-SparseFI with respect to SparseFI can be explained starting from the analysis of the WorldView-2 spectral responses and, in particular, the channel mutual correlation of the multispectral and panchromatic sensors. Channels 1-5, 7 and 8 are identified as blocks, i.e., each group is composed of adjacent bands with a mutual correlation higher than 0.9. Among them, Channels 2-5 (blue, green, yellow and red) have a wavelength range well covered by the panchromatic image and, therefore, are identified as the primary group of joint channels. After excluding the primary group of joint channels, the remaining block, i.e., Channels 7 and 8 (NIR-1 and NIR-2), is identified as the secondary group of joint channels. Finally, the remaining Channel 1 (coastal) and Channel 6 (red edge) are identified as individual channels.
The primary group of joint channels, the individual channels and the secondary group of joint channels are then sharpened in a sequential manner. First, the HR version of the primary group of joint channels (blue, green, yellow and red) is reconstructed by JSM using the coupled dictionary pair built up from the HR Pan image and its downsampled version. Then, the coastal channel is reconstructed by modified SparseFI using an updated coupled dictionary pair built up from the previously-reconstructed HR blue channel and its downsampled version, instead of the Pan image, because, among the Pan and the sharpened primary group of joint channels, i.e., Channels 2-5, the blue channel correlates the most with Channel 1. The red edge channel is reconstructed by modified SparseFI using a dictionary pair trained from the HR Pan image and its downsampled image. Finally, the NIR-1 and NIR-2 channels are jointly reconstructed by JSM using a dictionary pair of the previously-reconstructed HR red edge channel and its downsampled version, because of its relatively highest correlation with the target joint channels.

Hybrid SR-Based Approaches for Pan-Sharpening
In Cheng, Wang and Li [39], a method is proposed to generate the high-resolution multispectral (HRM) dictionary from the HRP (high-resolution panchromatic) and LRM (low-resolution multispectral) images. The method includes two steps. The first step is AWLP pan-sharpening to obtain preliminary HRM images. The AWLP algorithm [33] is a well-known pan-sharpening method in which, first, a spectral transformation of the MS bands provides an intensity component; then, a multiresolution transform (the à-trous wavelet, specifically), i.e., a spatial transform, is applied to spatially enhance the intensity component. The second step performs dictionary training using patches sampled from the results of the first step. As in Li, Yin and Fang [37], a dictionary training scheme is designed based on the well-known K-SVD method. The training process incorporates information from the HRP image, which improves the ability of the dictionary to describe spatial details. Since the method includes both classical (in this case, AWLP) and sparse representation-based strategies, it can be categorized as a hybrid SR-based approach. While better quality score indexes are obtained with respect to the AWLP method adopted in its first step, no remarkable improvements are introduced by this method with respect to fast and robust classical component substitution methods, such as Gram-Schmidt Adaptive, Context Adaptive (GSA-CA) [8], as reported in Cheng, Wang and Li [39].
In Huang et al. [42], a spatial and spectral fusion model based on sparse matrix factorization is proposed and tested on Landsat 7 and MODIS acquisitions at the same date. The model combines the spatial information from sensors with high spatial resolution with the spectral information from sensors with high spectral resolution. A two-stage algorithm is introduced to combine these two categories of remote sensing data. In the first stage, an optimal spectral dictionary is obtained from data with low spatial and high spectral resolution to represent the spectral signatures of the various materials in the scene. Given the simple observation that probably only a few land surface materials contribute to each pixel in this kind of image, the problem is formalized as a sparse matrix factorization problem. In the second stage, by using the spectral dictionary developed in the first stage together with data with high spatial and low spectral resolution, the spectrum of each pixel is reconstructed to produce a high spatial and high spectral resolution image via a sparse coding technique.
In synthesis, a clustering- or vector-quantization-based method is adopted to optimize a dictionary on a set of image patches by first grouping patterns such that their distance to a given atom is minimal, and then updating the atom such that the overall distance within the group of patterns is minimal. This process assumes that each image patch can be represented by a single atom in the dictionary, which reduces the learning procedure to a K-means clustering. A generalization of this method for dictionary learning is the K-singular value decomposition (K-SVD) algorithm [44], which represents each patch by using multiple atoms with different weights. In this algorithm, the coefficient matrix and the basis matrix are updated alternately.
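A single K-SVD atom update can be sketched with a rank-1 SVD of the residual restricted to the patches that use that atom. This is the generic update of [44] on synthetic data, not code from the reviewed fusion methods; by construction, the update can never increase the reconstruction error:

```python
import numpy as np

def ksvd_atom_update(D, X, Y, j):
    """One K-SVD step: replace atom j (column of D) and its coefficient row X[j, :]
    by the best rank-1 fit of the residual restricted to patches using atom j."""
    using = np.flatnonzero(X[j, :])
    if using.size == 0:
        return D, X
    # Residual with atom j's contribution removed, on the columns that use it
    E = Y[:, using] - D @ X[:, using] + np.outer(D[:, j], X[j, using])
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    D[:, j] = U[:, 0]                        # new unit-norm atom
    X[j, using] = s[0] * Vt[0, :]            # matching coefficients
    return D, X

rng = np.random.default_rng(5)
D = rng.standard_normal((8, 5))
D /= np.linalg.norm(D, axis=0)               # unit-norm atoms
X = np.zeros((5, 20))
X[0, :10] = rng.standard_normal(10)          # patches 0..9 use atom 0
X[1, 5:] = rng.standard_normal(15)
Y = D @ X + 0.1 * rng.standard_normal((8, 20))

err_before = np.linalg.norm(Y - D @ X)
D, X = ksvd_atom_update(D, X, Y, 0)
err_after = np.linalg.norm(Y - D @ X)
```

A full K-SVD iteration cycles this update over all atoms and re-runs a sparse coding step (e.g., OMP) for the coefficients.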

Sparse Image Fusion for Spatio-Temporal Fusion
Most instruments with fine spatial resolution (e.g., SPOT and Landsat TM, with 10-m and 30-m spatial resolutions) can only revisit the same location on Earth at intervals of half a month to one month, while other instruments with coarse spatial resolution (e.g., MODIS and SPOT VEGETATION, with 250-1000-m spatial resolutions) can make repeated observations in one day. As a result, there is so far still no sensor that can provide both high spatial resolution (HSR) and frequent temporal coverage. One possible cost-effective solution is to explore data integration methods that can blend the two types of images from different sensors to generate synthetic data with high resolution in both space and time, thereby enhancing the capability of remote sensing for monitoring land surface dynamics, particularly in rapidly changing areas. In the example in Figure 3, the goal is to predict the unavailable high-spatial-resolution Landsat image at date t_2 from the Landsat images at dates t_1 and t_3 and the low-spatial-resolution MODIS acquisitions at dates t_1, t_2 and t_3. One critical problem that should be addressed by a spatio-temporal reflectance fusion model is the detection of the temporal change of reflectance over different pixels during an observation period. In general, such a change encompasses both phenology change (e.g., seasonal change of vegetation) and type change (e.g., conversion of bare soil to a concrete surface), and it is considered more challenging to capture the latter than the former in a fusion model.
In Huang and Song [40], a data fusion model, called the sparse-representation-based spatio-temporal reflectance fusion model (SPSTFM), is proposed, which accounts for all of the reflectance changes during an observation period, whether type or phenology change, in a unified way by sparse representation. It allows for learning the structure primitives of signals via an overcomplete dictionary and reconstructing signals through sparse coding. SPSTFM learns the differences between two HSR images and their corresponding low-spatial-resolution (LSR) acquisitions from a different instrument via sparse signal representation. It can predict the high-resolution difference image (HRDI) more accurately than searching similar neighbors for every pixel, because it considers the structural similarity (SSIM), particularly for land-cover type changes. Rather than supposing a linear change of reflectance, as in previous methods, sparse representation can obtain the change prediction in an intrinsically nonlinear form, because sparse coding is a nonlinear reconstruction process that selects the optimal combination of signal primitives.
Formally, the Landsat and MODIS images on date t_i are denoted as L_i and M_i, respectively, where the MODIS images are extended to the same size as Landsat via bilinear interpolation. Let Y_i,j and X_i,j represent the HRDI and the low-resolution difference image (LRDI) between t_i and t_j, respectively, and let y_i,j and x_i,j be their corresponding patches, formed by stacking patches into column vectors. The relationship diagram for these variables is reported in Figure 4. L_2 can then be predicted as L̂_2 = W_1 (L_1 + Ŷ_21) + W_3 (L_3 − Ŷ_32), where W_1 and W_3 are the weighting parameters for the predicted image on t_2 using the Landsat reference images on t_1 and t_3, respectively.
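The two-sided prediction can be sketched as follows. The difference images below are ideal placeholders (in SPSTFM they come from sparse coding), and the uniform weights are also placeholders for the data-driven weighting of [40]; with ideal difference images, the two one-sided predictions coincide for any weights summing to one:

```python
import numpy as np

rng = np.random.default_rng(2)
shape = (16, 16)
L1, L3 = rng.random(shape), rng.random(shape)    # Landsat acquisitions at t1, t3
L2_true = rng.random(shape)                       # the image to predict at t2

# Ideal HRDIs (in SPSTFM these are predicted by sparse coding from the MODIS LRDIs)
Y21_hat = L2_true - L1                            # difference between t2 and t1
Y32_hat = L3 - L2_true                            # difference between t3 and t2

W1, W3 = 0.5, 0.5                                 # placeholder weights, W1 + W3 = 1
L2_hat = W1 * (L1 + Y21_hat) + W3 * (L3 - Y32_hat)
```

In practice, the accuracy of the prediction is governed entirely by how well the sparse-coding stage estimates Ŷ_21 and Ŷ_32.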
Block diagram of the spatio-temporal fusion proposed in Huang and Song [40].
In order to estimate Ŷ_21 and Ŷ_32 in Equation (9), the dictionary pair D_l and D_m must be formulated. The two dictionaries D_l and D_m are trained using the HRDI and LRDI patches between t_1 and t_3, respectively, according to a joint optimization in which Y and X are the column combinations of lexicographically-stacked image patches, sampled randomly from Y_13 and X_13, respectively. Similarly, Λ is the column combination of the representation coefficients corresponding to every column in Y and X.
A different approach has been proposed in Song and Huang [41], which adopts a two-step procedure to avoid the large prediction errors due to the large spatial resolution difference between MODIS and Landsat 7 data. First, it improves the spatial resolution of the MODIS data; then, it fuses the MODIS data with improved spatial resolution and the original Landsat data.
Denote the MODIS image, the Landsat image and the predicted transition image on t_i as M_i, L_i and T_i, respectively. The spatial enhancement of MODIS data by means of the Landsat images consists of two steps: the dictionary-pair training on the known M_1 and L_1 and the transition image prediction. For training a dictionary pair, the high-resolution and low-resolution image features are extracted, in patch form (e.g., 5 × 5), from the difference image space of L_1 − M_1 and from the gradient feature space of M_1, respectively. Stacking these feature patches into columns forms the training sample matrices Y and X, where Y and X stand for the high-resolution and low-resolution samples, respectively, and their columns are in correspondence. First, the low-resolution dictionary D_l is derived by applying the K-SVD [44] training procedure on X via optimizing the objective function {D_l, Λ*} = arg min_{D_l, Λ} ||X − D_l Λ||²_F, subject to a sparsity constraint on the columns of Λ, where Λ is a column combination of the representation coefficients corresponding to every column in X.
To establish a correspondence between the high-resolution and low-resolution training samples, the high-resolution dictionary is constructed by minimizing the approximation error on Y with the same sparse representation coefficients Λ* of Equation (11), that is, D_h = arg min_{D_h} ||Y − D_h Λ*||²_F. The solution of this problem can be directly derived from the following pseudoinverse expression (given that Λ* has full row rank): D_h = Y Λ*ᵀ (Λ* Λ*ᵀ)⁻¹. To predict the transition image T_2 from M_2, the same gradient features X_2 are extracted from M_2 as in the training process. Denote the i-th column of X_2 as x_2i; then, its sparse coefficient α_i with respect to the dictionary D_l can be obtained by employing the sparse coding technique called orthogonal matching pursuit (OMP). Because the corresponding high-resolution and low-resolution samples are enforced to be represented by the same sparse coefficients with respect to D_h and D_l, respectively, the corresponding i-th middle-resolution patch column y_2i can be predicted by y_2i = D_h α_i. The other middle-resolution patch columns are predicted by the same process. After transforming all columns y_2i into patch form, the difference image Y_2 between T_2 and M_2 is predicted. Thus, T_2 is reconstructed by T_2 = Y_2 + M_2. For the fusion procedure in the next stage, the transition image T_1 is also predicted by the same procedure. Here, the transition images T_1 and T_2 have the same size and extent as L_1 and L_2.
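The closed-form construction of the high-resolution dictionary is a plain least squares fit and can be sketched on synthetic data. For simplicity, the coefficient matrix below is dense, whereas in practice Λ* is sparse; the pseudoinverse expression only requires that it have full row rank:

```python
import numpy as np

rng = np.random.default_rng(3)
n_hi, n_atoms, n_samples = 25, 10, 200
Lam = rng.standard_normal((n_atoms, n_samples))   # coefficients (dense toy stand-in)
D_true = rng.standard_normal((n_hi, n_atoms))     # ground-truth HR dictionary
Y = D_true @ Lam                                  # high-resolution training samples

# D_h = Y Lam^T (Lam Lam^T)^-1, the least squares minimizer of ||Y - D_h Lam||_F^2
D_h = Y @ Lam.T @ np.linalg.inv(Lam @ Lam.T)
```

Because Y is generated exactly from D_true and Λ has full row rank here, the fit recovers the ground-truth dictionary up to numerical precision.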
Finally, the Landsat 7 image and the transition images are fused via high-pass modulation (HPM). This fusion is in accordance with a linear temporal change model between T_1 and T_2.
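A minimal sketch of one common HPM formulation, assuming the fused prediction scales L_1 by the temporal change ratio T_2/T_1; the exact weighting used in the reviewed method may differ.

```python
import numpy as np

def hpm_fuse(L1, T1, T2, eps=1e-6):
    """High-pass modulation: scale the known Landsat image L1 by the
    temporal change ratio between the transition images T2 and T1.
    (One common HPM form, assumed here for illustration; eps guards
    against division by zero.)"""
    return L1 * T2 / np.maximum(T1, eps)
```

With this form, a 20% radiometric increase in the transition images maps directly onto a 20% increase in the fused Landsat prediction.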
In general, experiments show that spatio-temporal fusion based on sparse representation performs better on phenology change than on land-cover type change. This can be interpreted in terms of sparsity theory: larger representation errors usually arise when more complex signals must be represented. Further work is also needed to reduce the computational complexity of spatio-temporal fusion approaches based on sparse representation.

Bayesian Approaches
In its most general formulation [27], the problem of Bayesian image fusion can be described as the fusion of a HyperSpectral (HS) image (Y) with low spatial resolution and high spectral resolution and an MS image (X) with high spatial resolution and low spectral resolution. Ideally, the fused result Z has the spatial resolution of X and the spectral resolution of Y. It is assumed that all images are equally spatially sampled on a grid of N pixels, which is sufficiently fine to capture the spatial resolution of X.
The HS image has N_b spectral bands, and the MS image has N_h (N_h < N_b) bands, with N_h = 1 in the case of a panchromatic band (the pan-sharpening case).
By denoting images column-wise, lexicographically ordered for matrix-notation convenience, as in Z = [Z_1ᵀ, Z_2ᵀ, ..., Z_Nᵀ]ᵀ, where Z_i denotes the column vector representing the i-th pixel of Z, the imaging model between Z and Y can be written as

Y = W Z + N, (15)

where W is a (potentially wavelength-dependent, spatially-varying) system point spread function (PSF), which blurs Z, and N is multivariate Gaussian-distributed additive noise with zero mean and covariance matrix C_N, independent of X and Z. Between Z and X, a jointly normal model is generally assumed.
The approach to the pan-sharpening problem within a Bayesian framework relies on the statistical relationships between the various spectral bands and the panchromatic band. An estimate of Z is obtained as

Ẑ = arg max_Z p(Y|Z) p(Z|X). (16)

Generally, the first probability density function p(Y|Z) of the product in Equation (16) is obtained from the observation model in Equation (15), where the PSF W reflects the spatial blurring of the observation Y and N is additive Gaussian white noise with covariance matrix C_N. The second pdf p(Z|X) in Equation (16) follows from the assumption that Z and X are jointly normally distributed, which leads to a multivariate normal density for p(Z|X).
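Under these Gaussian assumptions, the maximizer of p(Y|Z) p(Z|X) has a closed form obtained by solving a linear system. A small dense sketch, with `mu_z` and `C_Z` standing for the mean and covariance of the conditional prior p(Z|X) (names chosen here for illustration):

```python
import numpy as np

def bayesian_fusion_map(Y, W, C_N, mu_z, C_Z):
    """MAP estimate of Z for Y = W Z + N, with Gaussian noise N(0, C_N)
    and Gaussian conditional prior Z | X ~ N(mu_z, C_Z).
    Maximizing p(Y|Z) p(Z|X) reduces to the normal equations
    (Wᵀ C_N⁻¹ W + C_Z⁻¹) Z = Wᵀ C_N⁻¹ Y + C_Z⁻¹ mu_z."""
    Cn_inv = np.linalg.inv(C_N)
    Cz_inv = np.linalg.inv(C_Z)
    A = W.T @ Cn_inv @ W + Cz_inv
    b = W.T @ Cn_inv @ Y + Cz_inv @ mu_z
    return np.linalg.solve(A, b)
```

With identity blur and equal unit variances, the estimate is simply the average of the observation and the prior mean, which matches the usual Gaussian-product intuition.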
Different solutions have been proposed, which are based on specific assumptions that make the problem mathematically tractable.
In Fasbender, Radoux and Bogaert [28], a simplified model Y = Z + N is assumed first, which does not account for the modulation transfer function of the imaging system; a linear-regression model linking the multispectral pixels to the panchromatic ones is then considered; finally, a non-informative prior pdf is adopted for the image Z to be estimated.
In Zhang, De Backer and Scheunders [27], the estimation problem is approached in the domain of the à-trous wavelet coefficients. Since the à-trous transform is a linear operation, the model in Equation (15) holds for each of the obtained detail images, and the estimation in Equation (16) can be applied to the transformed coefficients at each scale. The advantage of applying both models in the wavelet domain is that they act at each orientation and resolution level, with a separate estimation of the covariances for each level. This allows a resolution- and orientation-specific adaptation of the models to the image information, which benefits the fusion process.
In Zhang, Duijster and Scheunders [29], a Bayesian restoration approach is proposed. The restoration is based on an expectation-maximization (EM) algorithm, which applies a deblurring step and a denoising step iteratively. The Bayesian framework allows for the inclusion of spatial information from the high spatial resolution image (multispectral or panchromatic) and accounts for the joint statistics with the low spatial resolution image (possibly a hyperspectral image).
The key concept in the EM-based restoration procedure is that the observation model in Equation (15) is inverted by performing the deblurring and the denoising in two separate steps. To accomplish this, the observation model is decomposed as

Y = W Z̃ + N₁, (17)
Z̃ = Z + N₂, (18)

so that the noise is split into two independent parts N₁ and N₂, with N = W N₂ + N₁. Choosing N₂ to be white facilitates the denoising problem in Equation (18); however, W colors the noise, so that N₁ becomes colored. Equations (17) and (18) are solved iteratively using the EM algorithm, and an estimate of Z is obtained from a restoration of the observation Y combined with a fusion with the observation X.
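The deblur/denoise split can be illustrated, in spirit only, by alternating a gradient (Landweber) deblurring step on the model Y = W Z with a simple shrinkage denoiser. This is not the EM algorithm of [29]; soft shrinkage merely stands in for the Bayesian denoising step to show the two-step structure.

```python
import numpy as np

def iterative_deblur_denoise(Y, W, n_iter=50, step=None, shrink=0.0):
    """Illustrative two-step restoration: a Landweber deblurring update
    followed by a soft-shrinkage denoising update, iterated. With
    shrink = 0 this reduces to plain Landweber deblurring."""
    if step is None:
        step = 1.0 / np.linalg.norm(W, 2) ** 2   # stable gradient step
    Z = W.T @ Y                                  # crude initialization
    for _ in range(n_iter):
        Z = Z + step * W.T @ (Y - W @ Z)                       # deblur
        Z = np.sign(Z) * np.maximum(np.abs(Z) - shrink, 0.0)   # denoise
    return Z
```

For an invertible, well-conditioned W and no shrinkage, the iteration converges to the exact deblurred solution W⁻¹Y.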
Bayesian approaches to pan-sharpening suffer from modeling errors due to simplifications intentionally introduced to reduce computational complexity, as in Fasbender, Radoux and Bogaert [28], where the Modulation Transfer Functions (MTFs) of the imaging sensors are not considered.
Furthermore, iterative processing and numerical instability make Bayesian approaches more complex and less reliable for practical remote sensing image fusion on true image data than multiresolution-based or component-substitution fusion algorithms.

Variational Approaches
Pan-sharpening is in general an ill-posed problem that requires regularization for optimal results. The approach proposed in Palsson, Sveinsson and Ulfarsson [30] uses total variation (TV) regularization to obtain a solution that is essentially free of noise while preserving the fine detail of the PAN image. The algorithm uses majorization-minimization (MM) techniques to obtain the solution iteratively.
Formally, the dataset consists of a high spatial resolution panchromatic image y_PAN and a low spatial resolution multispectral image y_MS. The PAN image has linear dimensions four times those of the MS image; thus, the ratio in pixels is one to 16. The MS image contains four bands, RGB and near-infrared (NIR). The PAN image is of dimension N_r × N_c, and the MS image is of dimension m × n, where m = N_r/4 and n = N_c/4.
There are two assumptions that define the model. The first is that the low spatial resolution MS image can be described as a degradation (decimation) of the pan-sharpened image x. In matrix notation, y_MS = M_1 x + ε, where M_1 = I_4 ⊗ S is a decimation matrix of size 4mn × 4N_rN_c, I_4 is the 4 × 4 identity matrix, S is a single-band decimation operator of size mn × N_rN_c, ⊗ is the Kronecker product and ε is zero-mean Gaussian noise.
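The single-band decimation operator and its four-band Kronecker extension can be built explicitly for small images. In this sketch, block averaging stands in for the paper's decimation (the exact low-pass kernel is method-specific), and the band structure follows the I_4 ⊗ (·) pattern described above.

```python
import numpy as np

def decimation_matrix(nr, nc, r=4):
    """Build a four-band decimation operator for an Nr x Nc image,
    vectorized row-major. Each band is reduced by averaging r x r blocks;
    averaging is an assumption standing in for the paper's decimation."""
    def avg_1d(n):
        # maps a length-n signal to its n//r block means
        S = np.zeros((n // r, n))
        for i in range(n // r):
            S[i, i * r:(i + 1) * r] = 1.0 / r
        return S
    # for row-major vectorization, 2-D decimation factorizes as the
    # Kronecker product of the row and column 1-D operators
    S2d = np.kron(avg_1d(nr), avg_1d(nc))
    # identity over the four bands, per the I_4 ⊗ (.) structure
    return np.kron(np.eye(4), S2d)
```

For realistic image sizes, one would never materialize this matrix; the product M_1 x is applied as filtering plus subsampling on each band.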
The second assumption is that the PAN image is a linear combination of the bands of the pan-sharpened image plus additive Gaussian noise. In matrix notation this reads y_PAN = M_2 x + ε, where ε is zero-mean Gaussian noise and

M_2 = [ω_1 I, ω_2 I, ω_3 I, ω_4 I],

where ω_1, ..., ω_4 are constants that sum to one and determine the weight of each band in the PAN image. Now, M_1 and M_2 have the same number of columns; thus, the expressions for y_MS and y_PAN can be combined into a single equation, producing the classical observational model

y = M x + ε, with y = [y_MSᵀ, y_PANᵀ]ᵀ and M = [M_1ᵀ, M_2ᵀ]ᵀ.

The TV of the MS image can be defined as

TV(x) = Σ_i sqrt( ([D_H x]_i)² + ([D_V x]_i)² ),

where x is the vectorized four-band MS image, D_H = (I_4 ⊗ D_H), D_V = (I_4 ⊗ D_V), and the single-band matrices D_H and D_V on the right-hand side are defined such that, when multiplied by a vectorized image, they give the first-order differences in the horizontal and the vertical direction, respectively. The cost function of the TV-regularized problem can be formulated as

J(x) = ||y − M x||² + λ TV(x).

Minimizing this cost function is difficult because the TV functional is not differentiable. However, MM techniques can be used to replace this difficult problem with a sequence of easier ones,

x_{k+1} = arg min_x Q(x, x_k), (24)

where x_k is the current iterate and Q(x, x_k) is a function that majorizes the cost function J(x), that is, Q(x, x_k) ≥ J(x) for all x, with equality when x = x_k. By iteratively solving Equation (24), x_k converges to the global minimum of J(x).
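The MM mechanics are easiest to see in one dimension. The sketch below applies the quadratic-majorizer iteration to TV denoising, with the identity observation operator standing in for M and 1-D differences for D_H, D_V; each iteration solves the linear system arising from the majorizer at the current iterate.

```python
import numpy as np

def tv_denoise_mm(y, lam=1.0, n_iter=50, eps=1e-8):
    """1-D illustration of MM for TV regularization: the non-smooth term
    sum|Dx| in J(x) = 0.5||y - x||^2 + lam * sum|Dx| is replaced by a
    quadratic majorizer touching J at x_k, and each MM step solves the
    resulting linear system. (Identity observation operator assumed;
    the pan-sharpening case uses M and 2-D differences instead.)"""
    n = len(y)
    D = np.diff(np.eye(n), axis=0)          # first-order difference matrix
    x = y.copy()
    for _ in range(n_iter):
        w = 1.0 / (np.abs(D @ x) + eps)     # majorizer weights at x_k
        A = np.eye(n) + lam * D.T @ (w[:, None] * D)
        x = np.linalg.solve(A, y)           # minimize Q(., x_k)
    return x
```

Because each majorizer touches J at the current iterate, the iteration monotonically decreases the cost, which the test below checks on a noisy step signal.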
A majorizer for the TV term can be written in matrix notation as a quadratic form in x, xᵀ Dᵀ Λ_k D x plus a constant independent of x, where Λ_k is a diagonal matrix of weights computed from the current iterate x_k and D = [D_Hᵀ, D_Vᵀ]ᵀ stacks the horizontal and vertical difference operators. Substituting this majorizer for the TV term in J(x) gives a quadratic function Q(x, x_k) to minimize at each MM iteration. It should be noted that all of the matrix multiplications involving the operators D, Dᵀ, M and Mᵀ can be implemented as simple operations on multispectral images. However, the multiplication by Mᵀ corresponds to nearest-neighbor interpolation of an MS image, which is required by the problem formulation but gives inferior results with respect to bilinear interpolation, both according to quality metrics and to visual inspection.
In general, variational methods are very sensitive to the unavoidable inaccuracies of the adopted observational model. Experimental results on true spaceborne multispectral and panchromatic images show the limitations of this class of pan-sharpening methods. As an example, the algorithm of [30] described in this section provides fused images from QuickBird data characterized by spectral and spatial distortions [47] only slightly lower than those obtained by a very simple (and low-performance) multiresolution-based pan-sharpening method, namely a trivial coefficient-substitution method in the undecimated wavelet transform (UDWT) domain: D_λ = 0.042 and D_S = 0.027 for [30], against D_λ = 0.048 and D_S = 0.055 for the UDWT method.

Performance Comparisons
Four SR-based pan-sharpening methods have been selected for performance assessment: SparseFI [38], J-SparseFI [46], SR-D [45] and the method proposed in [37].The AWLP algorithm [33] has been chosen as a benchmark to compare the new SR-based pan-sharpening methods to a simple, effective and widely-adopted classical pan-sharpening method.
The quality of the fused products is measured by applying the synthesis property of Wald's protocol [48]. The synthesis property cannot generally be verified directly, since the ideal MS image at the highest spatial resolution is not available. Therefore, synthesis is usually checked at degraded spatial scales. Spatial degradation is achieved by means of proper low-pass filtering followed by decimation by a factor equal to the scale ratio of the Pan to MS datasets. The Pan image is degraded to the resolution of the multispectral image, and the original MS image to a lower resolution depending on the scale ratio for which the fusion is assessed (four for IKONOS, QuickBird and WorldView-2 data, for example). The fusion method is applied to these two sets of degraded images, resulting in a set of fused images at the resolution of the original MS image. The original MS image now serves as the reference, and the synthesis property can be tested.
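A minimal sketch of the spatial degradation step, with block averaging standing in for the MTF-matched low-pass filter used in operational protocols:

```python
import numpy as np

def degrade(img, ratio=4):
    """Reduced-resolution assessment (Wald's synthesis check): low-pass
    filter by ratio x ratio block averaging, then decimate by the Pan/MS
    scale ratio. Works on single-band (H, W) or multiband (H, W, B)
    arrays; a box filter is assumed here for simplicity."""
    h, w = img.shape[0], img.shape[1]
    h, w = h - h % ratio, w - w % ratio      # crop to a multiple of ratio
    img = img[:h, :w]
    return img.reshape(h // ratio, ratio, w // ratio, ratio,
                       *img.shape[2:]).mean(axis=(1, 3))
```

Both Pan and MS inputs are passed through this operator before fusion, so the fused output can be compared against the original MS image as reference.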
Among possible distortion indexes, SAM and ERGAS have been selected for algorithm comparison. SAM computes the absolute value of the spectral angle between the two vectors representing the fused MS image (obtained from spatially degraded data) and the original MS image. SAM is usually expressed in degrees and equals zero when the two MS images are spectrally identical. ERGAS is another global error index, based on the root mean squared error of each band, normalized by the band mean and averaged over the bands.
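Both indexes are short to implement. A sketch under the usual definitions: SAM as the per-pixel spectral angle averaged over the image (in degrees), and ERGAS with the 100 · (1/ratio) scaling over the band-wise relative RMSE.

```python
import numpy as np

def sam_degrees(ref, fused, eps=1e-12):
    """Mean spectral angle (degrees) between per-pixel spectra of the
    reference and fused images, both shaped (rows, cols, bands)."""
    r = ref.reshape(-1, ref.shape[-1])
    f = fused.reshape(-1, fused.shape[-1])
    num = np.sum(r * f, axis=1)
    den = np.linalg.norm(r, axis=1) * np.linalg.norm(f, axis=1) + eps
    ang = np.arccos(np.clip(num / den, -1.0, 1.0))
    return np.degrees(ang).mean()

def ergas(ref, fused, ratio=4):
    """ERGAS: 100 * (1/ratio) * sqrt of the band-averaged squared
    relative RMSE (RMSE of each band divided by its reference mean)."""
    rel = []
    for b in range(ref.shape[-1]):
        rmse = np.sqrt(np.mean((ref[..., b] - fused[..., b]) ** 2))
        rel.append((rmse / ref[..., b].mean()) ** 2)
    return 100.0 / ratio * np.sqrt(np.mean(rel))
```

Both indexes vanish for a perfect fusion, so lower values indicate better spectral (SAM) and global radiometric (ERGAS) fidelity.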
Objective comparisons are reported from experiments published in [37,38,45,46] and performed on simulated images produced from sensed airborne HySpex images and on true IKONOS and WorldView-2 images.
Visual results are reported in Figures 5 and 6 for the true WorldView-2 and the simulated HySpex datasets, respectively. Figure 6 shows the WorldView-2-like images simulated from the airborne visible to near-infrared (VNIR) HySpex data acquired over Munich, Germany, in 2012 by the German Aerospace Center (DLR). Starting from the input 0.75-m HySpex hyperspectral data cube, whose true-color composition is illustrated in Figure 6a, a 3-m synthetic multispectral image and a 0.75-m Pan image matching the spectral specifications of the WorldView-2 imager have been simulated [46]. The original 0.75-m data can therefore be used as the reference image for the pan-sharpened products. It is evident that the visual performance of the simple AWLP method is comparable to that of the SR-based pan-sharpening methods, both in terms of spectral preservation and of spatial detail injection. Locally, a slightly better spectral fidelity may be noticed in the J-SparseFI result in Figure 6f with respect to AWLP in Figure 6d, for example in the red structure at the bottom-right corner of the image. For each selected method, Table 1 reports the measured improvements (in percentage) over AWLP, if any, on the SAM, ERGAS and Q2n values (the quality index for images with 2^n bands, i.e., Q8 for WorldView-2 and Q4 for IKONOS), denoted ∆SAM, ∆ERGAS and ∆Q2n, respectively, after averaging over the considered datasets. Q2n is a single quality index accounting for both spatial and spectral quality, ranging from zero (very low quality) to one, which indicates a perfect match to the reference image with 2^n bands [49]. A positive value of ∆ERGAS or ∆SAM and a negative value of ∆Q2n, which indicate a performance loss with respect to AWLP, are shown in red. This experimental protocol has been adopted to objectively assess the performance of different algorithms in a comparative way, even when a common benchmarking dataset is not available, which is the current situation within the remote sensing image fusion community. Since the AWLP algorithm is assumed as a standard, widespread pan-sharpening method, the resulting comparative analysis is straightforward and clear.
Table 1. Performance comparison on IKONOS and WorldView-2 data of recent SR-based pan-sharpening methods with respect to AWLP [33]. The computation time for J-SparseFI refers to an implementation on a 128-core computer [46].

The increase in computational time with respect to the fast AWLP method is also shown in Table 1. SparseFI and J-SparseFI are the only methods that provide a confident quality improvement over AWLP. However, this improvement amounts to a few percent, at the expense of a huge computational complexity; it is worth recalling that the J-SparseFI computing time refers to an implementation on a 128-core computer [46]. The application of the pioneering method by Li et al. [37] to operational pan-sharpening is impractical as well. SR-D [45], even if not highly performant, is promising for its reduced computational complexity, due to the application of sparse coding to the high spatial resolution details only.

Finally, Table 2 reports a synoptic view of different SR-based methods for different application fields in remote sensing image fusion. Considering both the declared computational complexity and the objective assessment of the algorithms, it is evident that, for the most common application fields in the broad domain of remote sensing image fusion, algorithms based on the super-resolution paradigm are not yet mature enough to solve fusion processing tasks in operational remote sensing systems.

Conclusions
Super-resolution, compressed sensing, Bayesian estimation and variational theory are methodologies that have recently been applied to spectral-spatial and spatio-temporal image resolution enhancement for remote sensing applications. Specific assumptions on the image formation process, and model simplifications that make the problem mathematically tractable, are normally required to solve, through constrained optimization algorithms, the ill-posed problems that are usually encountered.
When prior knowledge about the observation model is not sufficiently verified on true image data, due to uncertainty on the band-dependent point spread function of the imaging system or when

Figure 3. Predicting the Landsat image at date t_2 from the Landsat images at dates t_1 and t_3 and the MODIS images at all dates.

Table 2. A synoptic view of recent remote sensing image fusion algorithms based on the super-resolution paradigm (FE: Filter Estimation; BDF: Bayesian Data Fusion; ASE: Adaptive Structuring Element; SPSTFM: sparse-representation-based spatio-temporal reflectance fusion model; SASFM: Spatial And Spectral Fusion Model).