Pansharpening by Convolutional Neural Networks

Abstract: A new pansharpening method is proposed, based on convolutional neural networks. We adapt a simple and effective three-layer architecture recently proposed for super-resolution to the pansharpening problem. Moreover, to improve performance without increasing complexity, we augment the input by including several maps of nonlinear radiometric indices typical of remote sensing. Experiments on three representative datasets show the proposed method to provide very promising results, largely competitive with the current state of the art in terms of both full-reference and no-reference metrics, and also upon visual inspection.


Introduction
Multi-resolution images are widespread in remote sensing as they provide users with images at the highest resolution both in the spatial and in the spectral domains. Since these goals cannot both be achieved by a single sensor, modern systems acquire two images: a panchromatic (PAN) component with high spatial and low spectral resolution, and a multispectral (MS) component with complementary properties. In many cases, one of these components is enough to satisfy the user's needs. Sometimes, instead, they are processed jointly [1] to take full advantage of all available information. More often, the two pieces of information are fused through a pansharpening algorithm, generating a datacube at high resolution in both the spatial and spectral domains. Pansharpening is very important for remote sensing scene interpretation, and is also used as a pre-processing step for several image processing tasks, such as feature extraction, segmentation, and classification. Consequently, it has been the object of intense research, and many methods have been proposed in the last decades, as reported for example in [2].
A classical and simple approach is component substitution (CS) [3]. It consists in transforming the MS image into a suitable domain, where one of its components is replaced by the high-resolution PAN image. After up-sampling the other components, the whole set is transformed back into the original domain. Clearly, the more correlated the PAN is with the replaced component, the less distortion is introduced. A simple and fast procedure is based on the intensity-hue-saturation (IHS) transform [4], which can be used only when three bands are available. However, a generalized IHS transform (GIHS) can be defined [5], which includes the response of the near-infrared (NIR) band. Other methods use principal component analysis (PCA) [6], the Brovey transform (BT) [7], and Gram-Schmidt (GS) spectral sharpening [8]. Although these techniques preserve spatial information accurately, and are quite robust to co-registration errors, they are often characterized by high spectral distortion, because the PAN and MS components are acquired in spectral ranges that overlap only partially. In [9], two enhanced versions of GIHS and GS are proposed to deal with this problem, while in [10] an adaptive approach using partial replacement (PRACS) is presented.
An alternative approach to component substitution goes through the injection of high-frequency details, extracted from the PAN image, into the up-sampled version of the MS. In general, methods based on detail injection guarantee a better spectral fidelity than those based on component substitution. They differ in how spatial details are extracted from the PAN and how they are injected into the MS. In the basic approach, details are obtained as the difference between the PAN image and a smoothed version of itself. The smoothed version can be computed by means of a simple low-pass filter on a single-level decomposition [11] or through a multiresolution analysis (MRA) on several decomposition bands [12]. These last methods rely on redundant representations, characterized by the shift-invariance property, like the à trous wavelet transform (ATWT) or the Laplacian pyramid (LP) [13]. In order to better capture directional information, methods based on non-separable transforms, like curvelets, have also been developed [14]. Unfortunately, MRA approaches may exhibit spatial distortions (e.g., ringing artifacts) which worsen the visual quality of the fused data.
However extracted, detail information must eventually be injected into the up-sampled MS bands to improve spatial resolution. To avoid introducing distortion and artifacts, a model of the involved data is necessary [15]. This can be defined at a coarser scale, where both MS and PAN images are available, and then extended to a finer scale. By so doing, however, it is implicitly assumed that the same model holds at different scales, which is actually not true, especially considering very high resolution data, like highly detailed urban environments [16]. The problem can be reduced by tuning the filters so as to closely match the modulation transfer functions (MTFs) of the sensors. For example, in [17] a smoothing-filter-based intensity modulation (SFIM) has been proposed. It modulates spatial details by the ratio between a high-resolution image and its low-pass version, since this cancels the spectral and topographical contrast of the higher resolution image and retains only the higher resolution edges. However, performance depends critically on the accuracy of PAN-MS co-registration. To obtain better co-registered upscaled images, [18] uses the Induction scaling technique, followed by a new fusion technique called Indusion. Other methods following the same philosophy work in an MRA setting, e.g., with ATWT [15,19-21] or with LP representations [16,22-24]. Some recent papers, to avoid improper modeling, recast the problem in an optimization framework. In [25], a different MMSE-optimal detail image is extracted for each MS band, by evaluating band-dependent generalized intensities. This method was extended in [26], where parameter estimation is performed through nonlocal parameter optimization based on K-means clustering. In [27], pansharpening is cast as a Bayesian estimation problem, with a suitable joint prior for the observed data and the unknown pansharpened image. Total variation is used in [28], with the pansharpened image obtained by constrained energy minimization.
In the last few years, apart from component substitution and detail injection, much attention has been devoted to sparse representations. In [29], a method based on compressive sensing with sparsity-inducing prior information has been proposed, where sparsity is enforced by building a dictionary of image patches randomly sampled from high-resolution MS images. Further developments have been proposed in [30-32] with the goal of avoiding the cost of dictionary construction. As of today, instead, there has been limited interest in deep learning, with some exceptions, such as [33], where a modified sparse denoising autoencoder algorithm is proposed.
In this paper, motivated by the impressive performance of deep learning in a large number of application fields, not least remote sensing [34,35], a new pansharpening method is proposed, based on convolutional neural networks (CNN). We build upon the architecture recently proposed in [36] for the closely related super-resolution problem. First, we simply adapt the architecture to the pansharpening case. Then, we improve it by leveraging the huge domain-specific knowledge available in the remote sensing field. In particular, we augment the input by including a number of radiometric indices, tailored to features that have proved very relevant for applications. Such nonlinear indices could hardly be extracted by a CNN, and only with a much deeper and more data-hungry architecture. Performance assessment fully supports our proposal. We carry out experiments on three datasets, comprising images acquired by the Ikonos, GeoEye-1, and WorldView-2 multiresolution sensors, and compare results with a score of well-established reference techniques, obtaining significant improvements on all datasets and under all performance metrics, both full-reference and no-reference.
In the rest of the paper, we provide some necessary background on deep learning and CNN-based super-resolution (Section 2), describe the proposed method (Section 3), and present and discuss experimental results (Section 4). Finally, we draw conclusions and outline future research. In addition, a list of the abbreviations used in this paper is provided at the end.

Deep Learning and Convolutional Neural Networks
As the name suggests, artificial neural networks (ANN) take inspiration from their biological counterparts, trying to emulate some of their remarkable abilities. In fact, the human visual system can easily solve complex pattern recognition problems that elude the most powerful computers. This happens thanks to a tightly interconnected network of simple processing units, the neurons. Based on this observation, the artificial neuron is designed to perform a very simple task: given a set of inputs (x_1, ..., x_K), it outputs a nonlinear function of their weighted average,

y = f(∑_{k=1}^{K} w_k x_k + b),

where the w_k are the neuron weights, b is a bias term, and f(·) is a nonlinear activation function. By so doing, it matches simple input features, such as edges, lines, or blobs, in the case of images. The outputs of a large layer of neurons operating in parallel become the input of a further layer of neurons, which combine basic features to extract features of higher semantic value, and this proceeds through several layers, allowing for a high level of abstraction. This deep layered architecture is therefore responsible for the impressive abstraction ability of such networks. By modifying the neuron weights in response to suitable input stimuli, the network learns how to perform all sorts of desired tasks.
In a traditional fully-connected network, each neuron takes as input the outputs of all neurons of the previous layer, and feeds its own output to all neurons of the next layer. As a consequence, a deep ANN includes a very large number of weights, which must be learned on a proportionally large training set, calling for an exceedingly high computational complexity. Convolutional neural networks (CNN) [37] overcome this problem to a large extent by renouncing full connectivity. In a CNN, in fact, each neuron has a limited receptive field, processing features observed only in a local neighborhood of the neuron itself. This makes full sense for many sources, and notably for images, where spatial features are intrinsically local (especially in the lower layers) and spatially invariant. Due to this latter property, in particular, one can use the very same set of weights for all neurons of the same layer, by which the output at neuron (i, j) can be expressed as a convolution with the previous-layer input,

y_{i,j} = f(∑_{u,v} w_{u,v} x_{i−u,j−v} + b)

(a third summation is required if x_{i,j} is itself a vector), or, in compact matrix notation,

y = f(w ∗ x + b),

where ∗ denotes convolution. CNNs drastically reduce the number of connections among neurons, hence the number of free parameters to learn, enabling the use of deep learning in practical applications. As a matter of fact, CNNs have become very popular in recent years, thanks to efficient implementations [38], with software available online, relatively fast training achievable with cheap and powerful GPUs, and also thanks to the huge mass of labeled visual data available on the web [39], essential for training complex networks.
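To make the weight-sharing and limited-receptive-field ideas concrete, here is a minimal NumPy sketch of a single convolutional layer with a ReLU activation. All names (`conv_layer`, `relu`) and the averaging kernel are illustrative, not taken from the paper; like most deep learning frameworks, the sketch computes an unflipped correlation rather than a strict convolution.

```python
import numpy as np

def relu(x):
    # Nonlinearity f applied to the weighted sum plus bias.
    return np.maximum(0.0, x)

def conv_layer(x, w, b):
    """Valid 2-D 'convolution' (kernel not flipped, as in most deep
    learning frameworks) of a single-channel image x with kernel w,
    followed by bias and ReLU, i.e. y = f(w * x + b)."""
    kh, kw = w.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each output neuron sees only a kh x kw receptive field
            # and shares the same weights w across all positions.
            out[i, j] = np.sum(w * x[i:i + kh, j:j + kw]) + b
    return relu(out)

img = np.arange(16.0).reshape(4, 4)
w = np.ones((3, 3)) / 9.0   # a simple averaging kernel, for illustration
y = conv_layer(img, w, b=0.0)
```

Because the same 3 × 3 kernel is reused at every position, only ten parameters (nine weights plus one bias) are learned, regardless of the image size.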
When dealing with images, the input variables are two-dimensional, spatially related entities. Spatial dependencies are propagated throughout the network, which justifies why the features output by intermediate (hidden) layers are represented as images as well. The output y of a layer is also referred to as a feature map and, typically, multiple feature maps are extracted at once in a layer, using different convolutional kernels w. Eventually, depending on the specific task required of the net, the output of the network may itself be an image (think of denoising, segmentation, super-resolution), or else a set of spatially unrelated decision variables (detection, classification, recognition).

CNN-Based Super-Resolution
To the best of our knowledge, pansharpening has never been addressed before by deep learning methods. However, pansharpening itself can be regarded as a special form of super-resolution, a very active research area in computer vision, and methods developed in one field can often be applied to the other. In fact, we take inspiration from a recently published paper [36] on the super-resolution of natural images via CNN, which is briefly recalled in the following.
In [36], the authors design a CNN which mimics the behavior of a sparse-coding super-resolution method, and in fact demonstrate that the former generalizes the latter, performing the same processing steps and, possibly, more general tasks. This strong analogy, explicitly pursued at design time, is a very remarkable feature, as it allows one to associate a precise meaning with each layer of the CNN architecture, something not easily obtained in other cases. Recall that sparse-coding super-resolution is a machine learning method that involves three main steps: (a) projection of each image patch on a low-resolution dictionary; (b) mapping between the patches of the low-resolution and the corresponding high-resolution dictionaries; (c) reconstruction through the combination of patches of the high-resolution dictionary. Accordingly, the CNN proposed in [36] is composed of three layers that correspond roughly to the above-mentioned steps. Before entering the network, the image is up-sampled to the target resolution via bicubic interpolation. Hence, the first layer computes 64 feature maps using a 9 × 9 receptive field and a ReLU (rectified linear unit, max(0, x) [40]) nonlinearity. The second layer computes 32 feature maps using a 1 × 1 receptive field and, again, a ReLU. Finally, the third layer, with a 5 × 5 receptive field and a simple identity activation function, provides the desired high-resolution image. In summary, the three layers compute the following items:

f_1(x) = max(0, w_1 ∗ x + b_1),
f_2(y) = max(0, w_2 ∗ y + b_2),
f_3(z) = w_3 ∗ z + b_3.

To gain insight into the network's rationale, let us track a single 9 × 9 input patch x_p centered at point p, obtained from the original image through upsampling. This is projected (y_p = w_1 ∗ x_p) onto 64 different 9 × 9 × 3 patches, a sort of low-resolution dictionary. Next, as w_2 is defined on a 1 × 1 receptive field, a nonlinear mapping of the 64-vector y_p to a 32-vector z_p = f_2(y_p) follows, reminiscent of a translation to the high-resolution basis. Finally, through the 5 × 5 convolution in the third layer, z_p contributes to the final reconstruction at p and at all neighbors in a 5 × 5 square, reminiscent of the weighted average of high-resolution patches.
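The three-layer forward pass just described can be sketched in NumPy as follows. This is an illustrative re-implementation of inference only (no training): the randomly initialized weights stand in for learned parameters, and all names are our own.

```python
import numpy as np

def conv(x, w, b):
    """'Same' multi-channel correlation: x is (H, W, Cin),
    w is (k, k, Cin, Cout), b is (Cout,); zero padding keeps H, W."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    H, W, _ = x.shape
    out = np.zeros((H, W, w.shape[3]))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + k, j:j + k, :]
            out[i, j] = np.tensordot(patch, w, axes=3) + b
    return out

def srcnn(x, params):
    # Layer 1: patch extraction, 9 x 9 kernels, ReLU.
    y = np.maximum(0.0, conv(x, *params[0]))
    # Layer 2: nonlinear mapping, 1 x 1 kernels, ReLU.
    z = np.maximum(0.0, conv(y, *params[1]))
    # Layer 3: reconstruction, 5 x 5 kernels, identity activation.
    return conv(z, *params[2])

rng = np.random.default_rng(0)
n1, n2 = 64, 32          # feature maps per layer, as in [36]
params = [
    (rng.normal(0, 1e-3, (9, 9, 3, n1)), np.zeros(n1)),
    (rng.normal(0, 1e-3, (1, 1, n1, n2)), np.zeros(n2)),
    (rng.normal(0, 1e-3, (5, 5, n2, 3)), np.zeros(3)),
]
x = rng.random((16, 16, 3))   # an already-interpolated RGB input
out = srcnn(x, params)
```

Note that the output has the same spatial size as the (bicubically interpolated) input: all the resolution enhancement happens through the learned mapping, not through resampling inside the network.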
Starting from this reference architecture, the authors of [36] test several variations, adding further hidden layers, for example, or varying the receptive fields. Although some such schemes provide slightly better performance, this comes at the cost of increased complexity, which speaks in favor of the basic architecture. The best trade-off is achieved using a 3 × 3 receptive field in the second layer.
The various architectures were always designed with the sparse-coding paradigm in mind. While this is a precious guideline, it is certainly possible that better results may be obtained, for the same complexity, by removing this constraint. However, understanding the internal behavior of a given network is still an unsolved, and hot, topic in computer vision.

Proposed CNN-Based Pansharpening
Given the excellent performance of the super-resolution method proposed in [36], we decided to apply the same approach to the pansharpening problem. However, we also want to take advantage of the large domain-specific knowledge existing on remote sensing image processing, introducing some reasonable changes aimed at better exploiting the data. Accordingly, in the following, we first provide some details about the targeted datasets, then we describe the basic architecture, directly inherited from the super-resolution method, and finally, based on a first analysis of the results, we propose our remote-sensing-specific solution.

Datasets
The proposed method, although rather general, has been conceived for very high resolution data, and in particular it has been tested on data acquired by some of the most advanced systems for the remote sensing of optical data. These are Ikonos, GeoEye-1, and WorldView-2, whose main characteristics are summarized in Tables 1 and 2. This sensor selection allowed us to study the robustness of the proposed method with respect to both spectral resolution (Ikonos/GeoEye-1 vs. WorldView-2) and spatial resolution (Ikonos vs. GeoEye-1/WorldView-2).

Basic Architecture
Figure 1 provides a pictorial representation of the basic architecture, which closely follows the super-resolution CNN (SRCNN) architecture. The example is tailored to Ikonos and GeoEye-1 data, with 4-band multispectral components and a panchromatic band with 4 × 4 higher spatial resolution, but applies to other multi-resolution data with obvious modifications. The four low-resolution spectral bands are first upsampled (4 × 4) and interpolated. The result is then stacked with the high-resolution panchromatic band to form the 5-component input. Hence, the network works at the target resolution from the beginning, with no need for up/down-sampling. The output comprises only 4 bands, which correspond to the input multispectral bands, but at the target panchromatic resolution. We keep the three-layer structure of [36], but replace the 1 × 1 kernels of the central layer with 5 × 5 kernels as, again, these provided some performance improvement in preliminary experiments. Table 3 summarizes the parameters of the overall architecture. It is worth underlining that this is quite a simple CNN architecture, relatively shallow, and hence easy to train with a limited training set. This is extremely important for the remote sensing field, where training data are often scarce, as opposed to the millions of images typically available for computer vision applications. We will therefore stick to this three-layer architecture also in the following, adapting it to remote sensing pansharpening based on prior knowledge of the problem, and working only on the input data.
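The input-assembly step (upsample the MS bands, then stack the PAN band) can be sketched as follows. Nearest-neighbour replication stands in for the actual interpolation kernel, and the function name is illustrative.

```python
import numpy as np

def build_input(ms, pan, scale=4):
    """Stack the upsampled MS bands with the PAN band.
    ms:  (h, w, 4) low-resolution multispectral image;
    pan: (scale*h, scale*w) panchromatic band.
    Nearest-neighbour replication is a crude stand-in for the
    interpolation kernel of [13] used in the paper."""
    up = ms.repeat(scale, axis=0).repeat(scale, axis=1)
    assert up.shape[:2] == pan.shape
    # 5-component input: 4 interpolated MS bands + 1 PAN band.
    return np.concatenate([up, pan[..., None]], axis=2)

ms = np.random.default_rng(1).random((8, 8, 4))
pan = np.random.default_rng(2).random((32, 32))
x = build_input(ms, pan)
```

Since the stack is already at the target (PAN) resolution, the network output needs no further resampling: its 4 bands are directly the pansharpened MS bands.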
Turning to implementation details, the output of the CNN is a B-band multispectral image with the same spatial resolution as the PAN component. This image should be as close as possible to the ideal image acquired by a multispectral sensor operating at the same spatial resolution as the PAN. Of course, this latter image does not exist. This raises problems not only for performance assessment but, in deep learning, also for network training. We will address both problems using the Wald protocol [41], which consists in working on a downsampled multi-resolution image, for which the original MS component represents a valid reference.
Training based on the Wald protocol is summarized in Figure 2. In the upper channel, the original image is downsampled, then its MS component is interpolated, and the resulting (B + 1)-band image is tiled and fed to the CNN. In the lower channel, the original MS component is itself tiled and used as a reference. During training, the CNN parameters are adapted to produce output tiles that match the reference tiles as closely as possible. The learned parameters are eventually used for the pansharpening of the real multiresolution images, at their original resolution. Clearly, the whole procedure rests upon the hypothesis that performance does not depend critically on scale, which is not always the case, as discussed in the Introduction. Therefore, to reduce possible mismatches, we smooth the data before downsampling, following [2,22], using a filter that matches the modulation transfer function of the sensor. Likewise, we upsample the MS component using the interpolation kernel proposed in [13]. In any case, great attention must be devoted to assessing performance not only on the low-resolution data but also on the original images.
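A minimal sketch of how one reduced-scale (input, reference) pair could be generated under the Wald protocol. For simplicity, plain block averaging replaces the MTF-matched smoothing filter and nearest-neighbour replication replaces the interpolation kernel of [13]; all names are our own.

```python
import numpy as np

def degrade(band, scale=4):
    """Reduce resolution by block averaging (a crude stand-in for
    MTF-matched smoothing followed by decimation)."""
    h, w = band.shape
    return band.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))

def wald_pair(ms, pan, scale=4):
    """Build one training pair at reduced scale: the degraded MS + PAN
    form the network input; the original MS is the reference."""
    ms_lr = np.stack([degrade(ms[..., b], scale)
                      for b in range(ms.shape[-1])], axis=-1)
    pan_lr = degrade(pan, scale)
    # Interpolate the degraded MS back up to the (reduced) PAN scale.
    ms_up = ms_lr.repeat(scale, axis=0).repeat(scale, axis=1)
    x = np.concatenate([ms_up, pan_lr[..., None]], axis=2)
    return x, ms   # (B + 1)-band input tile, B-band reference tile

ms = np.random.default_rng(3).random((32, 32, 4))   # original MS
pan = np.random.default_rng(4).random((128, 128))   # original PAN
x, ref = wald_pair(ms, pan)
```

The key point is that the reference `ref` is real acquired data, not a synthetic ground truth: the scale of the whole problem is shifted down so that the original MS can serve as the target.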
Learning has been carried out as proposed in [36], and we used exactly the same setting, briefly summarized here. It is based on backpropagation with stochastic gradient descent. Each updating iteration refers to a batch of 128 input tiles selected at random from the training set. As loss function L, we use the mean square error between the pansharpened tile X̂ and its reference X, averaged over the batch,

L(W) = (1/128) ∑_{n=1}^{128} ‖X̂_n(W) − X_n‖²

(to account for the border effects of the filtering, suitably cropped versions of X̂ and X are involved in the computation),
where W is the set of all parameters, namely, filter weights and biases. Stochastic gradient descent uses a momentum term to reduce randomness; therefore, iteration (i + 1) reads

V_{i+1} = µ V_i − α ∂L/∂W_i,    W_{i+1} = W_i + V_{i+1},

where µ is the momentum and α the learning rate. Following the indications given in [36], we set µ = 0.9 and α = 10^−4, except for the last layer, where we set α = 10^−5, since a smaller learning rate in the last layer is known to speed up convergence. In all cases, the total number of iterations has been fixed to 1.12 × 10^6. Additional details about learning can be found in [36]. Figure 3 shows a sample result of the proposed pansharpening method. On the left is the input multiresolution image, acquired by the GeoEye-1 sensor, with PAN and MS components at their actual scales (all MS images, here and in the following, are projected onto a suitable RGB space). On the right is the result of the pansharpening process and, in the center, for reference, the interpolated multispectral component. The result looks quite satisfactory: the spatial resolution of the PAN component is fully preserved, and colors, when the comparison makes sense, are very similar to those of the interpolated MS component, suggesting the absence of spectral distortions. In Section 4, this positive judgement will be supported by strong numerical evidence.
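The loss and the momentum update can be sketched as follows, on a toy quadratic objective rather than the actual network; the function names are illustrative.

```python
import numpy as np

def mse_loss(xhat, x):
    # Mean square error between pansharpened and reference tiles.
    return np.mean((xhat - x) ** 2)

def sgd_momentum_step(w, v, grad, mu=0.9, alpha=1e-4):
    """One SGD-with-momentum update:
    v <- mu * v - alpha * grad;  w <- w + v."""
    v = mu * v - alpha * grad
    return w + v, v

# Toy quadratic loss L(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(1000):
    w, v = sgd_momentum_step(w, v, grad=w)
# With mu = 0.9 the effective step is roughly alpha / (1 - mu),
# so the parameters decay steadily toward the minimum at zero.
```

The velocity term v accumulates gradients across iterations, smoothing out the randomness introduced by the small 128-tile batches.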

Remote-Sensing Specific Architecture
Although the basic architecture already provides very good results, we set out to analyze its behavior over remote sensing imagery in more depth, with the goal of proposing some sensible improvements based on the available prior knowledge. In particular, we focus on the first-layer filters, whose behavior may be interpreted quite easily, while this is much more difficult for layers further away from the input.
With reference to WorldView-2 images, hence with nine input bands, Figure 4a plots the energies of the feature maps at the output of the first layer, sorted in decreasing order. A sharp energy drop is visible after feature 48, suggesting that some of the filters are not really capturing valuable information and could be removed altogether. This interpretation is reinforced by the analysis of the filter responses, some of which are shown in Figure 5. In each column, we show the 9 × 9 spatial kernels associated with the 8 spectral components of the input plus the panchromatic component, indicated by an abbreviation on the left. For the sake of clarity, we only show some high-energy filters, on the left, and then some filters around the transition highlighted before. Filters beyond #48 exhibit noise-like kernels, confirming that they do not match relevant features and are not effectively trained by the network [42]. On the contrary, the other filters exhibit well-defined patterns. Spatial structures emerge mainly in the last component, associated with the panchromatic band, while kernels associated with spectral bands are spatially smooth. This is clearly due to the different resolutions of the original bands. The components associated with spectral bands, instead, show strong energy variations, suggesting spectral selectivity. In practice, they extract spectral signatures associated with the image land covers. For example, filter #7 matches the vegetation quite well. In fact, a well-known indicator for vegetation is the normalized difference vegetation index (NDVI), see Equation (7), which basically computes a normalized difference between the responses in the near-infrared and red bands. Accordingly, filter #7 has a strongly positive kernel in the 7th patch, associated with the near-infrared component, and a negative one in the 5th patch, associated with the red component. Similar considerations apply to filter #48, which correlates with the water signature. In fact, it differentiates between the first three spectral bands (coastal, blue, and green) and those from the 6th to the 8th (near infrared). Notably, this matches pretty well the definition of the normalized difference water index (NDWI), see Equation (6). (This filter has very small energy because of the scarcity of water samples in the dataset used for training. This is confirmed by the noisy shape of the filter, which indicates the need for further training to increase its regularity. However, the network is quite robust anyway, as other filters match the water class, although less obviously than filter #48.) Beyond the two above-mentioned feature maps, many more, both in the first and in the second layer, exhibit a remarkable correlation with some class-specific radiometric indices. This interesting behavior, with the network trying to mimic well-established features, motivated us to add some such indices in order to "guide" (and hopefully boost) the learning process. Indeed, this intuition was supported by experimental results. A possible explanation lies in the relative invariance of the normalized indices with respect to illumination. Invariance, in fact, improves the network's capability to "generalize" and hence to perform well on data that are scarcely (or not at all) represented in the training set. Therefore, based on numerical results and on the above considerations, we augmented the input by adding further planes, corresponding to some well-known indices. In particular, we considered the following nonlinear radiometric indices, computed from the multispectral components (see also Table 2):

• Normalized Difference Water Index (NDWI);
• Normalized Difference Vegetation Index (NDVI);
• Normalized Difference Soil Index (NDSI; applies to WorldView-2 only);
• Non-Homogeneous Feature Difference (NHFD; applies to WorldView-2 only).

Radiometric indices can be defined in many different ways [43], and indeed we use different definitions of NDVI and NDWI for the four-band (Ikonos/GeoEye-1) and eight-band (WorldView-2) images, since the number and meaning of the bands change. Moreover, two of the indices are defined only for the richer WorldView-2 images. More indices proposed in the remote sensing community could be tested, but selecting the most informative among them is outside the scope of this work.
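As an illustration, two index planes can be computed and stacked onto the input as follows. The standard four-band NDVI/NDWI definitions are used here; the paper's exact band combinations, in particular for the eight-band WorldView-2 case, may differ, and all names are our own.

```python
import numpy as np

def nd_index(a, b, eps=1e-12):
    """Generic normalized difference (a - b) / (a + b),
    guarded against division by zero."""
    return (a - b) / (a + b + eps)

def index_planes(ms):
    """NDVI and NDWI planes for a 4-band (B, G, R, NIR) image,
    using the standard definitions."""
    blue, green, red, nir = (ms[..., i] for i in range(4))
    ndvi = nd_index(nir, red)      # vegetation: high NIR, low red
    ndwi = nd_index(green, nir)    # water: high green, low NIR
    return np.stack([ndvi, ndwi], axis=-1)

ms = np.random.default_rng(6).random((8, 8, 4)) + 0.1
planes = index_planes(ms)
# Augmented input: original bands plus nonlinear index planes.
x_aug = np.concatenate([ms, planes], axis=-1)
```

Being ratios, the index planes lie in (-1, 1) regardless of the overall illumination level, which is precisely the invariance property that motivates their inclusion.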
Figure 4b shows the distribution of filter energies after augmenting the input with the nonlinear radiometric indices. The number of inactive filters drops from 16 to 9, and several filters correlate well with the new indices. Moreover, filter energy grows in general, also for filters that were already active. These facts seem to support our conjecture that the remote sensing specific features do serve as a guidance for the network, allowing it to better address the pansharpening task. Stronger support will come from the experimental evidence.
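The energy ranking used to identify inactive filters can be sketched as follows, on synthetic feature maps; the function name is illustrative.

```python
import numpy as np

def feature_map_energies(maps):
    """Mean-square energy of each feature map, sorted in decreasing
    order; maps has shape (H, W, C). Returns energies and the
    original map indices in ranked order."""
    e = np.mean(maps ** 2, axis=(0, 1))
    order = np.argsort(e)[::-1]
    return e[order], order

rng = np.random.default_rng(5)
# Simulate 64 first-layer maps where the last 16 are near-inactive,
# mimicking the energy drop observed after feature 48.
maps = rng.normal(0.0, 1.0, (16, 16, 64))
maps[..., 48:] *= 0.05
energies, order = feature_map_energies(maps)
```

Maps whose ranked energy falls below the knee of this curve are candidates for removal, since their noise-like kernels contribute little to the reconstruction.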
Besides augmenting the input, we tested further minor variations of the basic architecture, modifying the number of filters and their spatial support. Although a wide range of alternative architectures have been tested, we will focus on a meaningful subset of them, obtained by changing the hyperparameters B_rad (hence c_1), K_1, and c_2, as summarized in Table 4. Changes in the second and third layers have also been tested but eventually discarded, as ineffective.

Results and Discussion
Several experiments have been carried out to find the optimal setting for the proposed technique and to assess its performance in comparison with several state-of-the-art references. To this end, three datasets have been designed, with images acquired by the Ikonos, GeoEye-1, and WorldView-2 multiresolution sensors, all characterized by a very high spatial resolution, below one meter for the panchromatic band (sensor characteristics are summarized in Tables 1 and 2). In all cases, the MS component has a spatial resolution 4 × 4 lower than the PAN. Datasets are divided (see Table 5) into disjoint training, validation, and test subsets. The first two subsets are used for CNN training, while the last one is used for the performance assessment of all techniques. For performance assessment, we use a number of metrics proposed in the literature, since no single one can be considered by itself a reliable indicator of quality. In particular, besides several full-reference metrics, which measure performance on the reduced-resolution dataset according to the Wald protocol, we also consider the no-reference QNR [44] and two derived metrics, which work on the full-resolution dataset. In fact, while these latter indicators may be considered less objective than the former, they are insensitive to possible scale-related phenomena. In Table 6, all these metrics are listed together with the corresponding references for the interested reader.

Comparing Different Networks
In this subsection we report the results of a series of experiments aimed at finding optimal settings for the proposed method, both with and without the external features derived from nonlinear radiometric indices.
Table 7 reports the performance achieved on WorldView-2 images as a function of the parameters c_1, K_1, and c_2. In this case, c_1, the number of input bands, may be only 9 (without external features) or 13 (with them). For the first-layer filters, with support K_1 × K_1, we tested small, medium, and relatively large receptive fields, with K_1 = 5, 9, and 15, respectively. Finally, we considered c_2 = 48, 56, or 64 features at the output of the first layer. In both cases, with and without external features, using large filters, K_1 = 15, causes some performance loss. This can be easily explained by looking at Figure 7a, showing the evolution of the loss function (computed on the validation set) during training. Apparently, 10^6 iterations are not enough to reach convergence with K_1 = 15. On the other hand, even at convergence (after four million iterations), results are just on par with the other cases. This is likely due to overfitting problems, typically occurring when the training dataset is not large enough, as well explained in [42]. This same limitation observed in width applies in depth as well, as already pointed out in [36] for the super-resolution problem. It would certainly be interesting to test more complex networks, e.g., with larger supports and/or deeper architectures, as we might observe more complex filter patterns than those shown in Figure 5, but lacking a proportionally larger training dataset (not available in this work), it would not make sense. Therefore, in particular, we decided not to investigate the K_1 = 15 option further. Concerning the comparison between K_1 = 5 and K_1 = 9, results indicate clearly that the former choice improves spectral accuracy (see D_λ), and the latter spatial accuracy (see D_S), as could be expected. As for the number of output features, they can be reduced from 64 to 48 with no significant accuracy loss, especially in case no external features are used. Finally, let us focus on the comparison between c_1 = 9,
and c_1 = 13, that is, on the impact of the external radiometric indices. To this end, we have highlighted in color the best result for each indicator, both for c_1 = 9 (red, in the upper part of the table) and for c_1 = 13 (blue, in the lower part). The injection of external features consistently improves the best performance. This behavior is also confirmed by the analysis of Figure 7, showing the loss function evolution for architectures with different filter supports (left) and numbers of output features (right), with and without radiometric indices. For each basic configuration (solid line), the corresponding architecture with augmented input (dashed line) provides a significant reduction of the loss. In Tables 8 and 9, we report the results obtained on the other two datasets. Apart from numerical differences, it seems safe to say that the same phenomena observed for the WorldView-2 dataset emerge in these cases as well.

Comparison with the State of the Art
Let us now compare the proposed method (denoted PNN in the following, for CNN-based Pansharpening) with a number of state-of-the-art reference techniques. We tested all the techniques analyzed in the recent review paper [2], whose implementations have been made available online [50]. However, in the following tables we report results only for the most competitive ones, listed below, according to the quality indices adopted. We also included the recently proposed C-BDSD technique [26], an improvement of BDSD, which is itself already among the best references; it has been implemented and configured according to the author's indications. As for the proposed method, we selected for each dataset the architecture that provided the best QNR figure, always with the nonlinear radiometric indices included. With this choice we give more emphasis to full-resolution results, which are more indicative of real-world operation, although obtained with a no-reference metric. Software and implementation details will be made available online [51] to ensure full reproducibility. Again, we provide results separately for the three datasets in Tables 10-12. Performance figures are obtained by averaging over all test images of each dataset. Numerical results speak very clearly in favor of the proposed solution. Let us focus, for the time being, on the WorldView-2 dataset (Table 10). Under all full-reference metrics, CNN-based pansharpening guarantees a large performance gain with respect to all references, with MTF-GLP-HPM being the closest challenger. The situation changes somewhat with the no-reference full-resolution measures, where the gap is narrower and, in a single case, for metric DS, BDSD exhibits better performance. Notice also that, with these metrics, MTF-GLP-HPM becomes the worst reference, while BDSD and C-BDSD are very competitive. This sheds further light on the importance of using multiple metrics and careful visual inspection to complement numerical indications. Notice also that, with the exception of BDSD and C-BDSD for no-reference metrics, the performance gap between the proposed solution and all references is much wider than the gap among the various tested CNN architectures, testifying to the robustness of this approach with respect to the selected hyperparameters.
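The QNR figure used above to select the architecture combines the two no-reference distortions, Dλ (spectral) and DS (spatial), as is standard in the pansharpening literature: QNR = (1 − Dλ)^α (1 − DS)^β, commonly with α = β = 1. A minimal sketch of this combination:

```python
def qnr(d_lambda, d_s, alpha=1.0, beta=1.0):
    """Quality with No Reference: combines spectral (D_lambda) and spatial
    (D_S) distortions into a single score in [0, 1]; higher is better.
    Exponents alpha and beta (commonly both 1) weight the two terms."""
    return ((1.0 - d_lambda) ** alpha) * ((1.0 - d_s) ** beta)

# A method with lower combined distortion scores higher:
print(qnr(0.05, 0.10))  # ≈ 0.855 (0.95 × 0.90)
```

A perfect result (both distortions zero) gives QNR = 1, which is why higher QNR is taken as better full-resolution quality.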
The analysis of the results obtained on the other datasets, reported in Tables 11 and 12, highlights some differences in the behavior of the reference techniques, but the proposed solution remains largely preferable, especially for GeoEye-1 data, where the gap with all competitors, including BDSD and C-BDSD, becomes very large, even for no-reference metrics.
To add a further piece of information, Table 13 reports, for each dataset and for some of the performance metrics considered before, the ranking of the compared techniques averaged over all test images. Considering full-reference metrics, the proposed method proves almost always the best over the images of all datasets. Only with the no-reference metric QNR, and for two datasets out of three, do BDSD or C-BDSD happen to be preferable for some images, although the proposed solution remains the first choice on average. It is also worth noting that the proposed solution is robust with respect to data resolution. Considering the original datasets and their 4 × 4 subsampling, our experiments spanned a resolution range of three octaves, with the proposed method consistently outperforming all references. A further performance gain could be achieved by relaxing the constraint on the number of iterations (at the cost of increased design time), as is clear from Figure 7, where the loss functions have not yet reached convergence. Finally, as for all machine learning methods, the availability of more training data would certainly benefit performance and robustness.
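The average-ranking summary of Table 13 can be sketched as follows: for each test image, methods are ranked by a chosen metric, and the ranks are then averaged over all images. The method names and scores below are made up for illustration only.

```python
def average_ranks(scores_per_image, higher_is_better=True):
    """scores_per_image: list of dicts {method: score}, one dict per image.
    Returns {method: mean rank}, where rank 1 is the best method."""
    totals = {}
    for scores in scores_per_image:
        ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
        for rank, method in enumerate(ordered, start=1):
            totals[method] = totals.get(method, 0) + rank
    n = len(scores_per_image)
    return {m: t / n for m, t in totals.items()}

# Two toy images scored by a higher-is-better metric (e.g., QNR):
images = [{"PNN": 0.92, "BDSD": 0.90, "PRACS": 0.85},
          {"PNN": 0.88, "BDSD": 0.91, "PRACS": 0.80}]
print(average_ranks(images))  # PNN: 1.5, BDSD: 1.5, PRACS: 3.0
```

Averaging ranks rather than raw scores makes the summary insensitive to the different numerical scales of the metrics.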

Visual Inspection
We conclude our analysis with a careful visual inspection of the results. Numerical indicators, in fact, are only proxies for quality, especially useful during the development phase. Only visual inspection allows one to discover a number of artifacts and distortions that elude quantitative analyses. Therefore, for each dataset, we selected a sample image including both urban and natural areas, to allow for a meaningful comparison of pansharpening quality among competing techniques. Together with the reference image, we show (images should be analyzed on a computer screen with suitable zoom) the output of the proposed method and of a subset of references, chosen for their variety and good objective performance.
First, in Figures 8-10, we show the low-resolution versions, used to compute full-reference metrics through the Wald protocol. For these images a valid reference exists, namely, the original multispectral component, shown on the left. The pansharpened images produced by the proposed method look very similar to the original MS components, with comparable spatial resolution and without noticeable artifacts or spectral distortions. C-BDSD returns images that look even sharper than the original, with many more high-frequency details, for example in the vegetated areas; it thus seems to provide an enhanced version of the target image. It may be debated, though, whether this is desirable for a pansharpening method. Certainly, all its images, and markedly the GeoEye-1 one, present a significant spectral distortion, especially visible on the rooftops. Indusion is characterized by a subtle diffused blurring and some strong artifacts; see the highway in the GeoEye-1 image. The behavior of PRACS is not uniform over the images: it works quite well on the WorldView-2 and GeoEye-1 images, but introduces a very strong blurring in the Ikonos image. ATWT-M3, on the contrary, provides images of consistently very low quality, with intense blurring, arguably worse than the low-resolution original, despite its relatively good performance in terms of no-reference metrics. On the other hand, as will soon become obvious, the performance of ATWT-M3 on the actual data is not nearly as bad, raising further concerns about metric reliability, especially in the presence of scale issues.
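The low-resolution evaluation setup just described follows the Wald protocol: both PAN and MS are degraded by the resolution ratio, pansharpening is applied to the degraded pair, and the original MS serves as the reference. The simplified Python sketch below uses plain block averaging in place of the sensor-MTF-matched low-pass filtering normally used in practice; function names are ours.

```python
# Simplified sketch of the Wald protocol. Both PAN and MS are degraded by
# the resolution ratio before pansharpening, so the original MS can serve
# as the reference. Block averaging here stands in for MTF-matched filtering.

def block_average(img, factor):
    """Downsample a 2D list by averaging non-overlapping factor x factor blocks."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(0, h, factor):
        row = []
        for j in range(0, w, factor):
            block = [img[i + di][j + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

def wald_protocol_pair(pan, ms_bands, ratio=4):
    """Return (degraded PAN, degraded MS bands, reference MS bands)."""
    pan_lr = block_average(pan, ratio)
    ms_lr = [block_average(b, ratio) for b in ms_bands]
    return pan_lr, ms_lr, ms_bands  # original MS acts as the reference

pan = [[float(i + j) for j in range(8)] for i in range(8)]   # 8x8 PAN
ms = [[[float(i * j) for j in range(4)] for i in range(4)]]  # one 4x4 MS band
pan_lr, ms_lr, ref = wald_protocol_pair(pan, ms, ratio=4)
print(len(pan_lr), len(ms_lr[0]))  # 2 1
```

The pansharpened output of the degraded pair is then compared against `ref` with the full-reference metrics of Table 6.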
In Figures 11-13 we show the actual full-resolution images obtained through pansharpening. This time, no ground truth exists. However, we show on the left, as a reference, the MS component up-sampled with bicubic interpolation. By comparison, one can appreciate the new details introduced by pansharpening and, at the same time, spot possible spectral distortions in homogeneous regions. As a general remark, most pansharpened images exhibit very good quality, with an impressive improvement with respect to the interpolated MS, especially visible in the enlarged box. In particular, the proposed method appears to be one of the most effective, with a consistent performance over all sensors. At a closer inspection, however, a number of problems emerge again. For the proposed method, a subtle pattern is visible in some homogeneous areas, probably related to the high-degree interpolation polynomial. C-BDSD, as before, is mainly affected by spectral distortion, with a clear over-saturation, especially for the green and red hues. Indusion exhibits significant spatial distortion (mostly ringing artifacts) in the urban areas, while PRACS (in the Ikonos image) and ATWT-M3 (always) introduce spatial blurring, although less than in the low-resolution case. All these observations are also supported by the detail images (the difference between the pansharpened image and the interpolated reference) shown in the bottom row of each figure. The colored areas in the C-BDSD detail images testify to the mentioned spectral distortion, while the low energy of the PRACS and especially the ATWT-M3 detail images underlines the inability of these techniques to inject all the necessary high-resolution details into the pansharpened image.
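The detail-image diagnostic mentioned above can be sketched as a per-band difference, whose energy gives a rough measure of how much high-resolution detail a method injected. This is an illustrative sketch with invented values, not the paper's code.

```python
# Sketch of the "detail image" diagnostic: the per-band difference between
# the pansharpened output and the interpolated MS reference. Low residual
# energy indicates that a method injected few high-resolution details.

def detail_image(pansharpened, interpolated):
    """Per-pixel difference of two same-sized 2D lists (one band)."""
    return [[p - q for p, q in zip(prow, qrow)]
            for prow, qrow in zip(pansharpened, interpolated)]

def energy(img):
    """Mean squared value of a 2D list; a rough 'injected detail' measure."""
    n = sum(len(row) for row in img)
    return sum(v * v for row in img for v in row) / n

fused = [[0.5, 0.7], [0.6, 0.9]]    # toy pansharpened band
interp = [[0.5, 0.5], [0.5, 0.5]]   # toy interpolated reference band
d = detail_image(fused, interp)
print(round(energy(d), 4))  # 0.0525
```

Strong color in the multi-band residue points to spectral distortion, while near-zero energy (as for ATWT-M3) points to insufficient detail injection.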

Implementation Details
To conclude this section, we provide some implementation details for the proposed method. All implemented PNNs have approximately the same complexity and incur about the same computational cost in both the learning and testing phases. In particular, the learning phase takes about half a day and is carried out on a GPU (Nvidia Tesla K20c with CUDA 7.5 and CuDNN v3) through the deep learning framework Caffe [52]. Testing, instead, is performed on a CPU (a PC with a 2 GHz Intel Xeon processor and 64 GB of RAM) and implemented in MATLAB by means of MatConvNet [53]; on a 320 × 320 image (in the MS domain) it takes about 20 seconds, comparable to the time required by the reference algorithms.

Conclusions
Deep learning has proven extremely effective in a large number of image processing and computer vision problems. Based on this observation, we decided to use convolutional neural networks to address the pansharpening task, adapting an architecture recently proposed for super-resolution. To improve performance without resorting to deeper architectures, which would call for huge training sets and much higher complexity, we augmented the input by including several maps of nonlinear radiometric indices. This version proved uniformly superior to the basic architecture. We tested the proposed method against a number of state-of-the-art references on three multiresolution datasets, obtaining very good performance under all metrics, both full-reference and no-reference, and also in terms of subjective quality.
We are already working to improve the proposed method, following the same approach taken in this paper, that is, using the large body of knowledge available in the remote sensing field to exploit the full potential of deep learning. In particular, building on our own previous work, we will test the use of further external inputs, such as textural features [54] or information derived from an external segmenter [1,55]. Another line of research concerns training the network with a loss function aligned with the typical no-reference metrics used for pansharpening.

Figure 2 .
Figure 2. CNN training through the Wald protocol.

Figure 3 .
Figure 3. Sample result of the proposed pansharpening method. From left to right: input multiresolution image acquired by the GeoEye-1 sensor, interpolated MS, pansharpened image.

Figure 4 .
Figure 4. Energy of first-layer filters without (a) and with (b) nonlinear radiometric indices.

Figure 5 .
Figure 5. A subset of first-layer filter responses, column-wise re-scaled to compensate for different gains. Filters #7 and #48, highlighted, correspond roughly to vegetation and water features.

Figure 6
Figure 6 provides further insight into this interpretation. It shows, on the left, a color image obtained by combining three suitable bands of a WorldView-2 image, and then the feature maps associated with filter #7 (center) and filter #48 (right). In the first map, all vegetated areas are clearly highlighted, although some "errors" of interpretation can also be spotted in the top-left part of the image. Likewise, the second map highlights very clearly water basins, again with some scattered "errors".

Figure 6 .
Figure 6. Some first-layer feature maps obtained for a sample WorldView-2 image. From left to right: RGB composition, feature map #7, feature map #48. The selected feature maps highlight quite accurately vegetation and water, respectively.
Sensor        PAN                    MS
Ikonos        0.82 m GSD at nadir    3.28 m GSD at nadir
GeoEye-1      0.46 m GSD at nadir    1.84 m GSD at nadir
WorldView-2   0.46 m GSD at nadir    1.84 m GSD at nadir

Table 2 .
Spectral bands of the Ikonos, GeoEye-1, and WorldView-2 sensors. For each band provided by a sensor, the wavelength range (in nm) is reported.

Table 3 .
Default CNN architecture in case of a multispectral component with B = 4 bands and without external features. c1 is the total number of input bands.

Table 4 .
Hyper-parameters of the CNN architectures for pansharpening. Brad: number of nonlinear radiometric indices; K1 × K1: spatial support of the first-layer filters; c2: number of first-layer filters.

Table 6 .
Full-reference (low resolution) and no-reference (full resolution) performance metrics.

Table 7 .
Performance of various configurations of the proposed architecture on WorldView-2 images without (top) and with (bottom) nonlinear radiometric indices.Best result in red and blue, respectively.

Table 8 .
Performance of various configurations of the proposed architecture on Ikonos images without (top) and with (bottom) nonlinear radiometric indices.Best result in red and blue, respectively.

Table 9 .
Performance of various configurations of the proposed architecture on GeoEye-1 images without (top) and with (bottom) nonlinear radiometric indices.Best result in red and blue, respectively.

Table 10 .
Performance comparison on the WorldView-2 dataset.

Table 11 .
Performance comparison on the Ikonos dataset.

Table 13 .
Average ranking over test images for a few metrics.