1. Introduction
High spatial resolution multispectral (HR MS) images contain abundant spatial and spectral information, which helps in interpreting the recorded scenes in applications such as environmental monitoring [1] and land survey [2]. However, owing to the limitations of imaging techniques, it is difficult for a remote sensing image to achieve high spatial and high spectral resolution simultaneously. Most satellites, such as QuickBird and GeoEye-1, capture only a high spatial resolution panchromatic (PAN) image and a low spatial resolution multispectral (LR MS) image. Therefore, pan-sharpening is employed to integrate the spatial and spectral information in the PAN and MS images and generate HR MS images [3].
Over the past two decades, many pan-sharpening methods have been proposed. According to their paradigms, these methods can be divided into four categories: component substitution (CS)-based methods, multiresolution analysis (MRA)-based methods, model-based methods, and deep neural network (DNN)-based methods. In the first category, a linear transform projects the up-sampled LR MS image into a new space, in which the image is decomposed into spatial and spectral components. The spatial component is then replaced by the histogram-matched PAN image, and the HR MS image is obtained by applying the inverse transform to the new components. CS-based methods generally use the intensity–hue–saturation (IHS) [4], principal component analysis (PCA) [5], or Gram–Schmidt (GS) [6] transform to sharpen the LR MS image. To estimate the spatial component of the LR MS image adaptively, adaptive GS (GSA) [7] was proposed, in which the combination weights are calculated by minimizing the mean square error. To enhance the spatial details in different bands of the LR MS image efficiently, a band-dependent spatial detail (BDSD) model [8] was proposed, in which the combination weights of the different bands are estimated adaptively. Recently, robust versions of BDSD were developed in [9] to obtain better fusion results. In addition, Choi et al. [10] proposed partial replacement adaptive CS (PRACS), in which the spatial component of the LR MS image is only partially replaced by the PAN image. CS-based methods are simple and straightforward to implement. However, their fusion results generally suffer from spectral distortions.
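The substitution pipeline described above can be illustrated with a minimal sketch. It is not any of the cited methods: it assumes an equal-weight intensity component (a generalized-IHS-style simplification), approximates histogram matching by matching mean and standard deviation, and uses the hypothetical function name `cs_pansharpen`. Under a linear transform, substituting the spatial component and inverting reduces to adding the same detail image to every band.

```python
import numpy as np


def cs_pansharpen(ms_up, pan):
    """Toy component-substitution fusion.

    ms_up: (H, W, B) LR MS image already up-sampled to the PAN grid.
    pan:   (H, W) PAN image.
    """
    ms_up = ms_up.astype(np.float64)
    pan = pan.astype(np.float64)

    # Spatial (intensity) component: equal-weight band average (assumption).
    intensity = ms_up.mean(axis=2)

    # Histogram matching approximated by matching mean and standard deviation.
    pan_std = pan.std()
    scale = intensity.std() / pan_std if pan_std > 0 else 1.0
    pan_matched = (pan - pan.mean()) * scale + intensity.mean()

    # Substitution followed by the inverse transform collapses to
    # adding the difference (PAN - intensity) to each band.
    detail = pan_matched - intensity
    return ms_up + detail[..., None]


# Usage with synthetic data (values are random, for shape checking only).
rng = np.random.default_rng(0)
fused = cs_pansharpen(rng.random((8, 8, 4)), rng.random((8, 8)))
```

Because the same detail image is added to every band regardless of its spectral response, this kind of scheme can shift band ratios, which is one way to see where the spectral distortions mentioned above come from.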
MRA-based methods assume that the spatial information missing from the LR MS image can be found in the corresponding PAN image. Thus, a multiresolution decomposition is applied to the PAN image to extract spatial details, which are then injected into the up-sampled LR MS image to produce the pan-sharpened image. In this category, high-pass filters are designed to extract the spatial information, as in Indusion [11] and the generalized Laplacian pyramid (GLP) [12]. By integrating the modulation transfer function (MTF), MTF-GLP [13] was proposed for a more accurate extraction of spatial details. MTF-GLP was then further extended by combining it with high-pass modulation (HPM) [14]. Furthermore, some advanced MRA tools [15,16] were introduced to represent the spatial information in PAN and LR MS images. For example, Shah et al. [15] utilized the nonsubsampled contourlet transform (NSCT) to enhance the spatial details in the LR MS image. Following the decomposition framework, some MRA-like filters [17,18] were constructed to infer more reasonable spatial information. The fused images of MRA-based methods preserve spectral information better because only spatial details are injected into the up-sampled LR MS image. However, their spatial quality depends heavily on the filter, which is usually designed empirically and should take the MTF of the imaging sensor into account [19].
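The detail-injection idea can be sketched in a few lines. The sketch below is illustrative only: it uses a plain box filter as the low-pass stage, which is a crude stand-in for the MTF-matched filters of the cited methods, and the function names `box_lowpass` and `mra_pansharpen` as well as the `gains` parameter are hypothetical.

```python
import numpy as np


def box_lowpass(img, radius=2):
    """Mean filter with edge padding (stand-in for an MTF-matched low-pass filter)."""
    k = 2 * radius + 1
    padded = np.pad(img, radius, mode="edge")
    out = np.zeros(img.shape, dtype=np.float64)
    # Sum the k x k shifted copies of the padded image, then normalize.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)


def mra_pansharpen(ms_up, pan, radius=2, gains=None):
    """Toy MRA fusion: inject the high-pass PAN details into each MS band.

    ms_up: (H, W, B) up-sampled LR MS image; pan: (H, W) PAN image.
    gains: optional per-band injection gains (defaults to 1 for every band).
    """
    ms_up = ms_up.astype(np.float64)
    pan = pan.astype(np.float64)
    details = pan - box_lowpass(pan, radius)  # high-pass spatial details
    if gains is None:
        gains = np.ones(ms_up.shape[2])
    return ms_up + details[..., None] * np.asarray(gains, dtype=np.float64)
```

Only the high-frequency part of the PAN image reaches the output, so a spatially constant PAN image leaves the MS image unchanged. The choice of low-pass filter determines which details are extracted, which is why the filter design matters so much for this family.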
The third category assumes that the LR MS image is a spatially degraded version of the HR MS image and, similarly, that the PAN image is a spectrally degraded version of the HR MS image. Thus, the relationships between the source images and the HR MS image can be encoded in spatial and spectral degradation models, and the desired HR MS image is obtained by solving these models. To regularize their solution space, various priors [20,21,22] have been employed. For instance, sparsity is a popular prior that has been investigated extensively. Zhang et al. [23] designed a structural sparsity term to regularize the spatial and spectral degradation models. Palsson et al. [24] combined total variation (TV) regularization with the degradation models to fuse the LR MS and PAN images. Furthermore, to find more effective priors, Liu et al. [25,26] explored the Hessian prior in the gradient domain of images. In [27], a variational method, P + XS, was also proposed to fuse the LR MS and PAN images. Effective priors strongly constrain the solution space of the spatial and spectral models, so more accurate HR MS images can be estimated. However, in complex scenes, the priors adopted in these methods may not hold, which limits their generalization. Moreover, model-based methods are generally solved by iterative optimization algorithms, so their computational complexity cannot be ignored.
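One common way to write the two degradation models and the resulting regularized problem is sketched below. The notation is an assumption made for illustration and is not taken from the cited works: $\mathbf{X} \in \mathbb{R}^{N \times B}$ is the HR MS image with $N$ pixels and $B$ bands, $\mathbf{M} \in \mathbb{R}^{n \times B}$ is the LR MS image, $\mathbf{p} \in \mathbb{R}^{N}$ is the PAN image, $\mathbf{H}$ is a blurring operator, $\mathbf{S}$ is a down-sampling operator, $\mathbf{w}$ holds spectral weights (linearity of the spectral degradation is itself an assumption), and $\phi$ stands for whichever prior is chosen.

```latex
\begin{aligned}
\mathbf{M} &= \mathbf{S}\mathbf{H}\mathbf{X} + \mathbf{E}_s
  && \text{(spatial degradation)} \\
\mathbf{p} &= \mathbf{X}\mathbf{w} + \mathbf{e}_p
  && \text{(spectral degradation)} \\
\hat{\mathbf{X}} &= \arg\min_{\mathbf{X}}\;
  \tfrac{1}{2}\,\lVert \mathbf{M} - \mathbf{S}\mathbf{H}\mathbf{X} \rVert_F^2
  + \tfrac{\lambda}{2}\,\lVert \mathbf{p} - \mathbf{X}\mathbf{w} \rVert_2^2
  + \mu\,\phi(\mathbf{X})
\end{aligned}
```

In this form, a sparsity, TV, or Hessian prior corresponds to a particular choice of $\phi$, and the weights $\lambda$ and $\mu$ balance the two data terms against the regularizer. The problem is typically solved iteratively, which is the source of the computational cost noted above.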
In recent years, DNNs have attracted a great deal of attention in numerous fields, especially in computer vision [28,29], owing to their powerful learning capability. For pan-sharpening, DNN-based methods also deliver state-of-the-art fusion performance. Masi et al. [30] first proposed a pan-sharpening neural network (PNN) inspired by the super-resolution convolutional neural network (CNN) in [28]. Advanced PNN (A-PNN) [31] was then proposed to improve the performance of PNN. Residual learning [32] offers an efficient framework for depicting the spatial structures in the MS image. For example, Yang et al. [33] injected the spatial details learned by a residual network (ResNet) into the up-sampled LR MS image. Wei et al. [34] developed a deep convolutional neural network with residual learning to boost the accuracy of the fusion results. The generative adversarial network (GAN) [35], which is built on a minimax game between the distributions of real and fake images, has also been used to fuse the LR MS and PAN images. Liu et al. [36] employed a GAN to synthesize the HR MS image, with two sub-networks established to extract features from the LR MS and PAN images. To alleviate the demand for supervised datasets, Ma et al. [37] adopted two discriminators to distinguish the spatial and spectral information in the fused images. Diao et al. [38] proposed a multiscale GAN framework that generates the fused image progressively, with the result at each scale judged by the corresponding discriminator.
Despite the success of DNNs in pan-sharpening, DNN-based methods focus only on the local properties of images owing to their limited receptive field. It is therefore difficult for them to capture the global similarity among images efficiently, which prevents them from modeling the various spatial and spectral structures in LR MS and PAN images. To learn the global information in images, the transformer [39] was developed by introducing the self-attention mechanism. Thus far, transformers have demonstrated tremendous potential in high- and low-level vision tasks. For instance, Yang et al. [40] employed a transformer to learn relevant textures for the super-resolution of a low-resolution image. Chen et al. [41] proposed a pre-trained transformer model, which achieved state-of-the-art performance in super-resolution and denoising. Furthermore, the contents of an image at different scales are reflected by distinct global similarities. Thus, the global similarities at different scales should be combined to reconstruct the HR MS image.
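To make the contrast with a limited receptive field concrete, the following minimal sketch implements single-head scaled dot-product self-attention over a set of image tokens. It is a generic illustration, not the module proposed in this paper; the function name `self_attention` and the projection matrices `w_q`, `w_k`, and `w_v` are hypothetical, and in practice the projections would be learned.

```python
import numpy as np


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    tokens: (T, C) array, one row per image patch (token).
    w_q, w_k, w_v: (C, D) projection matrices (random here, learned in practice).
    Returns the (T, D) attended features and the (T, T) attention weights.
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Numerically stable softmax over the key dimension.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Usage with synthetic data: 16 tokens with 8 channels, projected to 4 dimensions.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))
w_q, w_k, w_v = (rng.standard_normal((8, 4)) for _ in range(3))
features, weights = self_attention(tokens, w_q, w_k, w_v)
```

Each row of `weights` spans all tokens, so every output feature can draw on every position in the image in a single layer, whereas a convolution only sees its local window.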
To exploit the local and global properties at different scales, we propose a multiscale spatial–spectral interaction transformer (MSIT) that integrates multiscale feature maps for pan-sharpening. First, features are extracted from the PAN and LR MS images by two multiscale sub-networks based on convolution–transformer encoders. To efficiently fuse the information from the two sub-networks at different scales, we design a spatial–spectral interaction attention module (SIAM). Through the interaction of spatial and spectral attention, the redundancy among the features from the two sub-networks is reduced while their complementarity is enhanced. Finally, a multiscale reconstruction module (MRM) is constructed to generate the fused image. In this module, the features at different scales are merged from coarse to fine to recover the spatial and spectral information in the fused image scale by scale. Experimental results on different datasets show that the proposed MSIT produces better fusion results than classical and state-of-the-art methods in both objective and subjective evaluations. To the best of our knowledge, this is the first transformer for pan-sharpening that explores the spatial–spectral features of PAN and LR MS images via an interaction attention mechanism.
Our contributions are summarized as follows:
To model the local and global dependencies simultaneously, we design multiscale convolution–transformer sub-networks. The sub-networks extract spatial and spectral features from the PAN and LR MS images scale by scale to describe the local and global similarity information in the images.
We propose a spatial–spectral interaction attention module to integrate the features from different sub-networks. In SIAM, the spatial information in the concatenated feature of PAN and LR MS images is extracted by the self-attention mechanism. In the same way, the spectral information in the LR MS image is emphasized. SIAM thereby reduces the redundancy and enhances the complementarity among these features.
To efficiently integrate the local and global information in the features at different scales, we construct a multiscale reconstruction module. In MRM, the feature contents at different scales are carried into the fused image to recover the subtle spatial and spectral information.
The remainder of the paper is organized as follows. Section 2 introduces the proposed MSIT in detail, including the network structure and the loss function. Experimental results on different datasets are presented in Section 3 to show the effectiveness of the proposed MSIT. Conclusions are provided in Section 4.