1. Introduction
The study and utilization of remote-sensing images are becoming increasingly important [1], and obtaining remote-sensing satellite images with both high spatial resolution and high temporal resolution has become a critical and urgent task. Although advances in sensor technology have greatly facilitated the study of remote-sensing images [2], individual satellites are still unable to acquire high-spatial-resolution images with dense time series, mainly because of cost and technical bottlenecks [3,4].
Spatiotemporal fusion is a data post-processing technology developed to overcome these hardware limitations. The fusion process generally requires two data sources [5]. One has high spatial resolution but low temporal resolution (hereinafter referred to as fine-resolution images); for example, the Landsat-8 satellite acquires images with a spatial resolution of 30 m and a revisit period of 16 days [6]. The other has high temporal resolution but low spatial resolution (hereinafter referred to as coarse-resolution images); for example, the Moderate Resolution Imaging Spectroradiometer (MODIS) provides daily observations of the Earth, but at a spatial resolution of only about 500 m [7]. Spatiotemporal fusion methods combine the daily data acquired by MODIS sensors with the fine-resolution images acquired by Landsat satellites to generate fused images with both high spatial resolution and dense time series.
In general, spatiotemporal fusion models fall into four categories: (1) transformation-based, (2) pixel-reconstruction-based, (3) Bayesian-based, and (4) learning-based models [8,9]. In transformation-based models, original image pixels are mapped to an abstract space in which fusion is performed to obtain high-resolution data at unknown times [10]. The basic idea of pixel-reconstruction-based models is to select pixels near the target pixel to participate in its reconstruction according to a set of hand-crafted rules (a minimal sketch of this idea follows this paragraph). Typical examples include the spatial and temporal adaptive reflectance fusion model (STARFM) [11] and the spatial and temporal adaptive algorithm for mapping reflectance changes (STAARCH) [12]. Bayesian-based models [13] apply Bayesian statistical principles; examples include the unified fusion method [14] and the Bayesian maximum entropy method [15]. Bayesian-based models have advantages in handling heterogeneous input images to produce better predictions [5]. However, most of these traditional algorithms rely on conditions set in advance, are sensitive to the quality of the dataset, and their performance is often unstable.
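To make the pixel-reconstruction idea concrete, the following is a minimal, illustrative NumPy sketch of a STARFM-style prediction: the fine-resolution pixel at the prediction date is reconstructed from neighboring pixels, weighted by their spectral difference, temporal change, and spatial distance. The function and variable names are ours for illustration only; the actual STARFM algorithm includes additional rules (e.g., similar-pixel selection and quality screening) described in [11].

```python
import numpy as np

def starfm_predict_pixel(fine_t0, coarse_t0, coarse_tp, cy, cx, win=5, eps=1e-6):
    """Illustrative STARFM-style prediction of one fine pixel at date t_p.

    fine_t0, coarse_t0, coarse_tp: 2-D arrays resampled to the same grid.
    (cy, cx): target pixel; win: half-width of the search window.
    """
    h, w = fine_t0.shape
    y0, y1 = max(cy - win, 0), min(cy + win + 1, h)
    x0, x1 = max(cx - win, 0), min(cx + win + 1, w)

    weights, values = [], []
    for y in range(y0, y1):
        for x in range(x0, x1):
            # Spectral difference between fine and coarse at the base date,
            # temporal change of the coarse pixel, and spatial distance.
            s = abs(fine_t0[y, x] - coarse_t0[y, x])
            t = abs(coarse_tp[y, x] - coarse_t0[y, x])
            d = 1.0 + np.hypot(y - cy, x - cx) / win
            weights.append(1.0 / ((s + eps) * (t + eps) * d))
            # Core STARFM relation: add the coarse temporal change
            # to the fine observation at the base date.
            values.append(coarse_tp[y, x] + fine_t0[y, x] - coarse_t0[y, x])

    weights = np.asarray(weights)
    return float(np.dot(weights / weights.sum(), np.asarray(values)))
```

Such hand-crafted weighting rules are precisely what learning-based models, discussed next, aim to replace with learned mappings.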
Learning-based models have gradually been accepted and have become a new research hotspot. These models are expected to achieve better fusion results than traditional models, especially in predicting land-cover change. A learning-based fusion model essentially requires no manually designed fusion rules: it automatically learns the most useful features from input datasets of varying quality and generates high-quality fused images. At present, there are two main approaches to building learning-based models: sparse representation and deep learning [16,17]. Sparse-representation-based approaches mainly model the correspondence between pairs of fine-resolution and coarse-resolution images acquired on the same day [16] and, through this correlation, extract key feature information with which to reconstruct the fine-resolution image to be predicted. Although these methods can achieve better fusion results than traditional methods, limitations such as the high computational cost and complexity of sparse coding restrict their general applicability.
Deep-learning methods loosely mimic the working of neural structures in the human brain, in which information is continuously transmitted between neurons. The difference is that, in deep learning, information flows between network layers: a complex nonlinear mapping with a large number of learnable parameters is established, and its outputs are passed to the output layer to generate the predicted result. Among the many deep-learning architectures, the convolutional neural network (CNN) [17] has emerged as a lightweight and efficient approach to image feature extraction and image reconstruction with strong learning ability.
Researchers in the field of image fusion have increasingly turned to CNN models. The deep convolutional spatiotemporal fusion network (DCSTFN) [8] uses a CNN to extract texture and spectral feature information from fine-resolution and coarse-resolution images [18]; following the assumptions of STARFM, the extracted features are then combined and fused into the final image. DCSTFN outperforms traditional spatiotemporal fusion methods in several respects, such as the accuracy and robustness of the fused images. Song et al. proposed a hybrid spatiotemporal fusion method using deep convolutional neural networks (STFDCNN) [19], in which a single-image super-resolution CNN (SRCNN) forms the nonlinear mapping and super-resolution is applied several times; its fusion results are relatively good. The main idea of the two-stream CNN (StfNet) [20] is to learn the differences between image features at different dates in pixel space (a sketch of this difference-based prediction follows this paragraph), which allows StfNet to retain rich texture details.
Recently, Tan et al. proposed the enhanced deep convolutional spatiotemporal fusion network (EDCSTFN) [21], which builds on DCSTFN. EDCSTFN no longer relies on the linear assumptions of STARFM, so the predicted image is not constrained by the reference image: the relationship between them is learned autonomously by the network, which preserves objectivity. In addition, the CNN with attention and multiscale mechanisms (AMNet) [22] is notable for its effectiveness and novelty, and it can extract more comprehensive image feature information.
Considering that some current spatiotemporal fusion methods do not pay enough attention to extracting comprehensive features from the input images, and that their ability to capture image edge details still needs improvement, this paper proposes a convolutional neural network based on multiscale feature fusion [23,24,25], combined with an efficient spatial-channel attention mechanism, as a new spatiotemporal fusion method to alleviate these problems. Specifically, the following explorations were conducted:
- (1)
Multiscale feature fusion is introduced into the spatiotemporal fusion task to extract feature information from the input images more systematically and comprehensively across their different scales, and to improve the learning ability and efficiency of the network (a sketch of such a multiscale block is given after this list).
- (2)
An efficient spatial-channel attention mechanism is proposed, which makes the network not only consider the expression of spatial feature information but also attend to local channel information during learning, further improving the network's ability to refine the learned features (see the attention sketch after this list).
- (3)
A new edge loss function is proposed and incorporated into the compound loss function, helping the network model extract image edge information more fully. At the same time, the edge loss reduces the resource consumption and time cost of the network and lowers the complexity of the compound loss function (see the edge-loss sketch after this list).
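The following PyTorch sketch illustrates one common way to realize multiscale feature fusion: parallel convolutions with different receptive fields whose outputs are concatenated and fused by a 1x1 convolution. The module name and kernel choices are ours for illustration; the block used in this paper may differ in its details.

```python
import torch
import torch.nn as nn

class MultiscaleFusionBlock(nn.Module):
    """Illustrative multiscale block: parallel 1x1/3x3/5x5 branches, fused by 1x1 conv."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.branch3 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # Extract features at several receptive-field sizes in parallel,
        # then fuse the concatenated responses back to out_ch channels.
        feats = torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)
        return self.act(self.fuse(feats))
```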
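A minimal sketch of a combined spatial-channel attention module is shown below, pairing a squeeze-and-excitation-style channel gate with a spatial gate. This is an assumption-laden illustration of the general idea only, not the exact attention design proposed in this paper.

```python
import torch.nn as nn

class SpatialChannelAttention(nn.Module):
    """Illustrative spatial-channel attention: channel gate followed by spatial gate."""

    def __init__(self, ch, reduction=8):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid(),
        )
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(ch, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x):
        x = x * self.channel_gate(x)     # reweight channels
        return x * self.spatial_gate(x)  # reweight spatial locations
```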
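The edge loss can be sketched as follows: an edge map is extracted from both the predicted and the reference image with a fixed (non-learnable) edge operator, and the difference between the two maps is penalized. A Sobel operator is used here purely for illustration; the edge-extraction operator used in this paper may differ.

```python
import torch
import torch.nn.functional as F

def edge_loss(pred, target):
    """Illustrative edge loss: L1 distance between Sobel edge maps.

    pred, target: tensors of shape (N, C, H, W).
    """
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=pred.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    c = pred.shape[1]
    # Depthwise Sobel filtering: apply the same kernel to every band.
    kx, ky = kx.expand(c, 1, 3, 3), ky.expand(c, 1, 3, 3)

    def edges(img):
        gx = F.conv2d(img, kx, padding=1, groups=c)
        gy = F.conv2d(img, ky, padding=1, groups=c)
        return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

    return F.l1_loss(edges(pred), edges(target))
```

Because the operator is fixed rather than learned, such a term adds edge supervision at negligible parameter and computational cost, consistent with the overhead reduction claimed above.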
The remainder of this paper is organized as follows:
Section 2 describes the relevant materials and the proposed method.
Section 3 presents a series of experiments and their results, as well as an analysis of the results.
Section 4 discusses the performance and advantages of our proposed network structure on different datasets.
Section 5 summarizes the content of the full paper and provides an outlook for future work.
5. Conclusions
In this paper, a multiscale spatiotemporal fusion network based on an attention mechanism is proposed. Multiscale feature fusion is introduced into the spatiotemporal fusion task to capture the spatial details and temporal changes of remote-sensing images at different scales and thereby extract richer and more comprehensive feature information. A spatial-channel attention module filters the spatial features and channel information of the fusion network in order to retain the most important features. An edge loss function is incorporated into the compound loss function to reduce the overhead and complexity of the network and to further improve the prediction accuracy and the quality of the fused images.
The effectiveness of the proposed network is verified by the results of ablation and comparative experiments. In addition, the comparison of the residual maps shows that, in terms of spectral performance, our method yields more blue regions with values near 0, demonstrating a clear advantage of the proposed model. Taken together, our method clearly outperforms STARFM, DCSTFN, EDCSTFN and AMNet, with more accurate predictions and a richer expression of spectral information. Both the images and the quantitative indexes show that the proposed method achieves good results in subjective visual quality and objective evaluation.
However, the performance in terms of detail still needs improvement. We found that the model predicts texture edge information better in regions with richer spectral information, but the expression of the spectral information in those regions is biased. A follow-up analysis suggests that the strong edge structures produced by the edge-extraction operator we used may cause the model to focus too much on structural information and neglect the expression of spectral information. Future work should attend to spectral expression while preserving structural information, and the feasibility of applying other edge operators to spatiotemporal fusion tasks should be explored. In addition, given the good performance of traditional methods, combining traditional and deep-learning methods can also be explored in the future to further improve the quality of image fusion.