Article

Enhancing Satellite Image Sequences through Multi-Scale Optical Flow-Intermediate Feature Joint Network

1 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
2 School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(2), 426; https://doi.org/10.3390/rs16020426
Submission received: 4 December 2023 / Revised: 13 January 2024 / Accepted: 19 January 2024 / Published: 22 January 2024
(This article belongs to the Section AI Remote Sensing)

Abstract
Satellite time-series data contain information in three dimensions (spatial, spectral, and temporal) and are widely used for monitoring, simulating, and evaluating Earth activities. However, some time-phase images in a satellite time series are missing due to satellite sensor malfunction or adverse atmospheric conditions, which prevents the effective use of the data. We therefore need to complete the satellite time series through sequence image interpolation. The linear interpolation methods and deep learning methods that have been applied to sequence image interpolation produce large errors between the interpolated results and the real images because they neither estimate pixel positions accurately nor capture the changes in objects. Inspired by video frame interpolation, we combine optical flow estimation and deep learning and propose a method named Multi-Scale Optical Flow-Intermediate Feature Joint Network. This method learns pixel occlusion and detail compensation information for each channel and jointly refines the optical flow and the intermediate features at different scales through an end-to-end network. In addition, we designed a spectral loss function to optimize the network's learning of the spectral features of satellite images. We created a time-series dataset from Landsat-8 and Sentinel-2 satellite data and conducted experiments on it. Through visual and quantitative evaluation of the experimental results, we found that the interpolation results of our method retain better spectral and spatial consistency with the real images, and that the results of our method on the test dataset have a 7.54% lower Root Mean Square Error than other approaches.


1. Introduction

Big data in remote sensing refer to the massive volumes of information generated by Earth observation satellites, particularly when dealing with time series data from satellites like MODIS with frequent revisits [1,2]. Remote sensing time series data can be used for monitoring, analyzing, and predicting various natural and human phenomena, including land use/cover change detection, wetland monitoring, crop classification, disaster early warning, etc. [3,4,5,6,7].
However, satellite sensors may malfunction, or atmospheric conditions may be adverse, during the acquisition of remotely sensed data; these problems often result in missing images at intermediate time phases, making it impossible to form a complete time series [8,9]. Figure 1 shows three instances of missing remote sensing data.
One prevalent approach for filling in the gaps in a sequence of images at intermediate time phases is the use of image interpolation. This technique involves generating the missing images by utilizing information from the images at the preceding and subsequent time phases. Image interpolation is a foundational task in digital image processing and computer vision, classified into intra-image interpolation and inter-image interpolation based on the dimension of interpolation [10]. Inter-image interpolation, also termed sequence image interpolation, focuses on creating multiple new images within a sequence of existing images. Our research involves a comprehensive exploration and generalization of the methodology for sequential image interpolation.
The fundamental approach to sequential image interpolation is pixel-wise linear interpolation. This method employs a linear weighted combination of pixel values at equivalent positions within two adjacent images to compute the pixel values of the intervening image. Vandal et al. [11] compared their deep learning method with linear interpolation and showed that their method is better at enhancing the temporal resolution of remote sensing data of different spatial resolutions from 10 min to 1 min. However, a simple, uniformly weighted linear approach cannot meet the requirements for the spectral integrity of interpolated images [12].
Shape-based interpolation is a method of interpolating sequential images by extracting the shape contours of an object from an image sequence and interpolating them in accordance with the changes of an object at each time instance. Raya et al. [13] first proposed a shape-based interpolation algorithm, which first segmented the given image data into binary images, then converted the binary images back to gray images, and finally interpolated the gray images. This method is seldom applied to remote sensing data. The reason is that remote sensing data encompass a multitude of objects and intricate textures, which significantly complicates the extraction of contours.
In the last few years, deep learning has achieved good performance in satellite image processing tasks, including image fusion, image classification, object detection, and image reconstruction [14,15,16,17,18,19,20,21,22,23]. For image interpolation tasks, it employs high-parameter models to learn complex textures and spectral features, using the images from two time phases to output the intermediate phase image. Niklaus et al. [24] applied a local convolution to two input frames for intermediate frame interpolation. They employed a convolutional neural network (CNN) to learn spatially adaptive filters for each pixel, addressing occlusion, blur, and brightness changes. Jin et al. [25] introduced a separable convolution network for remote sensing sequential image interpolation. This network uses separable 1D convolution kernels to capture the spatial features of sequential images. Tests on Landsat-8 and drone data showed its effectiveness in generating high-quality time series interpolation images. For high-resolution images, Peleg et al. [26] proposed an interpolation motion neural network to improve interpolation quality. This network uses a cost-effective structured architecture, end-to-end training, and a multi-scale customized loss function; it treats interpolation motion estimation as a classification problem, improving interpolation quality and speed on high-resolution images. Deep learning-based interpolation methods are robust to occlusion, blur, and illumination changes. However, their unpredictability and uninterpretability may introduce noise.
Optical flow is a technique frequently employed in computer vision, primarily for target detection and tracking in satellite image processing [27,28]. It aims to determine the exact location of pixels at each stage within a sequence of images. This technique is instrumental in analyzing the motion and changes of objects across different frames in the sequence.
The accuracy of interpolation can be enhanced by accurately estimating pixel motion and addressing occlusion and blurring when optical flow and deep learning are integrated. This idea is frequently used in the field of video frame interpolation and effectively increases the precision and speed of interpolation. Liu et al. [29] developed a deep voxel flow (DVF) layer, akin to an optical flow field. The method captures the direction of pixel flow between adjacent frames and is trained to generate intermediate frames. Jiang et al. [30] proposed a video interpolation method called Super SloMo, based on a convolutional neural network. It first calculates the bidirectional optical flow between input frames using a U-Net [31], then linearly combines the bidirectional optical flow according to the time step to obtain an approximate intermediate optical flow. This flow is refined with another U-Net, which also predicts visibility maps; the input frames are then warped and fused, excluding the contributions of occluded pixels. Experiments showed that this approach produces high-quality intermediate frames that surpass those of existing video frame interpolation techniques. Vandal et al. [11] extended Super SloMo to multiple channels and applied it to the GOES-R/Advanced Baseline Imager mesoscale dataset. This enhances the temporal resolution of satellite data of different spatial resolutions from 10 min to 1 min, enabling more accurate weather detection.
In addition, to enhance the speed and accuracy of frame interpolation, Kong et al. [32] proposed simpler and more efficient model structures. They designed an end-to-end network, IFRNet, which extracts pyramidal features from a given input and subsequently recombines bilateral intermediate streams with robust intermediate features until the desired output is produced. Their progressive intermediate features not only aid in the estimation of intermediate information but also compensate for contextual details, thereby efficiently generating intermediate images. This approach signifies a promising advancement in the field of frame interpolation.
This paper draws inspiration from IFRNet [32], to which we have introduced modifications to tailor it for the task of interpolating sequences of satellite images. As far as we know, this is the first attempt to apply an optical flow-based interpolation method to satellite images possessing complex object characteristics. The main work in this article is as follows:
  • Given the complexity of objects and the richness of spectra in satellite imagery, we endeavor to integrate optical flow estimation with deep learning. This integration aims to develop a robust method for enhancing satellite image sequences.
  • We carried out comprehensive experiments on two time-series satellite datasets—Landsat-8 and Sentinel-2. The results demonstrate that our approach yields lower interpolation errors across the four seasons: spring, summer, autumn, and winter. This evidence substantiates that our method can be effectively applied to the interpolation of satellite image sequences in various seasons.
The remainder of this paper is organized as follows: Section 2 provides a detailed description of the satellite datasets and the proposed methodology. The experiments and their results are presented in Section 3. Section 4 offers an in-depth discussion on the rationale behind the results. Finally, Section 5 draws conclusions from the study.

2. Materials and Methods

2.1. Datasets

We selected data from two satellites, Landsat-8 and Sentinel-2, comprising eight and four time series, respectively. Their details are given in Table 1 and Table 2, and each time series contains images from five time phases. For the Landsat-8 satellite dataset, we used seven bands: coastal, blue, green, red, near-infrared, short-wave infrared-1, and short-wave infrared-2, all with a spatial resolution of 30 m and a size of 5120 × 5120 per image. Figure 2a shows two examples of time series of the Landsat-8 satellite data. For the Sentinel-2 satellite dataset, we used eight bands: blue, green, red, three vegetation red edge bands, near-infrared (wide), and near-infrared (narrow). Four of these bands (the three vegetation red edge bands and near-infrared (narrow)) have a spatial resolution of 20 m and a size of 5490 × 5490 per image; the other bands have a spatial resolution of 10 m and a size of 10,980 × 10,980 per image. The temporal resolution of the Landsat-8 dataset and the Sentinel-2 dataset is 16 days and 5 days, respectively. Figure 2b shows two examples of time series of the Sentinel-2 satellite data.
In addition, we chose the two satellite datasets from latitudes that differ by about ten degrees to ensure climate consistency. The two satellite datasets cover the four seasons: spring, summer, autumn, and winter. The cloud coverage of the images for each time phase in both datasets is within 5%.

2.2. Interpolation Model

The objective of satellite image sequence interpolation is to utilize two time-phase images, $I_0$ and $I_1$, to interpolate an image for a specific intermediate time phase $t$, thereby completing the satellite image sequence. We assume that the two time phases are 0 and 1 and denote the interpolated result as $\hat{I}_t$. The interpolation model can be represented as Equation (1):
$\hat{I}_t = f(I_0, I_1, t),$ (1)
where $t \in [0, 1]$ signifies the time interval between the intermediate time phase to be interpolated and time phase 0, and $f(\cdot)$ denotes the non-linear relationship between $I_0$, $I_1$, $t$, and $\hat{I}_t$. In this study, the non-linear relationship $f(\cdot)$ is approximated using a neural network.

2.3. Architecture of the Model

The network architecture in this paper is depicted in Figure 3 and comprises two main components: an encoder and a decoder. The network accepts the images from the two time phases ($I_0$, $I_1$) as well as the intermediate time phase $t$ (expanded into a map $T$) as inputs, which aligns with Equation (1). It generates an image of the intermediate time phase, $\hat{I}_t$, by collaboratively and interactively enhancing the images from the two time phases, learning the intermediate optical flow fields and intermediate features across various scales during this process. The following is a detailed explanation of this network.

2.3.1. Extraction of Feature Pyramids

The encoder $\varepsilon$ contains four sub-encoders $\varepsilon^k$ ($k \in \{1, 2, 3, 4\}$). Each sub-encoder has two convolutional layers and one activation layer; the activation function is the PReLU [33]. The input data are the two time-phase images $I_0$ and $I_1$ of a satellite image sequence, and the purpose is to extract the feature pyramids $\phi_0$ and $\phi_1$, each containing four scales, from the two images. The encoder can be expressed as Equation (2):
$\varepsilon([I_0, I_1]) = \{[\phi_0^1, \phi_1^1], [\phi_0^2, \phi_1^2], [\phi_0^3, \phi_1^3], [\phi_0^4, \phi_1^4]\},$ (2)
where $I_0$ and $I_1$ are instances of the parameters in Equation (1), and $\phi_0^k$, $\phi_1^k$ ($k \in \{1, 2, 3, 4\}$) are the features at four scales extracted from $I_0$ and $I_1$.
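To make the encoder structure concrete, the following PyTorch sketch shows one plausible implementation of such a four-stage pyramid encoder; the channel widths (32, 48, 72, 96) come from Section 2.3.4, while the kernel sizes and strides are assumptions for illustration, not the authors' exact configuration.

import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, stride):
    # Two convolutions per sub-encoder, each followed by PReLU;
    # the first convolution downsamples by the given stride (assumed to be 2).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1), nn.PReLU(out_ch),
        nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1), nn.PReLU(out_ch),
    )

class PyramidEncoder(nn.Module):
    """Shared encoder applied to I0 and I1; returns features at four scales."""
    def __init__(self, in_channels=7, widths=(32, 48, 72, 96)):
        super().__init__()
        chans = [in_channels] + list(widths)
        self.stages = nn.ModuleList(
            [conv_block(chans[k], chans[k + 1], stride=2) for k in range(4)]
        )

    def forward(self, image):
        feats = []
        x = image
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # [phi^1, phi^2, phi^3, phi^4]

The same encoder weights would be applied to both time-phase images, e.g. phi0, phi1 = encoder(I0), encoder(I1), with in_channels=7 for the Landsat-8 bands.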

2.3.2. Joint Refinement of Optical Flow-Intermediate Feature

The decoder $\mathcal{D}$ is composed of four sub-decoders $\mathcal{D}_k$ ($k \in \{1, 2, 3, 4\}$). These sub-decoders iteratively refine the intermediate optical flow by backward warping the pyramidal features $\phi_0^k$ and $\phi_1^k$ through the intermediate optical flow fields $F_{t \rightarrow 0}^k$ and $F_{t \rightarrow 1}^k$. Each sub-decoder also outputs a higher-level reconstructed intermediate feature $\hat{\phi}_t^k$, which compensates for the image information lost during backward warping. The subsequent sub-decoder generates improved intermediate optical flows $F_{t \rightarrow 0}^k$ and $F_{t \rightarrow 1}^k$, which can be more accurately pixel-aligned to produce better $\tilde{\phi}_0^k$ and $\tilde{\phi}_1^k$; this, in turn, enhances the reconstruction of intermediate features at higher levels. The decoder therefore facilitates the simultaneous enhancement of both the intermediate optical flow and the intermediate features, contributing to a more accurate and detailed interpolation of satellite image sequences. The decoders $\mathcal{D}_4$ and $\mathcal{D}_1$ are expressed as Equations (3) and (4), and the decoders $\mathcal{D}_2$ and $\mathcal{D}_3$ as Equation (5):
$F_{t \rightarrow 0}^3, F_{t \rightarrow 1}^3, \hat{\phi}_t^3 = \mathcal{D}_4([\phi_0^4, \phi_1^4, T])$ (3)
$[F_{t \rightarrow 0}, F_{t \rightarrow 1}, M, R] = \mathcal{D}_1(F_{t \rightarrow 0}^1, F_{t \rightarrow 1}^1, \hat{\phi}_t^1, \tilde{\phi}_0^1, \tilde{\phi}_1^1)$ (4)
$F_{t \rightarrow 0}^{k-1}, F_{t \rightarrow 1}^{k-1}, \hat{\phi}_t^{k-1} = \mathcal{D}_k(F_{t \rightarrow 0}^k, F_{t \rightarrow 1}^k, \hat{\phi}_t^k, \tilde{\phi}_0^k, \tilde{\phi}_1^k), \quad k \in \{2, 3\},$ (5)
where $T$ represents single-channel data of size 1 × H × W obtained by extending the intermediate time phase $t$ defined in Equation (1). The warped features $\tilde{\phi}_0^k$ and $\tilde{\phi}_1^k$ are generated by backward warping the pyramidal features $\phi_0^k$ and $\phi_1^k$ through the intermediate optical flow, as governed by Equations (6) and (7):
$\tilde{\phi}_0^k = w(\phi_0^k, F_{t \rightarrow 0}^k), \quad k \in \{1, 2, 3\}$ (6)
$\tilde{\phi}_1^k = w(\phi_1^k, F_{t \rightarrow 1}^k), \quad k \in \{1, 2, 3\},$ (7)
where $w$ means backward warping.
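Backward warping $w$ can be implemented with PyTorch's grid_sample: each pixel of the intermediate-phase feature map samples the source feature map at the position displaced by the flow. The sketch below is a generic reimplementation of this operation under our own conventions (flow channels ordered as horizontal, then vertical), not code from the paper.

import torch
import torch.nn.functional as F

def backward_warp(feat, flow):
    """Sample `feat` (B, C, H, W) at positions displaced by `flow` (B, 2, H, W).

    flow[:, 0] is the horizontal and flow[:, 1] the vertical displacement in pixels.
    """
    b, _, h, w = feat.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat.device, dtype=feat.dtype),
        torch.arange(w, device=feat.device, dtype=feat.dtype),
        indexing="ij",
    )
    x_new = xs.unsqueeze(0) + flow[:, 0]
    y_new = ys.unsqueeze(0) + flow[:, 1]
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid = torch.stack(
        (2.0 * x_new / (w - 1) - 1.0, 2.0 * y_new / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(feat, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

# Equations (6) and (7) then correspond to, e.g.:
# phi0_tilde = backward_warp(phi0_k, F_t0_k)
# phi1_tilde = backward_warp(phi1_k, F_t1_k)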

2.3.3. Synthesis of Intermediate Images

The final-stage decoder $\mathcal{D}_1$ outputs the intermediate bidirectional optical flow fields $F_{t \rightarrow 0}$ and $F_{t \rightarrow 1}$, a mask $M$, and a residual $R$, all of which are the same size as the input image. The mask $M$ is generated by a sigmoid layer, so its values range from 0 to 1; it contains the extracted bidirectional occlusion information. The residual $R$ compensates for details such as regions of the target image that are occluded in the images of the preceding and subsequent time phases, as well as the contours of objects. Both the mask $M$ and the residual $R$ have the same number of channels as the input image. In summary, the final synthesis of the image at the intermediate time phase $t$ is given by Equation (8):
$\hat{I}_t = M \odot \tilde{I}_0 + (1 - M) \odot \tilde{I}_1 + R,$ (8)
where $\tilde{I}_0 = w(I_0, F_{t \rightarrow 0})$, $\tilde{I}_1 = w(I_1, F_{t \rightarrow 1})$, and $\odot$ denotes element-wise multiplication.
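Given the two backward-warped images, Equation (8) reduces to an occlusion-aware blend plus a residual correction, as in the following minimal sketch (the function and argument names are ours):

import torch

def synthesize_intermediate(I0_warped, I1_warped, mask_logits, residual):
    """Blend the two backward-warped time-phase images into the intermediate image.

    mask_logits and residual have the same channel count as the input images;
    the mask is squashed to (0, 1) with a sigmoid, as described in the text.
    """
    M = torch.sigmoid(mask_logits)
    return M * I0_warped + (1.0 - M) * I1_warped + residual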

2.3.4. Details of the Model

The feature pyramid extracted from the image by the encoder comprises four scales with 32, 48, 72, and 96 channels, respectively. Each sub-decoder in the network contains a convolutional layer, a residual block, and an inverse convolutional layer, all followed by a PReLU activation function.
The structure of the residual block is shown in Figure 4. It includes five convolutional layers. Convolutional layers 2 and 4 update only a subset of the channels of the output feature maps from the previous convolutional layer. These updated outputs are then concatenated, along the channel dimension, with the remaining channel feature maps before being input into convolutional layers 3 and 5, respectively.
The residual block adopts the concept of residual connections [34]. The feature maps output from convolutional layer 5 are added to the input feature maps of the residual block, and the final output of the residual block is obtained through a PReLU activation function. This operation improves the network's ability to transmit information, preventing issues such as gradient explosion and vanishing gradients that could halt information transmission; as a result, the network's representational ability is improved and network degradation is prevented. The residual block also incorporates the concept of interleaved convolution [35]. In this approach, convolutional layers 2 and 4 convolve only some of the channels of the outputs of convolutional layers 1 and 3, respectively, and the convolved results are then concatenated with the channels not involved in the convolution. This operation reduces the number of convolution-kernel parameters and the amount of computation, thereby improving the efficiency and speed of the model.
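A sketch of such an interleaved-convolution residual block is given below; since the exact channel split of layers 2 and 4 is only shown graphically in Figure 4, the half-channel split used here is an assumption.

import torch
import torch.nn as nn

class InterleavedResidualBlock(nn.Module):
    """Residual block with five convolutions; layers 2 and 4 update only a
    subset of channels and are concatenated with the untouched channels."""

    def __init__(self, channels, partial=None):
        super().__init__()
        self.partial = partial or channels // 2   # assumed split
        c, p = channels, self.partial
        self.conv1 = nn.Sequential(nn.Conv2d(c, c, 3, 1, 1), nn.PReLU(c))
        self.conv2 = nn.Sequential(nn.Conv2d(p, p, 3, 1, 1), nn.PReLU(p))
        self.conv3 = nn.Sequential(nn.Conv2d(c, c, 3, 1, 1), nn.PReLU(c))
        self.conv4 = nn.Sequential(nn.Conv2d(p, p, 3, 1, 1), nn.PReLU(p))
        self.conv5 = nn.Conv2d(c, c, 3, 1, 1)
        self.act = nn.PReLU(c)

    def forward(self, x):
        p = self.partial
        out = self.conv1(x)
        out = torch.cat([self.conv2(out[:, :p]), out[:, p:]], dim=1)  # interleave
        out = self.conv3(out)
        out = torch.cat([self.conv4(out[:, :p]), out[:, p:]], dim=1)  # interleave
        out = self.conv5(out)
        return self.act(out + x)   # residual connection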

2.4. Loss Functions

The loss function we use is a linear combination of the reconstruction loss $L_r$, the feature space geometric consistency loss $L_g$, and the spectral loss $L_{sam}$, expressed as Equation (9):
$L = \lambda_r L_r + \lambda_g L_g + \lambda_{sam} L_{sam}.$ (9)
We set the weight of each loss term as $\lambda_r = 0.1$, $\lambda_g = 0.01$, and $\lambda_{sam} = 0.5$.

2.4.1. Reconstruction Loss

The reconstruction loss $L_r$ measures the pixel difference between the real image $I_t$ and the interpolated result $\hat{I}_t$; the reconstruction loss used in this paper is given in Equation (10):
$L_r = \rho(\hat{I}_t - I_t) + L_{cen}(\hat{I}_t, I_t),$ (10)
where $\rho(x) = (x^2 + \epsilon^2)^\alpha$ with $\alpha = 0.5$ and $\epsilon = 10^{-3}$ is the Charbonnier loss [36], which is more robust than the Mean Square Error (MSE), avoiding its oversensitivity and handling outliers better, and $L_{cen}$ is the census loss, which computes the soft Hamming distance between census-transformed [37] image patches of size 7 × 7.
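As a reference, the Charbonnier term can be sketched as follows (the census term is omitted here because it requires a separate census transform and soft Hamming distance implementation):

import torch

def charbonnier_loss(pred, target, alpha=0.5, eps=1e-3):
    """Robust Charbonnier penalty rho(x) = (x^2 + eps^2)^alpha, averaged over pixels."""
    diff = pred - target
    return torch.mean((diff * diff + eps * eps) ** alpha)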

2.4.2. Feature Space Geometric Consistency Loss

The pyramid features $\phi_0^k$, $\phi_1^k$ ($k \in \{1, 2, 3, 4\}$) extracted by the encoder $\varepsilon$ are, in a sense, equivalent to the intermediate features $\hat{\phi}_t^k$ reconstructed by the decoders $\mathcal{D}_{k+1}$. Inspired by the local geometric alignment property of the census transform, we first extract features $\phi_t^k$ from the real image $I_t$ using the shared encoder $\varepsilon$ with the same parameters. The census loss $L_{cen}$ [38] is then extended to the multi-scale feature space for progressive supervision, and the soft Hamming distance between census-transformed 3 × 3 blocks of the corresponding feature maps is computed along the channel dimension. This loss can be expressed as Equation (11):
$L_g = \sum_{k=1}^{3} L_{cen}(\hat{\phi}_t^k, \phi_t^k).$ (11)
The pyramidal features extracted by the encoder $\varepsilon$ contain low-level structural information that is useful for frame synthesis, and adding this loss regularizes the reconstructed intermediate features to maintain a better geometric layout.

2.4.3. Spectral Loss

The spectral loss $L_{sam}$ quantifies the degree of spectral similarity between images by calculating the cosine of the angle between the spectral vectors of the interpolated result $\hat{I}_t$ and the real image $I_t$; it is calculated as Equation (12):
$L_{sam} = \frac{1}{NHW} \sum_{n=1}^{N} \sum_{i=1}^{H} \sum_{j=1}^{W} \left| \frac{\langle \hat{y}_{n,i,j}, y_{n,i,j} \rangle}{\|\hat{y}_{n,i,j}\|_2 \, \|y_{n,i,j}\|_2} - 1 \right|,$ (12)
where $\langle \cdot, \cdot \rangle$ denotes the inner product, $\|\cdot\|_2$ is the Euclidean norm, $N$, $H$, and $W$ are the number of channels, height, and width of the image, respectively, and $\hat{y}_{n,i,j}$, $y_{n,i,j}$ are the spectral vectors of the interpolated result and the real image at pixel $(i, j)$, respectively.
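A sketch of a spectral loss of this kind, which penalizes the deviation of the per-pixel spectral cosine similarity from 1, is shown below; it is a generic reimplementation rather than the authors' code.

import torch
import torch.nn.functional as F

def spectral_loss(pred, target, eps=1e-8):
    """Penalize deviation of the per-pixel spectral cosine similarity from 1.

    pred, target: (B, C, H, W) tensors whose channel dimension holds the spectrum.
    """
    cos = F.cosine_similarity(pred, target, dim=1, eps=eps)  # (B, H, W)
    return torch.mean(1.0 - cos)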

3. Experiments and Results

3.1. Experiments Strategy

As shown in Table 3 and Table 4, we set up four sets of experiments each for the Landsat-8 and Sentinel-2 satellite datasets. Each set of experimental data contains two time series for the Landsat-8 satellite and one time series for the Sentinel-2 satellite, and each time series has five time-phase images. In each group of experiments, the images of phase 1 and phase 5, an intermediate time phase $t$ ($t \in \{0.25, 0.5, 0.75\}$), and the image of that intermediate time phase are used as network inputs to train the network and estimate the intermediate images of phases 2, 3, and 4. The output results are compared with the reference images. The detailed training process is shown in Algorithm 1.
Algorithm 1 Training process.
  • // For each set of experiments
  • for each $i \in [1, 8]$ do
  •     Input Image 1: image of the first time phase $I_0^i$
  •     Input Image 2: image of the last time phase $I_1^i$
  •     Input Time Phase: the intermediate time phase $t$ ($t \in \{0.25, 0.5, 0.75\}$)
  •     Input Reference Image: image of the intermediate time phase $I_t^i$
  •     train($I_0^i$, $I_1^i$, $I_t^i$):
  •     initialize the weights randomly
  •     $\hat{I}_t^i = f(I_0^i, I_1^i, t)$
  •     $L^i = \lambda_r L_r^i(I_t^i, \hat{I}_t^i) + \lambda_g L_g^i(I_t^i, \hat{I}_t^i) + \lambda_{sam} L_{sam}^i(I_t^i, \hat{I}_t^i)$
  •     save the weights as $W^i$
  • end for
Due to limited computer memory, the images need to be trained in blocks. Each scene is divided into a collection of blocks of size 256 × 256; there are 400 blocks per image for the Landsat-8 data and 1764 blocks per image for the Sentinel-2 data. The division strategy for the two satellite datasets is shown in Figure 5: the red boxes are the validation samples, the blue boxes are the testing samples, and the remainder are the training samples, with a ratio of 1:1:8 among the three types of samples. In addition, to make network training more stable and to accelerate convergence, the data are preprocessed before entering the network. The range of the Landsat-8 and Sentinel-2 data used in this experiment is [0, 65,535], so the input data are normalized as Equation (13):
$x' = \frac{x - \min(x)}{\max(x) - \min(x)},$ (13)
where $\min(x)$ is 0 and $\max(x)$ is 65,535.
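The block division and the fixed-range normalization are straightforward; a sketch using the tile size (256 × 256) and data range ([0, 65,535]) stated above is:

import numpy as np

def normalize(image, max_value=65535.0):
    """Map raw digital numbers from [0, 65535] to [0, 1] (Equation (13) with min = 0)."""
    return image.astype(np.float32) / max_value

def split_into_blocks(image, block=256):
    """Cut an (H, W, C) scene into non-overlapping block x block tiles."""
    h, w = image.shape[:2]
    tiles = []
    for top in range(0, h - block + 1, block):
        for left in range(0, w - block + 1, block):
            tiles.append(image[top:top + block, left:left + block])
    return tiles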
In addition, the experiment also uses a data augmentation strategy: images are randomly flipped horizontally or vertically, and the image sequence is randomly reversed, to improve the robustness and generalization ability of the model. To improve convergence, we train the network using the adaptive moment estimation (Adam) [39] optimization method and a learning rate decay strategy with cosine annealing [40].
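A sketch of this training setup with Adam and cosine annealing is given below; the initial learning rate and number of epochs are assumptions for illustration, and loss_fn stands for the weighted sum of Equation (9).

import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

def train(model, train_loader, loss_fn, num_epochs=300, lr=1e-4):
    """Train with Adam and a cosine-annealed learning rate (hyperparameters assumed).

    loss_fn(model, I0, I1, It, t) should return the weighted loss of Equation (9).
    """
    optimizer = Adam(model.parameters(), lr=lr)
    scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs)
    for _ in range(num_epochs):
        for I0, I1, It, t in train_loader:
            loss = loss_fn(model, I0, I1, It, t)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()   # decay the learning rate along a cosine curve
    return model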
The experiment is implemented in Python, and the proposed network model is built with the PyTorch deep learning library. An NVIDIA GeForce RTX 3060 Ti GPU is used for training.

3.2. Evaluation Indicator

We evaluate the experimental results using three metrics: Root Mean Square Error (RMSE) [41], Structural Similarity (SSIM) [42], and Peak Signal-to-Noise Ratio (PSNR) [42].
RMSE reflects the pixel error between the interpolated result $\hat{I}_t$ and the true image $I_t$, and is calculated as in Equation (14):
$RMSE = \sqrt{\frac{1}{W \cdot H} \left\| \hat{I}_t - I_t \right\|^2},$ (14)
where $W$ and $H$ are the width and height of the image. The range of the RMSE is related to the range of the pixel values of the image; the smaller the value, the closer the interpolated result is to the real image.
SSIM is based on the visual characteristics of the human eye and measures the similarity between the interpolated image and the reference image in three respects: luminance $L(X, Y)$, contrast $C(X, Y)$, and structure $S(X, Y)$. It is calculated as in Equation (15):
$SSIM(X, Y) = L(X, Y) \cdot C(X, Y) \cdot S(X, Y),$ (15)
where $L(X, Y) = \frac{2\mu_X \mu_Y + C_1}{\mu_X^2 + \mu_Y^2 + C_1}$, $C(X, Y) = \frac{2\sigma_X \sigma_Y + C_2}{\sigma_X^2 + \sigma_Y^2 + C_2}$, and $S(X, Y) = \frac{\sigma_{XY} + C_3}{\sigma_X \sigma_Y + C_3}$; $\mu_X$ and $\sigma_X$ are the mean and standard deviation of the experimental result $X$; $\mu_Y$ and $\sigma_Y$ are the mean and standard deviation of the reference image $Y$; and $\sigma_{XY}$ is the covariance between the experimental result $X$ and the reference image $Y$. $C_1$, $C_2$, and $C_3$ are constants that prevent the denominator from becoming zero, usually set as $C_1 = (K_1 \cdot L)^2$, $C_2 = (K_2 \cdot L)^2$, $C_3 = C_2 / 2$, with $K_1 = 0.01$, $K_2 = 0.03$, and $L = 255$. The value of SSIM ranges from 0 to 1; the larger the value, the more similar the interpolated image and the reference image are, and vice versa.
PSNR is the ratio between the maximum possible power of a signal and the power of the noise that affects the fidelity of its representation, measured in dB; it is calculated as in Equation (16):
$PSNR = 10 \cdot \log_{10}\left(\frac{MAX_I^2}{MSE}\right),$ (16)
where $MAX_I$ is the maximum pixel value, which is 255 for 8-bit images, and $MSE = \frac{1}{W \cdot H} \left\| \hat{I}_t - I_t \right\|^2$ is the mean square error between the interpolated image and the reference image. The higher the PSNR, the better the interpolation result.
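Minimal NumPy sketches of the RMSE and PSNR computations follow (SSIM is commonly computed with an existing implementation such as skimage.metrics.structural_similarity):

import numpy as np

def rmse(pred, ref):
    """Root Mean Square Error between interpolated and reference images (Equation (14))."""
    diff = pred.astype(np.float64) - ref.astype(np.float64)
    return np.sqrt(np.mean(diff ** 2))

def psnr(pred, ref, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB (Equation (16))."""
    mse = np.mean((pred.astype(np.float64) - ref.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_value ** 2 / mse)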

3.3. Results

We tested the trained model on the two satellite datasets, Landsat-8 and Sentinel-2, and evaluated the results both visually and quantitatively. Due to the large number of experiments, we selected one experiment for each satellite, Experiment 2 for Landsat-8 and Experiment 2 for Sentinel-2, and present the visual evaluation of the results of these two experiments. The quantitative evaluation results of all experiments are presented in the comparative experiments section.

3.3.1. Evaluation of Pixel Error

We conducted pixel error analysis on the interpolation results. We linearly mapped the pixel value range from [0, 65,535] to [0, 255]. Subsequently, we computed the pixel error between the interpolated result and the actual image. A pixel error range of [0, 10] indicates low error, (10, 80] indicates medium error, and (80, 255] indicates high error.
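A sketch of how the absolute pixel errors can be binned into these three ranges to produce the error maps (our own helper, not the authors' plotting code):

import numpy as np

def error_map(pred, ref):
    """Classify absolute pixel errors (after mapping to [0, 255]) into three bins:
    0 = low (0-10), 1 = medium (10-80), 2 = high (80-255)."""
    err = np.abs(pred.astype(np.float64) - ref.astype(np.float64))
    bins = np.zeros(err.shape, dtype=np.uint8)
    bins[(err > 10) & (err <= 80)] = 1
    bins[err > 80] = 2
    return bins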
The visual effect maps of the actual images from the three time phases of Experiment 2 for the Landsat-8 and Sentinel-2 test datasets, the visual effect maps of the interpolated results, and the pixel error maps are depicted in Figure 6 and Figure 7. These figures illustrate that the spatial features of the test results from the three time phases of the proposed method across different datasets align closely with those of the corresponding actual images. Most of the pixel errors are concentrated in the range of 0 to 10, with a minor proportion of pixel errors falling in the range of 10 to 80, and very few pixel errors in the range of 80 to 255. This outcome underscores the robust generalization capability of our method.

3.3.2. Evaluation of Point Spectrum

To showcase the predictive capability of our method with respect to feature spectra, we conducted tests on point spectra. Specifically, to mitigate the influence of chance, we randomly selected 100 test points from each of the two test datasets. We then computed the average of these points in each spectral channel for comparison with the actual values.
The point spectral maps of the results from Experiment 2 for the Landsat-8 and Sentinel-2 test datasets are depicted in Figure 8 and Figure 9. These maps demonstrate that the point spectral features of our method’s interpolated results across the three time phases in different datasets align closely with those of the corresponding actual images. The interpolation results are within a range of 4 from the actual image in each spectral channel for the Landsat-8 test dataset and within 0.1 for the Sentinel-2 test dataset. This outcome indirectly validates the effectiveness of our method in applying the spectral loss function.

3.3.3. Comparative Experiments

To substantiate the superior performance of our method over existing approaches, we selected the linear method (Linear) and three deep learning methods, Unet [31], Super SloMo [30], and IFRNet [32], for comparative analysis. The Linear method is frequently employed as a baseline in comparative tests of sequential image interpolation, while the other three methods have recently demonstrated good performance in sequence image interpolation.
Table 5 and Table 6 present the quantitative evaluations of the five methods across three metrics (RMSE, PSNR, and SSIM) for the three time phases. On the test datasets of Landsat-8 and Sentinel-2, our method outperforms the others on these indicators; in particular, it achieves a 7.54% lower RMSE than the other approaches. Moreover, on the Landsat-8 dataset, ours is the only method that surpasses IFRNet, and on the Sentinel-2 dataset, ours is the only method that outperforms both IFRNet and Unet.
The pixel error maps of the comparative experiments from Experiment 2 on the Landsat-8 and Sentinel-2 satellite datasets are depicted in Figure 10 and Figure 11, respectively. These figures reveal that, on the test datasets, the pixel error between our method's interpolation results and the actual image across all three time phases is lower than that of the other four methods. Both the Linear and Super SloMo methods exhibit substantial pixel errors. Moreover, the other methods display large errors in image areas with extensive vegetation coverage, as these areas undergo more changes over time. In contrast, our method yields lower pixel errors because it incorporates optical flow learning and prediction, enabling it to accurately capture object changes and thus precisely predict pixel positions.
Similarly, we compared the interpolated results of these five methods with the actual images based on the spectral values at 100 random points from Experiment 2 on the Landsat-8 and Sentinel-2 satellite datasets. The results are shown in Figure 12 and Figure 13, respectively. They indicate that, on the test datasets of the two time series datasets, the spectral features of our method's interpolated results across the three time phases at these points are closer to the actual spectral features than those of the other four methods.

4. Discussion

To illustrate the efficacy of incorporating optical flow learning in our deep learning method, we selected a 256 × 256 block from each of the two test datasets to display the optical flows and interpolation results for the three intermediate time phases. As depicted in Figure 14 and Figure 15, our model accurately predicts the intermediate optical flow by jointly refining the intermediate optical flow and the intermediate features at multiple scales. The bidirectional optical flows are able to capture the changes in object pixels between the images from the two input time phases. Moreover, our method predicts images that are visually indistinguishable from the actual images, owing to the multi-channel occlusion information mask and the multi-channel detail compensation residuals.
Furthermore, we provide a detailed analysis of why our method surpasses the other four techniques. The Linear method focuses solely on the pixel values to be interpolated in the images from the preceding and subsequent time phases, without taking into account the spatial and spectral features of the images. Unet considers the spatial features of the images from the preceding and subsequent time phases, but neglects the spectral features and the pixel correspondences in the temporal images (optical flow). Both Super SloMo and IFRNet take into account the spatial features of the images and the pixel correspondences in the temporal images. However, Super SloMo separates the refinement of the optical and spatial features into two phases, resulting in a disconnection between the two. IFRNet performs a joint refinement of the optical flow and the spatial features, but its interpolation effect is superior for RGB images and inferior for multispectral satellite images. In contrast, our proposed method excels at handling multispectral satellite imagery by extending the number of detail-compensated residual channels to match the number of spectra and implementing a spectral loss function to enhance the supervision of spectral learning.

5. Conclusions

Inspired by video frame interpolation, this paper proposed a method for satellite sequence image interpolation. This method leverages a multi-layer convolutional network to learn the intermediate optical flow and intermediate features of the image. It predicts the precise positions of pixels and complex spatial features, ultimately yielding a highly accurate image for the intermediate time phase. The key conclusions of this paper can be summarized as follows:
(1)
Our proposed method predicts images for intermediate time phases based on optical flow and intermediate features. It excels at capturing the intermediate optical flow and integrating the intermediate features to produce superior interpolated images. Our research offers a novel paradigm for interpolating satellite image sequences of complex objects.
(2)
Thanks to the multi-channel mask structure for occlusion information and the spectral loss function in our model, our approach yields lower interpolation errors across all four seasons: spring, summer, autumn, and winter. Specifically, our method's results on the test dataset exhibit a 7.54% lower RMSE than other approaches.
In future work, we aim to explore the utilization of information from images in multiple time phases of a time series to guide the generation of images in intermediate temporal phases, rather than relying solely on information from the images of the preceding and subsequent time phases of the image to be interpolated.

Author Contributions

Conceptualization, Z.Z. and P.T.; methodology, K.S.; software, K.S. and Z.-Q.L.; validation, K.S.; formal analysis, K.S.; investigation, K.S.; resources, K.S. and W.Z.; data curation, K.S.; writing—original draft preparation, K.S.; writing—review and editing, K.S., Z.Z. and P.T.; visualization, K.S.; supervision, Z.Z., P.T., Z.-Q.L. and W.Z.; funding acquisition, Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (No. 2019YFE0197800), the Youth Innovation Promotion Association, CAS (No. 2022127), and the “Future Star” Talent Plan of the Aerospace Information Research Institute, CAS (No. 2020KTYWLZX03 and No. 2021KTYWLZX07).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Guo, H.; Wang, L.; Liang, D. Big Earth Data from space: A new engine for Earth science. Sci. Bull. 2016, 61, 505–513. [Google Scholar] [CrossRef]
  2. Vatsavai, R.R.; Ganguly, A.; Chandola, V.; Stefanidis, A.; Klasky, S.; Shekhar, S. Spatiotemporal data mining in the era of big spatial data: Algorithms and applications. In Proceedings of the 1st ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data, Redondo Beach, CA, USA, 6 November 2012; pp. 1–10. [Google Scholar]
  3. Salmon, B.P.; Olivier, J.C.; Wessels, K.J.; Kleynhans, W.; van den Bergh, F.; Steenkamp, K.C. Unsupervised Land Cover Change Detection: Meaningful Sequential Time Series Analysis. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2011, 4, 327–335. [Google Scholar] [CrossRef]
  4. Chen, L.; Jin, Z.; Michishita, R.; Cai, J.; Yue, T.; Chen, B.; Xu, B. Dynamic monitoring of wetland cover changes using time-series remote sensing imagery. Ecol. Inform. 2014, 24, 17–26. [Google Scholar] [CrossRef]
  5. Li, Q.; Wang, C.; Zhang, B.; Lu, L. Object-Based Crop Classification with Landsat-MODIS Enhanced Time-Series Data. Remote Sens. 2015, 7, 16091–16107. [Google Scholar] [CrossRef]
  6. Li, M.; Zhang, L.; Ding, C.; Li, W.; Luo, H.; Liao, M.; Xu, Q. Retrieval of historical surface displacements of the Baige landslide from time-series SAR observations for retrospective analysis of the collapse event. Remote Sens. Environ. 2020, 240, 111695. [Google Scholar] [CrossRef]
  7. Marghany, M. Nonlinear Ocean Dynamics: Synthetic Aperture Radar; Elsevier: Amsterdam, The Netherlands, 2021. [Google Scholar]
  8. Shen, H.; Li, X.; Cheng, Q.; Zeng, C.; Yang, G.; Li, H.; Zhang, L. Missing information reconstruction of remote sensing data: A technical review. IEEE Geosci. Remote Sens. Mag. 2015, 3, 61–85. [Google Scholar] [CrossRef]
  9. Li, R.; Zhang, X.; Liu, B. Review on Filtering and Reconstruction Algorithms of Remote Sensing Time Series Data. J. Remote Sens. 2009, 13, 335–341. [Google Scholar]
  10. Marghany, M. Remote Sensing and Image Processing in Mineralogy; CRC Press: Boca Raton, FL, USA, 2022. [Google Scholar]
  11. Vandal, T.J.; Nemani, R.R. Temporal interpolation of geostationary satellite imagery with optical flow. IEEE Trans. Neural Netw. Learn. Syst. 2021, 34, 3245–3254. [Google Scholar] [CrossRef]
  12. Dosovitskiy, A.; Brox, T. Generating images with perceptual similarity metrics based on deep networks. In Proceedings of the Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, Barcelona, Spain, 5–10 December 2016; Volume 29. [Google Scholar]
  13. Raya, S.P.; Udupa, J.K. Shape-based interpolation of multidimensional objects. IEEE Trans. Med. Imaging 1990, 9, 32–42. [Google Scholar] [CrossRef]
  14. Li, J.; Hong, D.; Gao, L.; Yao, J.; Zheng, K.; Zhang, B.; Chanussot, J. Deep learning in multimodal remote sensing data fusion: A comprehensive review. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102926. [Google Scholar] [CrossRef]
  15. Vivone, G.; Dalla Mura, M.; Garzelli, A.; Restaino, R.; Scarpa, G.; Ulfarsson, M.O.; Alparone, L.; Chanussot, J. A New Benchmark Based on Recent Advances in Multispectral Pansharpening: Revisiting Pansharpening with Classical and Emerging Pansharpening Methods. IEEE Geosci. Remote Sens. Mag. 2021, 9, 53–81. [Google Scholar] [CrossRef]
  16. Zhang, W.; Tang, P.; Zhao, L. Remote Sensing Image Scene Classification Using CNN-CapsNet. Remote Sens. 2019, 11, 494. [Google Scholar] [CrossRef]
  17. Bazi, Y.; Bashmal, L.; Rahhal, M.M.A.; Dayil, R.A.; Ajlan, N.A. Vision Transformers for Remote Sensing Image Classification. Remote Sens. 2021, 13, 516. [Google Scholar] [CrossRef]
  18. Ma, W.; Li, N.; Zhu, H.; Jiao, L.; Tang, X.; Guo, Y.; Hou, B. Feature Split-Merge-Enhancement Network for Remote Sensing Object Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5616217. [Google Scholar] [CrossRef]
  19. Li, X.; Deng, J.; Fang, Y. Few-Shot Object Detection on Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5601614. [Google Scholar] [CrossRef]
  20. Yu, D.; Ji, S. A New Spatial-Oriented Object Detection Framework for Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4407416. [Google Scholar] [CrossRef]
  21. Li, T.; Gu, Y. Progressive Spatial–Spectral Joint Network for Hyperspectral Image Reconstruction. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5507414. [Google Scholar] [CrossRef]
  22. Sun, X.; Wang, P.; Lu, W.; Zhu, Z.; Lu, X.; He, Q.; Li, J.; Rong, X.; Yang, Z.; Chang, H.; et al. RingMo: A Remote Sensing Foundation Model with Masked Image Modeling. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5612822. [Google Scholar] [CrossRef]
  23. Yao, F.; Lu, W.; Yang, H.; Xu, L.; Liu, C.; Hu, L.; Yu, H.; Liu, N.; Deng, C.; Tang, D.; et al. RingMo-Sense: Remote Sensing Foundation Model for Spatiotemporal Prediction via Spatiotemporal Evolution Disentangling. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5620821. [Google Scholar] [CrossRef]
  24. Niklaus, S.; Mai, L.; Liu, F. Video frame interpolation via adaptive convolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 670–679. [Google Scholar]
  25. Jin, X.; Tang, P.; Houet, T.; Corpetti, T.; Alvarez-Vanhard, E.G.; Zhang, Z. Sequence image interpolation via separable convolution network. Remote Sens. 2021, 13, 296. [Google Scholar] [CrossRef]
  26. Peleg, T.; Szekely, P.; Sabo, D.; Sendik, O. Im-net for high resolution video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2398–2407. [Google Scholar]
  27. Cassisa, C.; Simoens, S.; Prinet, V.; Shao, L. Sub-grid physical optical flow for remote sensing of sandstorm. In Proceedings of the 2010 IEEE International Geoscience and Remote Sensing Symposium, IEEE, Honolulu, HI, USA, 25–30 July 2010; pp. 2230–2233. [Google Scholar]
  28. Deng, C.; Cao, Z.; Fang, Z.; Yu, Z. Ship detection from optical satellite image using optical flow and saliency. In Proceedings of the MIPPR 2013: Remote Sensing Image Processing, Geographic Information Systems, and Other Applications, SPIE, Wuhan, China, 26–27 October 2013; Volume 8921, pp. 100–106. [Google Scholar]
  29. Liu, Z.; Yeh, R.A.; Tang, X.; Liu, Y.; Agarwala, A. Video frame synthesis using deep voxel flow. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4463–4471. [Google Scholar]
  30. Jiang, H.; Sun, D.; Jampani, V.; Yang, M.H.; Learned-Miller, E.; Kautz, J. Super slomo: High quality estimation of multiple intermediate frames for video interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 9000–9008. [Google Scholar]
  31. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  32. Kong, L.; Jiang, B.; Luo, D.; Chu, W.; Huang, X.; Tai, Y.; Wang, C.; Yang, J. Ifrnet: Intermediate feature refine network for efficient frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1969–1978. [Google Scholar]
  33. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034. [Google Scholar]
  34. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  35. Zhang, T.; Qi, G.J.; Xiao, B.; Wang, J. Interleaved Group Convolutions. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 4383–4392. [Google Scholar] [CrossRef]
  36. Charbonnier, P.; Blanc-Feraud, L.; Aubert, G.; Barlaud, M. Two deterministic half-quadratic regularization algorithms for computed imaging. In Proceedings of the 1st International Conference on Image Processing, IEEE, Austin, TX, USA, 13–16 November 1994; Volume 2, pp. 168–172. [Google Scholar]
  37. Grevera, G.J.; Udupa, J.K. Shape-based interpolation of multidimensional grey-level images. IEEE Trans. Med. Imaging 1996, 15, 881–892. [Google Scholar] [CrossRef] [PubMed]
  38. Luo, K.; Wang, C.; Liu, S.; Fan, H.; Wang, J.; Sun, J. Upflow: Upsampling pyramid for unsupervised optical flow learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1045–1054. [Google Scholar]
  39. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  40. Loshchilov, I.; Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar]
  41. Huang, X.; Shi, J.; Yang, J.; Yao, J. Evaluation of color image quality based on mean square error and peak signal-to-noise ratio of color difference. Acta Photonica Sin. 2007, 36, 295–298. [Google Scholar]
  42. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
Figure 1. Three instances of remote sensing data that are missing: (a) displays a MODIS instrument failure on the Aqua satellite, which caused about 70% of the data in Aqua MODIS band 6 to be lost; (b) displays a Landsat-7 satellite digital product, where the image had pixel failure due to a malfunctioning Landsat-7 ETM+ on-board scanning line corrector; and (c) displays a Landsat-8 OLI_TIRS satellite digital product that is useless for direct use due to its 87.3% cloud coverage.
Figure 2. Examples of Landsat-8 and Sentinel-2 satellite time series. There are two time series examples for Landsat-8 (a) and Sentinel-2 (b), respectively, with a linear stretch of 2% and a true color synthesis.
Figure 3. The architecture of the Multi-scale Optical Flow-Intermediate Feature joint Network. The network begins with the encoder ε , which extracts the four-scale features from the two time-phase images, I 0 and I 1 . Subsequently, the intermediate optical flows and features are collectively refined by four decoders D k ( k { 1 , 2 , 3 , 4 } ) in an ascending order of scale. The time phase T is input into the decoder D 4 . The decoder D 1 outputs the bidirectional optical flows, a multi-channel occlusion information mask, and multi-channel detail compensation information residuals, all of which are equal in size to the original image. Finally, these outputs are synthesized with the images from the preceding and subsequent time phases to generate the image at the intermediate time phase T.
Figure 4. Detailed structure of the residual block used by each sub-decoder. The residual block (b) uses a residual connection (a) as a whole. There are five convolution layers within the residual block, which uses an interleaved convolution (c) at convolution layer 2 and convolution layer 4.
Figure 5. Distribution of training, testing, and validation samples in our experiment: (a) Landsat-8 and (b) Sentinel-2 images; areas inside blue and red boxes and the remainder of the images show the distribution of testing, validation, and training samples, and training set:validation set:test set = 8:1:1.
Figure 6. Pixel error map for Experiment 2 of multi-temporal sequence image interpolation for the Landsat-8 satellite dataset. (a,b) are the results of two time series from Experiment 2 of the Landsat-8 satellite dataset. Time phase 1, 2 and 3 represent the experimental results for the middle three time phases t = 0.25, 0.50, and 0.75. I t , I ^ t , and I t I ^ t represent the real image, the interpolation result, and the pixel error between the two. In the pixel error map, grey, green, and blue represent the absolute value of the pixel difference in the ranges 0–10, 10–80, and 80–255, with larger values representing larger errors.
Figure 7. Pixel error map for Experiment 2 of multi-temporal sequence image interpolation for the Sentinel-2 satellite dataset. Time phases 1, 2, and 3 represent the experimental results for the middle three time phases t = 0.25, 0.50, and 0.75. I t , I ^ t , and I t I ^ t represent the real image, the interpolation result, and the pixel error between the two. In the pixel error map, grey, green, and blue represent the absolute value of the pixel difference in the ranges 0–10, 10–80, and 80–255, with larger values representing larger errors.
Figure 8. Spectral curves map at 100 random points for Experiment 2 of multi-temporal sequence image interpolation for the Landsat-8 satellite dataset. (a,b) are the results of two time series from Experiment 2 of the Landsat-8 satellite dataset. Time phases 1, 2, and 3 represent the experimental results for the middle three time phases t = 0.25, 0.50, and 0.75. For each time phase image, we plotted the average of the spectra of the real image and the interpolated result at 100 random points (True and Ours), and the absolute value of the difference between the two spectra.
Figure 9. Spectral curves map at 100 random points for Experiment 2 of multi-temporal sequence image interpolation for the Sentinel-2 satellite dataset. Time phases 1, 2, and 3 represent the experimental results for the middle three time phases t = 0.25, 0.50, and 0.75. For each time phase image, we plotted the average of the spectra of the real image and the interpolated result at 100 random points (True and Ours), and the absolute value of the difference between the two spectra.
Figure 10. Pixel error map for comparative experiments of Experiment 2 for multi-temporal sequence image interpolation on the Landsat-8 satellite dataset. (A,B) are the results of two time series from Experiment 2 of the Landsat-8 satellite dataset. Time phases 1, 2, and 3 represent the experimental results for the middle three time phases t = 0.25, 0.50, and 0.75. In the pixel error map, grey, green, and blue represent the absolute value of the pixel difference in the ranges 0–10, 10–80, and 80–255, with larger values representing larger errors.
Figure 11. Pixel error map for comparative experiments of Experiment 2 for multi-temporal sequence image interpolation on the Sentinel-2 satellite dataset. Time phases 1, 2, and 3 represent the experimental results for the middle three time phases t = 0.25, 0.50, and 0.75. In the pixel error map, grey, green, and blue represent the absolute value of the pixel difference in the ranges 0–10, 10–80, and 80–255, with larger values representing larger errors.
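The pixel error maps in Figures 10 and 11 assign each pixel to one of three classes according to the absolute difference between the interpolated and real values. A minimal sketch of this colour coding is shown below, assuming 8-bit single-band inputs; the exact grey, green, and blue shades used in the published figures are not specified, so the RGB triplets here are placeholders.

```python
import numpy as np

def pixel_error_map(pred, true):
    """Colour-code |pred - true| into three ranges: 0-10 grey, 10-80 green, 80-255 blue."""
    diff = np.abs(pred.astype(np.int16) - true.astype(np.int16))

    error_map = np.zeros(diff.shape + (3,), dtype=np.uint8)
    error_map[diff <= 10] = (200, 200, 200)               # small error: grey
    error_map[(diff > 10) & (diff <= 80)] = (0, 170, 0)   # medium error: green
    error_map[diff > 80] = (0, 0, 255)                    # large error: blue
    return error_map
```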
Figure 12. Spectral curves map at 100 random points for comparative experiments of Experiment 2 for multi-temporal sequence image interpolation on the Landsat-8 satellite dataset. (a,b) are the results of two time series from Experiment 2 of the Landsat-8 satellite dataset. For each time phase image, we plotted the average of the spectra of the real image and the interpolated results of the five methods at 100 random points.
Figure 13. Spectral curves map at 100 random points for comparative experiments of Experiment 2 for multi-temporal sequence image interpolation on the Sentinel-2 satellite dataset. For each time phase image, we plotted the average of the spectra of the real image and the interpolated results of the five methods at 100 random points.
Figure 14. Interpolation results and optical flow visualization of a Landsat-8 image block. The first row displays the images of the previous time phase and the next time phase; the second and third rows show the results for the three intermediate time phases between these two images.
Figure 15. Interpolation results and optical flow visualization of a Sentinel-2 image block. The first row displays the images of the previous time phase and the next time phase; the second and third rows show the results for the three intermediate time phases between these two images.
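Figures 14 and 15 render the estimated optical flow as colour images. The paper does not state the exact colour mapping, but a common convention encodes flow direction as hue and flow magnitude as brightness; the OpenCV-based sketch below illustrates that convention and should be read as an assumption rather than the authors' visualization code.

```python
import cv2
import numpy as np

def flow_to_colour(flow):
    """Render an (H, W, 2) optical flow field: direction -> hue, magnitude -> value."""
    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1], angleInDegrees=True)

    hsv = np.zeros(flow.shape[:2] + (3,), dtype=np.uint8)
    hsv[..., 0] = (angle / 2).astype(np.uint8)  # OpenCV hue range is 0-179
    hsv[..., 1] = 255
    hsv[..., 2] = cv2.normalize(magnitude, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```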
Table 1. The details of the Landsat-8 satellite dataset. This table contains the path and row, the acquisition date, the latitude and longitude of the image center point, and the number of phases included for each Landsat-8 satellite time series.
| Path, Row | Date | Center Latitude | Center Longitude | Phase Number |
|---|---|---|---|---|
| 91, 78 | 21 June 2022; 7 July 2022; 23 July 2022 | 25°59′42″S | 150°32′35″E | 5 |
| 91, 79 | 8 August 2022; 24 August 2022 | 27°26′2″S | 150°11′2″E | 5 |
| 91, 84 | 20 January 2019; 5 February 2019; 21 February 2019; 9 March 2019; 24 March 2019 | 34°36′50″S | 148°16′34″E | 5 |
| 92, 85 | 20 January 2022; 5 February 2022; 21 February 2022; 9 March 2022; 24 March 2022 | 36°2′46″S | 147°51′58″E | 5 |
| 99, 73 | 31 March 2018; 16 April 2018; 2 May 2018; 18 May 2018; 3 June 2018 | 18°47′24″S | 139°53′38″E | 10 |
| 99, 74 | 7 September 2018; 23 September 2018; 9 October 2018; 25 October 2018; 1 November 2018 | 20°13′55″S | 139°33′40″E | 10 |
Table 2. The details of the Sentinel-2 satellite dataset. This table contains the tile ID, the acquisition date, the latitude and longitude of the image center point, and the number of phases included for each Sentinel-2 satellite time series.
| Tile ID | Date | Center Latitude | Center Longitude | Phase Number |
|---|---|---|---|---|
| 18STF | 11 January 2019; 16 January 2019; 21 January 2019; 26 January 2019; 31 January 2019 | 36°31′12″N | 77°44′14″W | 5 |
| 31TDN | 26 March 2020; 31 March 2020; 5 April 2020; 10 April 2020; 15 April 2020 | 47°21′29″N | 2°24′8″E | 5 |
| 30TXT | 7 July 2020; 12 July 2020; 17 July 2020; 22 July 2020; 27 July 2020 | 47°20′28″N | 0°56′57″W | 5 |
| 35TNM | 28 September 2018; 3 October 2018; 8 October 2018; 13 October 2018; 18 October 2018 | 46°27′26″N | 27°42′52″E | 5 |
Table 3. Setup of multi-temporal image interpolation experiments. We set up four sets of experiments using the Landsat-8 satellite dataset, and in each set of experiments the inputs are Input Image 1 and Input Image 2, as well as the intermediate time phase t (Input Time Phase). The interpolated results are compared with the Reference Image (the real image).
| Experiment ID | Path, Row | Input Image 1 | Input Image 2 | Reference Image | Input Time Phase |
|---|---|---|---|---|---|
| 1 | 91, 84 | 20 January 2019 | 24 March 2019 | 5 February 2019 | 0.25 |
|  |  |  |  | 21 February 2019 | 0.5 |
|  |  |  |  | 9 March 2019 | 0.75 |
|  | 92, 85 | 20 January 2022 | 24 March 2022 | 5 February 2022 | 0.25 |
|  |  |  |  | 21 February 2022 | 0.5 |
|  |  |  |  | 9 March 2022 | 0.75 |
| 2 | 99, 73 | 31 March 2018 | 3 June 2018 | 16 April 2018 | 0.25 |
|  |  |  |  | 2 May 2018 | 0.5 |
|  |  |  |  | 18 May 2018 | 0.75 |
|  | 99, 74 | 31 March 2018 | 3 June 2018 | 16 April 2018 | 0.25 |
|  |  |  |  | 2 May 2018 | 0.5 |
|  |  |  |  | 18 May 2018 | 0.75 |
| 3 | 91, 78 | 21 June 2022 | 24 August 2022 | 7 July 2022 | 0.25 |
|  |  |  |  | 23 July 2022 | 0.5 |
|  |  |  |  | 8 August 2022 | 0.75 |
|  | 91, 79 | 21 June 2022 | 24 August 2022 | 7 July 2022 | 0.25 |
|  |  |  |  | 23 July 2022 | 0.5 |
|  |  |  |  | 8 August 2022 | 0.75 |
| 4 | 99, 73 | 7 September 2018 | 1 November 2018 | 23 September 2018 | 0.25 |
|  |  |  |  | 9 October 2018 | 0.5 |
|  |  |  |  | 25 October 2018 | 0.75 |
|  | 99, 74 | 7 September 2018 | 1 November 2018 | 23 September 2018 | 0.25 |
|  |  |  |  | 9 October 2018 | 0.5 |
|  |  |  |  | 25 October 2018 | 0.75 |
Table 4. Setup of multi-temporal image interpolation experiments. We set up four sets of experiments using the Sentinel-2 satellite dataset, and in each set of experiments the inputs are Input Image 1 and Input Image 2, as well as the intermediate time phase t (Input Time Phase). The interpolated results are compared with the Reference Image (the real image).
| Experiment ID | Tile ID | Input Image 1 | Input Image 2 | Reference Image | Input Time Phase |
|---|---|---|---|---|---|
| 1 | 18STF | 11 January 2019 | 31 January 2019 | 16 January 2019 | 0.25 |
|  |  |  |  | 21 January 2019 | 0.5 |
|  |  |  |  | 26 January 2019 | 0.75 |
| 2 | 31TDN | 26 March 2020 | 15 April 2020 | 31 March 2020 | 0.25 |
|  |  |  |  | 5 April 2020 | 0.5 |
|  |  |  |  | 10 April 2020 | 0.75 |
| 3 | 30TXT | 7 July 2020 | 27 July 2020 | 12 July 2020 | 0.25 |
|  |  |  |  | 17 July 2020 | 0.5 |
|  |  |  |  | 22 July 2020 | 0.75 |
| 4 | 35TNM | 28 September 2018 | 18 October 2018 | 3 October 2018 | 0.25 |
|  |  |  |  | 8 October 2018 | 0.5 |
|  |  |  |  | 13 October 2018 | 0.75 |
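The Input Time Phase t in Tables 3 and 4 corresponds to the normalized temporal position of the reference acquisition between the two input acquisitions; for the evenly spaced five-image series used here this yields exactly 0.25, 0.5, and 0.75. The short sketch below illustrates this with the Sentinel-2 tile 18STF dates from Table 4; the helper function is an illustration, not code from the paper.

```python
from datetime import date

def time_phase(ref, first, last):
    """Normalized position of a reference acquisition between two input acquisitions."""
    return (ref - first).days / (last - first).days

first, last = date(2019, 1, 11), date(2019, 1, 31)        # inputs for tile 18STF
for ref in (date(2019, 1, 16), date(2019, 1, 21), date(2019, 1, 26)):
    print(ref.isoformat(), time_phase(ref, first, last))  # 0.25, 0.5, 0.75
```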
Table 5. Quantitative comparison of multi-temporal sequence image interpolation results on the Landsat-8 dataset. For each indicator, results in bold are the best and those marked with * are the second best. Time phases 1, 2, and 3 represent the experimental results for the middle three time phases t = 0.25, 0.50, and 0.75. Arrows ↓ and ↑ indicate that smaller and larger values are better, respectively.
| Experiment ID | Method | Phase 1 RMSE↓ | Phase 1 PSNR↑ | Phase 1 SSIM↑ | Phase 2 RMSE↓ | Phase 2 PSNR↑ | Phase 2 SSIM↑ | Phase 3 RMSE↓ | Phase 3 PSNR↑ | Phase 3 SSIM↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Linear | 9.357 | 29.154 | 0.957 | 9.225 | 29.330 | 0.958 | 9.106 | 29.513 | 0.959 |
|  | Unet [31] | 10.997 | 27.980 | 0.942 | 10.074 | 28.769 | 0.949 | 9.823 | 29.028 | 0.955 |
|  | Super SloMo [30] | 8.996 | 29.550 | 0.961 | 8.813 | 29.766 | 0.963 | 8.584 | 30.102 | 0.966 |
|  | IFRNet [32] | 8.140 * | 30.197 * | 0.971 * | 7.892 * | 30.367 * | **0.977** | 7.642 * | 30.734 * | 0.975 * |
|  | Ours | **7.860** | **30.383** | **0.972** | **7.216** | **31.359** | 0.972 * | **7.069** | **31.736** | **0.978** |
| 2 | Linear | 8.448 | 30.091 | 0.967 | 8.395 | 30.224 | 0.972 | 6.528 | 32.299 | 0.979 |
|  | Unet [31] | 10.763 | 28.091 | 0.947 | 9.830 | 28.948 | 0.954 | 9.579 | 29.070 | 0.957 |
|  | Super SloMo [30] | 8.178 | 30.329 | 0.971 | 8.372 | 30.168 | 0.969 | 6.216 | 32.703 | 0.980 * |
|  | IFRNet [32] | 7.384 * | 31.173 * | 0.977 * | 6.900 * | 31.912 * | 0.979 * | 5.696 * | 33.434 * | **0.984** |
|  | Ours | **6.770** | **31.940** | **0.980** | **6.275** | **32.732** | **0.981** | **5.532** | **33.752** | **0.984** |
| 3 | Linear | 8.588 | 30.038 | 0.965 | 8.734 | 29.804 | 0.963 | 7.192 | 31.500 | 0.977 |
|  | Unet [31] | 9.304 | 29.228 | 0.958 | 9.254 | 29.298 | 0.958 | 8.862 | 29.740 | 0.961 |
|  | Super SloMo [30] | 7.841 | 30.401 | 0.974 | 7.800 | 30.446 | 0.975 | 6.928 | 31.738 | 0.979 |
|  | IFRNet [32] | 6.884 * | 31.931 * | 0.979 * | 7.077 * | 31.655 * | 0.978 * | 6.196 * | 33.150 * | 0.982 * |
|  | Ours | **6.154** | **33.375** | **0.983** | **6.268** | **33.112** | **0.981** | **5.648** | **33.453** | **0.984** |
| 4 | Linear | 8.618 | 29.858 | 0.964 | 8.976 | 29.674 | 0.961 | 8.509 | 30.163 | 0.967 |
|  | Unet [31] | 9.276 | 29.229 | 0.958 | 9.344 | 29.192 | 0.957 | 9.179 | 29.458 | 0.959 |
|  | Super SloMo [30] | 7.130 * | 30.981 | 0.976 | 7.396 | 31.111 | 0.976 * | 7.998 | 30.279 | 0.971 |
|  | IFRNet [32] | 7.130 * | 31.545 * | 0.978 * | 6.910 * | 31.910 * | **0.979** | 7.127 * | 31.615 * | 0.978 * |
|  | Ours | **6.377** | **32.436** | **0.980** | **6.889** | **31.917** | **0.979** | **6.210** | **33.119** | **0.980** |
Table 6. Quantitative comparison of multi-temporal sequence image interpolation results on the Sentinel-2 dataset. For each indicator, results in bold are the best and those marked with * are the second best. Time phases 1, 2, and 3 represent the experimental results for the middle three time phases t = 0.25, 0.50, and 0.75. Arrows ↓ and ↑ indicate that smaller and larger values are better, respectively.
| Experiment ID | Method | Phase 1 RMSE↓ | Phase 1 PSNR↑ | Phase 1 SSIM↑ | Phase 2 RMSE↓ | Phase 2 PSNR↑ | Phase 2 SSIM↑ | Phase 3 RMSE↓ | Phase 3 PSNR↑ | Phase 3 SSIM↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Linear | 11.029 | 28.083 | 0.959 | 12.031 | 27.389 | 0.942 | 11.501 | 27.633 | 0.944 |
|  | Unet [31] | 9.211 | 29.751 | 0.963 | **9.308** | **29.745** | **0.962** | 10.167 | 28.557 | 0.961 * |
|  | Super SloMo [30] | 10.898 | 28.234 | 0.960 | 11.109 | 27.998 | 0.959 | 10.485 | 28.524 | 0.961 * |
|  | IFRNet [32] | 8.368 * | 30.754 * | 0.970 * | 9.969 | 28.949 | 0.961 * | 9.472 * | 29.709 * | 0.961 * |
|  | Ours | **7.486** | **31.411** | **0.975** | 9.821 * | 29.558 * | 0.960 | **9.140** | **29.820** | **0.965** |
| 2 | Linear | 11.961 | 27.458 | 0.943 | 12.924 | 26.499 | 0.940 | 11.135 | 27.765 | 0.959 |
|  | Unet [31] | 9.575 | 29.948 * | 0.959 | 9.124 | 29.854 | 0.966 * | 7.888 * | 31.104 * | 0.973 * |
|  | Super SloMo [30] | 11.276 | 27.996 | 0.948 | 10.157 | 28.816 | 0.960 | 10.313 | 28.536 | 0.961 |
|  | IFRNet [32] | 9.412 * | 29.733 | 0.962 * | 9.028 * | 29.980 * | 0.966 * | 8.821 | 29.858 | 0.970 |
|  | Ours | **8.597** | **30.856** | **0.967** | **8.471** | **30.664** | **0.970** | **7.237** | **31.873** | **0.978** |
| 3 | Linear | 11.295 | 27.914 | 0.948 | 11.822 | 27.466 | 0.943 | 14.500 | 25.598 | 0.932 |
|  | Unet [31] | 10.126 | 29.110 | 0.960 | 10.160 | 28.653 | 0.960 * | 12.852 | 26.000 | 0.940 * |
|  | Super SloMo [30] | 10.868 | 28.241 | 0.960 | 10.826 | 28.323 | 0.960 * | 13.053 | 26.403 | 0.936 |
|  | IFRNet [32] | 9.230 * | 29.747 * | 0.963 * | 9.670 * | 29.707 * | 0.959 | 11.911 * | 27.460 * | **0.943** |
|  | Ours | **8.962** | **30.123** | **0.967** | **9.194** | **29.787** | **0.964** | **11.809** | **27.501** | **0.943** |
| 4 | Linear | 11.440 | 27.687 | 0.946 | 13.121 | 26.249 | 0.936 | 11.392 | 27.835 | 0.947 |
|  | Unet [31] | 9.417 | 29.721 * | 0.962 | 11.158 | 28.170 * | 0.960 * | 8.352 | 30.531 | 0.968 |
|  | Super SloMo [30] | 10.111 | 29.207 | 0.960 | 12.191 | 26.936 | 0.941 | 9.771 | 29.690 | 0.959 |
|  | IFRNet [32] | 9.348 * | 29.679 | 0.963 * | 10.933 * | 28.149 | 0.960 * | 8.282 * | 30.791 * | **0.971** |
|  | Ours | **8.010** | **30.901** | **0.970** | **9.126** | **29.853** | **0.966** | **7.999** | **30.903** | 0.970 * |
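Tables 5 and 6 report RMSE, PSNR, and SSIM between each interpolated image and the corresponding real acquisition. One conventional way to compute these metrics is sketched below with NumPy and scikit-image; the data range of 255 and the band-wise averaging of SSIM are assumptions, since the exact evaluation settings are not specified here.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def evaluate(pred, true, data_range=255.0):
    """Return (RMSE, PSNR, SSIM) for two images of shape (bands, H, W)."""
    pred = pred.astype(np.float64)
    true = true.astype(np.float64)

    rmse = np.sqrt(np.mean((pred - true) ** 2))
    psnr = 20.0 * np.log10(data_range / rmse)
    ssim_val = np.mean([ssim(true[b], pred[b], data_range=data_range)
                        for b in range(true.shape[0])])
    return rmse, psnr, ssim_val
```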