Gap-Filling and Missing Information Recovery for Time Series of MODIS Data Using Deep Learning-Based Methods

Abstract: Sensors onboard satellite platforms with short revisit periods acquire frequent earth observation data. One limitation to the utility of satellite-based data is missing information in the time series of images due to cloud contamination and sensor malfunction. Most studies on gap-filling and cloud removal process individual images, and existing multi-temporal image restoration methods still have problems in dealing with images that have large areas with frequent cloud contamination. Considering these issues


Introduction
The rapid development of remote sensing technology provides abundant earth observation images at varied spectral, spatial, and temporal resolutions and promotes a broad range of applications, such as land cover and land use classification [1,2], map vectorization [3], land surface phenology monitoring [4,5], and natural disaster modeling [6,7]. Multispectral imaging sensors, which typically operate in a few spectral bands ranging from visible and near-infrared to shortwave-infrared wavelengths, are among the most common remote sensing instruments; a multispectral imaging system acquires images of the earth's surface by detecting solar radiation reflected from targets on the ground. The utility of multispectral images is limited by large amounts of missing information in the remote sensing data due to cloud contamination and sensor malfunction [8]. Restoring missing information and gap-filling for remote sensing images help provide continuous and consistent data for downstream applications.
Prior work developed the Convolutional-Mapping-Deconvolutional Network using both optical and SAR images to realize thick cloud removal. Although such frameworks have made progress in restoring multi-temporal remote sensing images using deep learning approaches, there is room for improvement: the restored images are prone to blurring and become inaccurate when large areas have thick cloud cover or when multi-temporal images have large overlapping areas with missing data.
Aiming to address the above-mentioned limitations, we propose a content-sequence-texture generation (CSTG) network, a deep learning reconstruction method that accounts for restoring the content, temporal sequence, and spatial texture of images. The goal of this study is to design and evaluate a method that consists of two networks (i.e., a content generation network and a sequence-texture generation network) for recovering time series of remote sensing images with missing data. Different from most recovery methods that rely on multi-temporal information, we aim to develop a method that allows for reconstructing time series of remote sensing images simultaneously, even if there are no completely cloud-free images in the time series. The idea of the proposed deep learning network is to consider content consistency, texture details, and sequential trend information, and to apply both spectral loss and structural similarity loss to improve the accuracy of image reconstruction. Compared to existing methods, we aim to develop robust methods that can process time series of images even with large missing areas and/or overlapped clouds on multi-temporal images.

Study Materials
The Moderate Resolution Imaging Spectroradiometer (MODIS) onboard the Terra and Aqua satellite platforms provides near-daily global coverage of land surface observations. Due to cloud contamination and weather conditions, daily MODIS data have serious quality issues, and thus the MODIS science team develops and offers 8-day and 16-day composite MODIS products, which largely exclude low-quality data. Considerable gaps and missing data still remain in the time series of MODIS products, making them an ideal case for gap-filling and missing information recovery studies. We used both the 500-m 8-day composite surface reflectance data of MOD09A1 and the 250-m 16-day composite vegetation index data of MOD13Q1. From both products, we used surface reflectance data in the red, blue, and near-infrared bands.
The gap-filling methods were trained, validated, and tested in areas of Asia, Europe, and North America in the northern hemisphere (Figure 1). Each study dataset has an image size of 400 × 400 pixels, and the images contain high-quality observations in the time series. The data marked with blue segments were used to train the model, and the data with green segments were used for model validation. We used the data with red segments (A-F) and a whole tile image with a pink segment for model testing.
Study area A crosses the borders of Russia and Ukraine. It has a temperate continental climate with distinctive seasonal surface changes. The images used from MOD09A1 were acquired on 30 September and 8 October in 2020. Study area B is situated in the central part of Poland. The area contains several land cover types, such as buildings, rivers, and vegetation. The images from MOD09A1 were acquired on 26 February and 29 August in 2019. Study area C is located along the Atlantic coast in the northeastern United States. The data from MOD09A1 were acquired on 26 February, 22 March, 23 April, and 17 November in 2016. Study area D crosses the border between China and Myanmar. Most of the study area is located in Yunnan in China, with a small area that covers the northeastern portion of Myanmar. The data from MOD13Q1 were acquired on 2 February, 5 March, 31 October, 16 November, and 18 December in 2019. Study area E comes from the entire tile (4800 × 4800 pixels) of H27V06 in MOD13Q1, acquired on 25 June in 2019. The H27V06 tile covers southwestern China, Hanoi, Laos, Thailand, and northern Myanmar. The remote sensing images in this area are prone to cloud contamination in summer due to the influence of the monsoon climate. The area with cloud contamination in the study image exceeds two-thirds of the entire image, representing a challenging case for gap-filling. Study area F is located near the Pearl River estuary. The study images from MOD09A1 were acquired on 18 February, 21 March, 22 April, 15 October, and 24 November in 2020. There are lush vegetation and a wide variety of species in the study area, which has a subtropical monsoon climate. The climate is hot and humid, making it suitable for testing the ability of the models. All the products we used in this research are listed in Table 1.

We divided the test data into two categories (artificial dataset and observed dataset) to assess the performance of the developed network. The artificial dataset consists of originally cloud-free images preprocessed with artificial cloud masks and is used to compare the reconstruction results with the original images quantitatively to assess the model's accuracy. The observed dataset consists of original images contaminated by clouds and is used to demonstrate the practical ability of the model. Each category was applied in two settings: individual images and time series of images, with regional or large-scale missing areas, respectively.

Methodology
To address the limitations associated with methods in previous research, we propose a gap-filling and image recovery method for time series of images based on a content-sequence-texture generation network. The overall flowchart of the proposed method is shown in Figure 2. In the training process, we chose several cloud-free images with time series information and cut them into patches, which were considered as target data. We added masks with random positions and sizes to the cloud-free patches and filled the masked areas with random spectral noise as the network input. When conducting model testing, we stacked the time series of images for restoration and excluded pixels with clouds and/or poor quality in the images based on quality control data. The time series of images were then split into smaller patches. The image patches were fed into a content generation network to fill gaps initially and obtain general content, and a sequence-texture generation network was then used to refine the texture details in order to reduce the texture and spectral differences. After mosaicking gap-filled patches into the generated maps, the corresponding pixels in the generated maps were selected to replace the masked areas in the original images. The high-quality, gap-filled images are generated as follows:

R_h = R ⊙ M_0 + R_0 ⊙ (1 − M_0)

where R_h denotes the final high-quality result; R denotes the feature maps generated by the CSTG network; R_0 represents the original image after poor-pixel masking; M_0 denotes the binary mask of poor-quality pixels (1 for masked pixels and 0 otherwise); and ⊙ denotes element-wise multiplication.
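The final replacement step above can be sketched as a few lines of NumPy. This is a minimal illustration, not the authors' code; the function name and single-band arrays are assumptions for clarity.

```python
import numpy as np

def compose_result(generated, original, mask):
    """Combine the network output with the original image: keep original
    pixels where quality is good and use generated pixels inside the mask.

    generated: feature map R from the reconstruction network, shape (H, W)
    original:  original image R_0 with poor-quality pixels masked, shape (H, W)
    mask:      M_0, 1 where pixels were masked as poor quality, 0 elsewhere
    """
    mask = mask.astype(generated.dtype)
    return generated * mask + original * (1.0 - mask)
```

In practice this is applied per band after the gap-filled patches have been mosaicked back to the image grid.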


Data Quality Evaluation and Preprocessing
To fill gaps in remote sensing images, it is necessary to determine the pixels and areas that are contaminated by clouds or have low quality. There are various methods for detecting low-quality data in the products derived from different sensors. For instance, the FMask [37] and MSCFF [38] algorithms have been used to extract cloud masks from Landsat and Sentinel data, respectively. The MODIS products provide a layer of state flags, which consists of binary codes representing the quality of observations for each pixel. Although the quality control data in the MODIS products are helpful for extracting pixels with poor quality, cloudy pixels are often missed in the quality control data. As the reflectance of clouds in the visible bands is much higher than that of other objects, we conservatively masked out pixels with reflectance greater than 0.25 in the visible bands. To reduce speckle noise, we expanded the masks using 5-pixel morphological dilation. We then filled the masked areas with random noise, which follows a normal distribution with the same mean value and variance as the high-quality surrounding areas. The random noise helps enhance spectral information in the original image and can be written as:

V ~ N(μ, σ²)

where V denotes a set of random variables for filling in masked areas; μ and σ² are the mean value and the variance of the normal distribution, respectively, and can be calculated as:

μ = (1 / (wh − m)) Σ_{(x,y)∉M} r(x,y)

σ² = (1 / (wh − m)) Σ_{(x,y)∉M} (r(x,y) − μ)²

where r(x,y) denotes the surface reflectance of the pixel located at the coordinates (x, y); w and h denote the width and the height of the image, respectively; M denotes the set of all masked pixels; and m denotes the total number of masked pixels.
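The noise-filling step described above can be sketched in NumPy as follows. This is an illustrative fragment under stated assumptions (single-band array, morphological dilation omitted); the function name is hypothetical.

```python
import numpy as np

def fill_masked_with_noise(image, mask, seed=0):
    """Fill masked pixels with Gaussian noise whose mean and variance
    match the surrounding high-quality (unmasked) pixels.

    image: single-band reflectance array containing poor-quality pixels
    mask:  1 for masked (poor-quality) pixels, 0 for clear pixels
    """
    rng = np.random.default_rng(seed)
    clear = image[mask == 0]
    mu = clear.mean()        # mean of the high-quality pixels
    sigma = clear.std()      # standard deviation of the high-quality pixels
    out = image.copy()
    out[mask == 1] = rng.normal(mu, sigma, size=int((mask == 1).sum()))
    return out
```

The filled noise only seeds the network input; the masked pixels are later replaced by the network's predictions.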

The Content Generation Network
The content generation network, as shown in Figure 3, is used to initially fill the gaps in the images based on supplementary information from the other images in the time series. It consists of three components: a multi-scale feature extraction module, a feature encoding module, and a decoding module. Convolution kernels are typically small, and odd kernel sizes simplify the padding strategy and make it easy to locate the output pixels. We set the kernel sizes to three, five, and seven, respectively, to perform multi-scale feature extraction for each image in the time series. After concatenating the extracted features, a network structure similar to an auto-encoder was constructed to fill in the missing areas of the images. We encoded multi-scale features using a MaxPooling layer to compress the input size and increase the receptive field while retaining as many detailed features as possible. Several convolution blocks, each composed of a convolution layer, a batch normalization (BN) layer, and a rectified linear unit (ReLU), were designed to extract features. Then we concatenated features from each time step in the channel dimension. We employed decoding with several convolution blocks and an UpSampling layer to ensure that the feature maps have the same size as the input maps. All the features were compressed into a feature group with a series of feature maps, and the number of feature maps was equal to the total number of bands in the input time series. Studies have shown that loss functions that consider both global and local scales may perform well on cloud removal [35]. The loss function of the content generation network is defined based on root mean squared errors (RMSEs) between the output feature maps and the target maps, which refer to the high-quality input data before random masking.

The loss function consists of both global loss and local loss as follows:

L_c = L_global + L_local

Global loss refers to RMSEs calculated for the entire image patches, and local loss refers to RMSEs calculated for the masked areas in the image patches. Global loss helps improve the global consistency of the generated feature map, and local loss helps improve the local consistency of the recovered area. The equations for global and local losses are as follows:

L_global = sqrt( (1 / (wh)) Σ_{x=1..w} Σ_{y=1..h} (p(x,y) − r(x,y))² )

L_local = sqrt( (1 / m) Σ_{(x,y)∈M} (p(x,y) − r(x,y))² )

where L_c denotes the total loss of the content generation network; L_global and L_local denote global loss and local loss, respectively; w and h denote the width and the height of an image patch, respectively; m denotes the total number of mask pixels; M denotes the set of mask pixels; p(x,y) denotes the predicted surface reflectance in the output feature maps of the pixel located at the coordinates (x, y); and r(x,y) denotes the observed surface reflectance in the target maps of the pixel located at the coordinates (x, y).
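The global-plus-local RMSE loss above can be sketched directly in NumPy. This is a minimal single-band sketch (the function name is an assumption), not the training implementation:

```python
import numpy as np

def content_loss(pred, target, mask):
    """L_c = L_global + L_local: RMSE over the full patch plus RMSE over
    the masked pixels only, following the definitions above."""
    l_global = np.sqrt(np.mean((pred - target) ** 2))
    masked = mask == 1
    l_local = np.sqrt(np.mean((pred[masked] - target[masked]) ** 2))
    return l_global + l_local
```

Because the local term averages only over the masked pixels, the recovered area contributes to the loss with a much larger weight than it would under a purely global RMSE.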

The Sequence-Texture Generation Network
The sequence-texture generation network accounts for changes in the time series of images at the same position and for the texture characteristics of neighboring pixels in missing information recovery. The core idea of the network is to learn spectral change information in the time series with recurrent neural networks (RNN) and contextual information from the neighboring cloud-free areas with convolutional neural networks (CNN), and to further enrich the spatiotemporal information in the feature map obtained by content generation. An LSTM-CNN is used as the model structure (Figure 4a). In the model structure, ConvLSTM [39] is used to extract time series features of the input images. LSTM is a deep learning network that is suitable for tackling sequential problems with regular time intervals [40]. ConvLSTM inherits the long-term memory functions and sequence modeling advantages of LSTM, which can capture information from previous time steps. The main difference between ConvLSTM and LSTM is the dimension of the input data. The input of LSTM is one-dimensional, which makes it difficult to deal with spatio-temporal images or videos, while ConvLSTM expands the dimension of the input data to three to solve this problem. Another improvement is that ConvLSTM uses convolution operations to calculate the cell output and the hidden layer state, instead of the single matrix multiplication used in LSTM. Figure 4b illustrates the internal structure of a ConvLSTM cell at a given time step.
The cell runs through three self-parameterized controlling gates, i.e., an input gate (i_t), a forget gate (f_t), and an output gate (o_t), which are calculated as follows:

i_t = σ(W_xi • x_t + W_hi • h_{t−1} + W_ci * C_{t−1} + b_i)
f_t = σ(W_xf • x_t + W_hf • h_{t−1} + W_cf * C_{t−1} + b_f)
o_t = σ(W_xo • x_t + W_ho • h_{t−1} + W_co * C_t + b_o)

The cell output C_t and the hidden state h_t can be calculated as:

C_t = f_t * C_{t−1} + i_t * tanh(W_xc • x_t + W_hc • h_{t−1} + b_c)
h_t = o_t * tanh(C_t)

where x_t denotes the input at time step t; C_{t−1} denotes the output of the past cell; h_{t−1} denotes the past hidden states; W and b are the weights and biases in the input gate, the forget gate, the output gate, and the cell output; σ denotes the sigmoid function; and • and * denote the convolution operator and the Hadamard product, respectively. In general, the unit at time step t receives both the cell status and outputs from the unit at the previous time step and passes both the cell status and outputs to the unit at the next time step, thereby forming a time-sequence memory.

A CNN with a residual connection structure was used for texture generation. The residual connection structure breaks the symmetry of the CNN and helps pass detailed information to the decoding layers, thereby improving the representation of the deep network. The networks described above are able to generate gap-filled maps with continuous spectral reflectance and detailed texture.
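The ConvLSTM gate computations can be sketched for a single-channel state as below. This is a toy NumPy illustration of the update rules, assuming a naive same-padding convolution and omitting the peephole terms for brevity; function names and the weight-dictionary layout are assumptions, not the authors' implementation.

```python
import numpy as np

def conv2d_same(x, k):
    """Naive single-channel 'same' convolution (cross-correlation form)."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.empty_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x_t, h_prev, c_prev, W, b):
    """One ConvLSTM time step (single channel, peepholes omitted)."""
    i_t = sigmoid(conv2d_same(x_t, W['xi']) + conv2d_same(h_prev, W['hi']) + b['i'])
    f_t = sigmoid(conv2d_same(x_t, W['xf']) + conv2d_same(h_prev, W['hf']) + b['f'])
    o_t = sigmoid(conv2d_same(x_t, W['xo']) + conv2d_same(h_prev, W['ho']) + b['o'])
    g_t = np.tanh(conv2d_same(x_t, W['xc']) + conv2d_same(h_prev, W['hc']) + b['c'])
    c_t = f_t * c_prev + i_t * g_t   # cell state update
    h_t = o_t * np.tanh(c_t)         # hidden state
    return h_t, c_t
```

In a real network each gate operates on multi-channel feature maps, and frameworks such as TensorFlow provide this as a built-in recurrent layer.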

The loss function for the sequence-texture generation network consists of both spectrum (SPE) loss and structural similarity (SSIM) loss, which is calculated as:

L_t = L_SPE + λ L_SSIM

where L_t, L_SPE, and L_SSIM denote the overall loss, spectrum loss, and structural similarity loss, respectively; λ is a balance parameter that is empirically set to 0.01. The SPE loss is defined based on RMSE as the sum of global loss and local loss:

L_SPE = L_global + L_local

SSIM accounts for the similarity of brightness, contrast, and structure of two images, and the loss function is defined as follows:

L_SSIM = 1 − ((2 p̄ r̄ + c_1)(2 σ_pr + c_2)) / ((p̄² + r̄² + c_1)(σ_p² + σ_r² + c_2))

where p̄ and r̄ denote the mean values of the output feature maps and the target maps, respectively; σ_p², σ_r², and σ_pr denote the variance of the output, the variance of the target, and their covariance, respectively; and c_1 and c_2 denote two equilibrium parameters.
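The combined loss can be sketched as follows. This is a single-band NumPy illustration; the function names and the values of the equilibrium constants c_1 and c_2 are assumptions (the paper does not state them), and SSIM is computed globally over the patch rather than with a sliding window.

```python
import numpy as np

def ssim_loss(p, r, c1=0.01 ** 2, c2=0.03 ** 2):
    """L_SSIM = 1 - SSIM, computed globally over the patch.
    c1 and c2 are small equilibrium constants (assumed values)."""
    mp, mr = p.mean(), r.mean()
    vp, vr = p.var(), r.var()
    cov = ((p - mp) * (r - mr)).mean()
    ssim = ((2 * mp * mr + c1) * (2 * cov + c2)) / \
           ((mp ** 2 + mr ** 2 + c1) * (vp + vr + c2))
    return 1.0 - ssim

def sequence_texture_loss(pred, target, mask, lam=0.01):
    """L_t = L_SPE + lambda * L_SSIM, with L_SPE = global RMSE + local RMSE."""
    spe = np.sqrt(np.mean((pred - target) ** 2))
    m = mask == 1
    spe += np.sqrt(np.mean((pred[m] - target[m]) ** 2))
    return spe + lam * ssim_loss(pred, target)
```

With λ = 0.01 the SSIM term acts as a structural regularizer on top of the spectrally dominated SPE term.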

Model Training and Setup
We normalized each group of training datasets and cut them into small patches with a window size of 32 × 32. The cropping stride was set to 20, such that the patches overlapped. The generated patches were augmented in five directions, i.e., the patches were rotated 90°, 180°, and 270°, respectively, and flipped both vertically and horizontally. We obtained 24,000 groups of training patches and 4800 validation patches for the content generation network training. Holdout validation was used to evaluate model performance during the training process. Different from cross-validation, the holdout procedure keeps the training set and the validation set independent, with no intersection between them. In our study, we chose seven different study areas: five for the training set and two for the validation set. In each epoch, we added masks with random positions and sizes to multi-temporal patches and filled them with spectral noise. After each training epoch, the validation data were fed into the resulting model to evaluate its performance on data completely different from the training data.
We generated 20 sets of cloud masks with random positions and radii and added the masks to sequential patches. After filling gaps using the content generation network, we obtained 20 sets of data for sequence-texture generation, of which 16 were used as the training datasets and the remaining four as the validation sets. We cut the training datasets into 32 × 32 patches and augmented the data by rotating and flipping. Finally, 38,400 groups of training patches and 9600 groups of validation patches were generated for the sequence-texture generation network.
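The five-way rotation-and-flip augmentation described above can be sketched in one line of NumPy (the function name is an assumption):

```python
import numpy as np

def augment_patch(patch):
    """Return the five augmented variants: 90/180/270-degree rotations
    plus vertical and horizontal flips."""
    return [np.rot90(patch, k) for k in (1, 2, 3)] + \
           [np.flipud(patch), np.fliplr(patch)]
```

Together with the original patch, this yields six geometric views of each training sample.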
The networks were implemented in the TensorFlow framework and trained with the Adam [41] adaptive optimization method on an Ubuntu 18.04.2 system (released by Canonical on 26 April 2018) equipped with NVIDIA RTX 2080Ti GPUs. The initial learning rates for the content generation network and the sequence-texture generation network were set to 0.0005 and 0.001, respectively, and the attenuation coefficients were both set to 0.8 at intervals of 500 epochs. We trained each network for 1000 epochs and selected the model with the lowest validation loss as the final model for testing.
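The stepwise learning-rate schedule described above amounts to the following (the function name is an assumption used only for illustration):

```python
def stepped_learning_rate(epoch, base_lr, decay=0.8, interval=500):
    """Learning rate with an attenuation coefficient of 0.8
    applied every 500 epochs, as in the training setup above."""
    return base_lr * decay ** (epoch // interval)
```

For the sequence-texture network (base rate 0.001), the rate therefore drops to 0.0008 at epoch 500 and stays there through the end of the 1000-epoch run.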

Model Comparisons
We also introduced several existing methods designed for other remote sensing products to compare the results, including weighted linear regression (WLR) proposed by Zeng et al. [30], spatial-temporal-spectral (STS) joint CNN developed by Zhang et al. [42], Chen's method [34], and Zhang's method [35]. Among these methods, WLR is designed for Landsat ETM+ SLC-off imagery; STS and Zhang's method are generally applied in Landsat and Sentinel-2 data; and Chen's method is for ZY-3 image reconstruction. All of them were applied in this research to restore artificial missing areas and conduct gap-filling for individual images.
The WLR used a linear relationship to derive missing values from locally similar pixels. It selected similar pixels according to a searching window and accounted for both spatial and spectral differences to calculate the weight of each similar pixel. The WLR also used a regularization method to deal with the situation when available reference images were not sufficient. The method has been packaged into executable software (http://sendimage.whu.edu.cn/send-resource-download/ (accessed on 3 November 2016)) and has proven its performance on artificial and observed Landsat SLC-off ETM+ images.
The STS combines spatial-temporal-spectral joint information to construct a deep CNN and applies multi-scale feature extraction units to obtain context information. Dilated convolutional layers were used to enlarge the receptive field, and skip connections were used to transfer the information from the previous layer. Studies have found the method could perform well on reconstructing both SLC-off ETM+ and Terra MODIS band 6 data.
The method developed by Chen et al. used one CNN to detect and mask cloudy areas and another CNN to fill gaps based on content, texture, and spectral generation. The idea was successfully used to recover ZY-3 multispectral and panchromatic images.
Zhang et al. took four steps to recover missing information in images, including multi-temporal cloud detection, patch group stacking and sorting, a deep learning recovery model, and weighted aggregation with progressive iteration. The method developed by Zhang et al., which combines both deep learning models and mathematical models, performed well on reconstructing Landsat-8 and Sentinel-2 data.
Among the above-mentioned methods, the WLR method recovers images on the basis of statistics, while the other three comparison methods are mainly based on deep learning. We retrained the three deep-learning-based models on the MODIS data and predicted their reconstruction results. As WLR, STS, and Chen's method are designed for individual image reconstruction, we chose a high-quality image as a reference to recover cloudy images. As only Zhang's method and the CSTG method can reconstruct multi-temporal images simultaneously, we further compared them in reconstructing time series of images.

Evaluation Metrics
To evaluate the method's performance, we compared the surface reflectance in both visible and near-infrared bands. Three evaluation metrics were used, i.e., the coefficient of determination (R²), mean absolute error (MAE), and structural similarity index measure (SSIM). MAE is a widely used index that measures the difference between model simulation and ground truth values. SSIM is widely used to measure the similarity between two images and can be regarded as an index of the quality of distorted images. The metrics mentioned above are calculated as follows:

MAE = (1 / m) Σ_{(x,y)∈M} |p(x,y) − r(x,y)|

R² = 1 − Σ_{(x,y)∈M} (r(x,y) − p(x,y))² / Σ_{(x,y)∈M} (r(x,y) − r̄)²

SSIM = ((2 p̄ r̄ + c_1)(2 σ_pr + c_2)) / ((p̄² + r̄² + c_1)(σ_p² + σ_r² + c_2))

where w and h denote the width and the height of the image patch for testing, respectively; M denotes the set of mask pixels and m denotes the total number of mask pixels; p(x,y) and r(x,y) denote the predicted and observed surface reflectance for pixels located at the coordinates (x, y), respectively; p̄ and r̄ denote the mean values of the predicted data and the observed data, respectively; σ_p², σ_r², and σ_pr denote the variance of prediction, the variance of observation, and their covariance, respectively; and c_1 and c_2 are equilibrium parameters.

Results

Figure 5 presents the recovery results of individual images using five different methods when applying local and dispersive masks. Visually, the images recovered by WLR, Zhang's method, and our proposed CSTG method captured the spectral and spatial characteristics of the original image well (Figure 5d,g,h). There are no obvious boundaries between the masked areas and their surroundings in the recovered images, and their results are closer to the original image than those derived from the other methods. STS has the problem of texture blurring (e.g., the areas marked with yellow boxes in Figure 5e), and Chen's method suffers from spectral inconsistency (e.g., the area marked by the yellow box in Figure 5f).
Figure 6 exhibits the recovery results of individual temporal images using five different methods when applying a large and intensive mask. The images recovered by Zhang's method and CSTG (Figure 6g,h) are more consistent with the original images than those of the other methods. Although WLR performs well at a small scale, it suffers from over-smoothing and missing texture details at a large scale (e.g., the yellow box in Figure 6d). Both STS and Chen's method appear to have the same problems as those in Figure 5.

Figure 7a shows the pixel-level error maps, which reflect the mean reflectance difference of each pixel between the reconstructed maps and the original images for each band. According to the error maps, the different methods tend to show miscalculation in similar locations. In addition, there are fewer areas whose mean errors exceed 0.02 (red spots in Figure 7a) in WLR, Zhang's method, and CSTG compared with the other two methods.
Scatter plots of both surface reflectance and the normalized difference vegetation index (NDVI) were also derived for the masked areas between the recovered and original images. For the surface reflectance in the four bands (Figure 7b), WLR, Zhang's method, and CSTG generated results consistent with the original images, with R² values ranging from 0.944 to 0.954, MAE values from 0.012 to 0.025, and SSIM values from 0.966 to 0.971. When comparing the NDVI values between the original images and the images recovered using WLR, Zhang's method, and CSTG (Figure 7c), the R² values range from 0.960 to 0.969, the MAE values from 0.014 to 0.016, and the SSIM values from 0.951 to 0.955. Both STS and Chen's method produce images with slightly lower R² and higher MAE against the original images than the other three methods. When evaluating individual bands, as shown in Table 2, CSTG shows the best performance among the tested methods in Bands 1, 2, and 4, where R² ranges from 0.930 to 0.943, SSIM from 0.950 to 0.958, and MAE from 0.008 to 0.024. Chen's method shows higher SSIM in most bands and higher accuracy in Band 3 (R² = 0.934, MAE = 0.005, and SSIM = 0.953). According to the MAE metric, all methods show lower values in the visible bands (Bands 1, 3, and 4) than in the near-infrared band (Band 2).
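For reference, NDVI is derived from the red (MODIS Band 1) and near-infrared (MODIS Band 2) surface reflectance; a minimal sketch follows, where the small eps guard is our own addition to avoid division by zero over dark pixels:

```python
import numpy as np

def ndvi(red, nir, eps=1e-9):
    """Normalized difference vegetation index from surface reflectance.

    For MODIS, Band 1 is the red band and Band 2 the near-infrared band.
    """
    red = np.asarray(red, dtype=float)
    nir = np.asarray(nir, dtype=float)
    return (nir - red) / (nir + red + eps)
```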
When applying a large and intensive mask, all methods show high similarity in surface reflectance and NDVI between the original and recovered images (Figure 8 and Table 3). In addition, Figure 8a suggests that Zhang's method and CSTG provide error maps with lower values on average. In particular, the map derived by CSTG has fewer pixels with values over 0.02 than the other maps, indicating higher prediction accuracy.
For surface reflectance in all bands, Zhang's method and CSTG exhibit high similarity between the original and recovered images, with R² values of 0.979 and 0.982, MAE values of 0.010 and 0.009, and SSIM values of 0.909 and 0.926, respectively. All models show higher R² values but lower SSIM values in surface reflectance when applying a large and intensive mask than when applying small and dispersive masks. For individual bands, nearly all models have higher recovery accuracy in the visible bands (Bands 1, 3, and 4) than in the near-infrared band (Band 2), as shown in Table 3. The bias in Band 2 reduces the restoration accuracy at the large scale. Among all methods, CSTG restores images with consistently high accuracy when applying both large and small masks.

Taking two scenes as examples, Figure 9 displays the recovery results for time series of images with irregularly distributed masks. For images that have few repetitive and overlapping masks in the time series (Figure 9A), both Zhang's method and CSTG produce reasonable recoveries compared to the original images in terms of image texture and spectrum in the modestly cloudy areas (Figure 9A, 18 July and 30 September), although Zhang's method fails to recover some dark shadows in the largely masked areas (Figure 9A, 29 August and 16 October). For images that have considerably repetitive and overlapping masks in the time series (Figure 9B), images recovered using Zhang's method have partial grid-like shades and black shadows after reconstruction, whether the masked areas are small (Figure 9B, 5 March and 18 December) or large (Figure 9B, 26 February and 23 April). One possible reason is that Zhang's method sorts the reference patches according to their integrity. Patches with higher integrity are considered reliable and are assigned higher weights in the calculation. When repetitively masked areas occur in patches that are erroneously considered reliable, the model uses the masked values as pixel data in subsequent analysis and generates unsatisfactory results. CSTG improves the performance of image recovery when there are repetitive and overlapping masks in the time series of images and is able to provide reasonable reconstructions.
Table 4 illustrates the evaluation metrics derived by comparing the surface reflectance in all bands between the original images and the images recovered using the different methods. As R² and SSIM are designed to evaluate the similarity of whole images, they are strongly affected by the area of the missing regions; both methods therefore show a decrease in R² and SSIM as the missing areas extend. For the images with small missing areas (scenario A, Images 2 and 4), Zhang's method and CSTG have similar performance according to the different metrics, while for larger missing areas (scenario A, Images 1, 3, and 5), CSTG shows markedly lower MAE and higher R² and SSIM. In scenario B, although both Zhang's method and CSTG have low MAE values when comparing the original images with the recovered images, Zhang's method shows decreasing R² (from 0.974 to 0.805) and SSIM (from 0.965 to 0.911) as the overlap of the masks increases, whereas CSTG retains R² over 0.94 and SSIM over 0.93. The mean values of the evaluation metrics indicate that CSTG generally performs better than Zhang's method in the two simulated circumstances, demonstrating that CSTG is robust for restoring time series of images.

Figure 10 illustrates the images recovered from individual MOD13Q1 images using the different methods. Both Zhang's method and CSTG could remove clouds and restore cloud-free images as compared to the original images. For the two zoomed areas shown in Figure 10b,c, CSTG effectively reduced the spectral noise of the cloudy image and captured the texture details well, whereas images recovered using Zhang's method have jagged grids and miss some texture details.

Figure 11 shows the recovery results for the time series of images acquired in 2020. Both methods could remove most clouds and restore the contents of the land surface in the time series. In areas with little cloud noise, the two methods produce relatively consistent recovery results (e.g., 22 April and 24 November). In areas with considerable cloud contamination, CSTG largely reduced the speckle noise in the repetitively cloudy areas and the image fragmentation due to spectral differences (18 February, 21 March, and 15 October) as compared to Zhang's method. Overall, CSTG performs well in removing clouds and filling image gaps and provides high-quality images in a time series.

Ablation Study of Sequence-Texture Generation Network
To understand the role of the sequence-texture generation network proposed in this research, we took the first image in Figure 9B as an example and compared the feature maps generated with and without the sequence-texture generation network. Table 5 shows the quantitative metrics evaluated with and without the sequence-texture generation network. Without it, R² dropped from 0.959 and 0.962 to 0.870 and 0.896 for Images 3 and 4, respectively, and MAE rose to 0.033 for Image 4. This indicates a significant improvement in each image when the sequence-texture generation network is used, especially for Images 1, 3, and 4, where the missing areas are large and overlapping.


Discussion
To demonstrate the spatial transferability of CSTG, Figure 12 shows the results of different land surface characteristics with thick clouds recovered by CSTG. Notably, although surface characteristics that were not included in our training data (such as the barren areas in Figure 12a) are generated with slight flaws, the results are still acceptable. Our method could perform gap-filling tasks in both vegetated regions (Figure 12b,d) and non-vegetated regions (Figure 12a,c). Even when the cloud cover in the patch window extends over a large area (Figure 12c,d), our method could still reconstruct the images and obtain clear textures.

The method proposed in this study obtains higher accuracy in the recovery of both individual images and time series of images, predominantly because abundant temporal information is extracted in the networks. Compared to the comparison models for individual image reconstruction, such as WLR, STS, and Chen's method, which can only draw on one image from another time, our method can exploit more temporal information and the change tendency in the time series. WLR considers locally similar pixels in the spectral domain and can usually obtain recovery results with high spectral similarity to the original images; however, non-local information such as textures and structures is ignored, so texture details may be blurred when recovering large missing areas. In CSTG, the sequence-texture generation network and the structural similarity loss are designed to alleviate this problem. Additionally, the prior spectral noise added in CSTG helps the network obtain more global spectral information than the other deep learning methods. Moreover, compared with the multi-temporal image recovery method, our model can deal with overlapping clouds, as we included this situation when generating random masks for model training. The sequence-texture generation network also reduces the spectral discrepancy caused by overlapping masks: it avoids the situation in which masked pixels are used as supplementary data, thus reducing grid-like shades and black shadows.

When performing model prediction, we need to clip large images into smaller patches for recovery and mosaic the recovered patches. As shown in Figure 13, splicing patches directly without overlap may lead to inaccurate predictions in the edge areas (Figure 13a) due to the padding strategy in convolution. To solve this problem, we clip patches of size l with stride s (s < l) to ensure overlap between adjacent patches. By discarding the edge areas of width (l − s)/2 in each patch, we are able to reconstruct a large image without block effects, as shown in Figure 13b.

As the proposed framework consists of two networks, our experiments take slightly longer to train and test the model than Zhang's method: it took 25 h to train the model and 117.9 s to predict the final results of Figure 9A, whereas Zhang's method took 21 h for fitting and 72.2 s for prediction. It is also notable that the prediction time of the proposed method depends only on the size of the testing images and the performance of the GPU, while the prediction time of Zhang's method is also influenced by the mask areas of each image. This suggests that the proposed model is more applicable to situations where the missing areas are large and higher accuracy is needed. We will further study existing reconstruction models with high time efficiency and try to shorten the running time of our model in the future.
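The overlapping patch-clipping and mosaicking strategy described above can be sketched as follows. This is a simplified single-band version; the function name is ours, `recover` stands in for the trained restoration model, and we assume l − s is even and the image dimensions are multiples of s.

```python
import numpy as np

def mosaic_patches(image, l, s, recover):
    """Recover a large image from overlapping l x l patches clipped with
    stride s (s < l), keeping only the central s x s core of each
    recovered patch and discarding a border of (l - s) / 2 pixels that
    is unreliable due to convolution padding."""
    h, w = image.shape
    b = (l - s) // 2
    padded = np.pad(image, b, mode="reflect")  # so edge patches exist
    out = np.empty_like(image, dtype=float)
    for i in range(0, h, s):
        for j in range(0, w, s):
            patch = padded[i:i + l, j:j + l]
            out[i:i + s, j:j + s] = recover(patch)[b:b + s, b:b + s]
    return out
```

With an identity `recover`, the mosaic reproduces the input exactly, confirming that the cores tile the image without block effects.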
The method proposed in this study aims to restore long time series of MODIS data that often have repetitive and overlapping clouds. The method needs high-quality images for the training sets, and it is difficult to find an area with completely cloud-free MODIS data throughout an entire time series, so restoring long time series of images remains challenging. To overcome this problem, we might first train a model on a relatively short time sequence, apply the model repeatedly to recover the cloudy areas for a given year, and then use the recovered yearly data as a training set to recover the cloudy data in the long time series. In this way, the method proposed in this study has the potential to reconstruct long time series of remote sensing data. Experiments on a broad application of the developed method are beyond the scope of this study and will be conducted in further studies. Additionally, we screened out poor-quality pixels based on the data-quality information provided in the MODIS product; this screening may also discard some high-quality pixels, so some useful information may be lost. If each image in the time series has a very large proportion of poor-quality observations, the developed method might not perform well, as the network does not receive correct information.
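The bootstrap strategy outlined above could be organized as in the following sketch. The `fit` and `apply_model` callables are hypothetical stand-ins for CSTG training and prediction, and the interface is entirely our own assumption, not the paper's implementation.

```python
def bootstrap_long_series(short_clear_series, yearly_series, fit, apply_model):
    """Train on a short, mostly cloud-free sequence, use that model to
    gap-fill each year, then retrain on the recovered years so the model
    can be applied across the long time series."""
    model = fit([short_clear_series])                      # stage 1: short sequence
    recovered = [apply_model(model, y) for y in yearly_series]
    model = fit(recovered)                                 # stage 2: retrain on gap-filled data
    return model, recovered
```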
In addition, the current network structure consists of both the content generation network and the sequence-texture generation network, and we aim to develop an end-to-end network in our future research.

Conclusions
As existing studies have difficulty processing time series of remote sensing images with missing data, we proposed a content-sequence-texture generation (CSTG) network to gap-fill time series of satellite images. The method combines a content generation network and a sequence-texture generation network to recover time series of remote sensing images with missing data. We evaluated the performance of CSTG using time series of MODIS data and compared it with four existing methods. The results indicate that CSTG has higher accuracy when restoring images with large and overlapping masks. Comparing the surface reflectance between the original and recovered images, CSTG achieves R², MAE, and SSIM of 0.954, 0.012, and 0.971, respectively, when applying small and dispersive masks, and 0.982, 0.009, and 0.926, respectively, when applying a large and intensive mask. The model is robust for time series of images that have overlapping cloudy areas. The method highlights the potential of deep learning methods for reconstructing remote sensing images to provide high-quality time series data for downstream applications.