Prediction of Short-Time Cloud Motion Using a Deep-Learning Model

Abstract: A cloud image can provide significant information, such as precipitation and solar irradiation. Predicting short-time cloud motion from images is the primary means of making intra-hour irradiation forecasts for solar-energy production and is also important for precipitation forecasts. However, it is very challenging to predict cloud motion (especially nonlinear motion) accurately. Traditional methods of cloud-motion prediction are based on block matching and the linear extrapolation of cloud features; they largely ignore nonstationary processes, such as inversion and deformation, and the boundary conditions of the prediction region. In this paper, the prediction of cloud motion is regarded as a spatiotemporal sequence-forecasting problem, for which an end-to-end deep-learning model is established; both the input and output are spatiotemporal sequences. The model is based on the gated recurrent unit (GRU)-recurrent convolutional network (RCN), a variant of the GRU that has convolutional structures to deal with spatiotemporal features. We further introduce surrounding context into the prediction task. We apply our proposed Multi-GRU-RCN model to FengYun-2G satellite infrared data and compare the results to those of the state-of-the-art method of cloud-motion prediction, the variational optical flow (VOF) method, and two well-known deep-learning models, namely, the convolutional long short-term memory (ConvLSTM) and GRU. The Multi-GRU-RCN model predicts intra-hour cloud motion better than the other methods, with the largest peak signal-to-noise ratio and structural similarity index. The results demonstrate the applicability of the GRU-RCN method for solving the spatiotemporal data prediction problem and indicate the advantages of our model for further applications.


Introduction
Recently, cloud-motion prediction has received significant attention because of its importance for the prediction of both precipitation and solar-energy availability [1]. Research has shown that the prediction of the short-time motion of clouds, especially of convective clouds, is important for precipitation forecasts [2][3][4][5][6][7]. Since most models of solar variability [8,9] and of solar irradiation [10][11][12] require cloud motion velocity as the main input, accurate cloud motion estimation is also essential for the intra-hour forecast of solar energy [13][14][15][16]. The difference between weather forecasts and solar forecasts is that the latter are usually conducted in a shorter time window (less than one hour). Otherwise, cloud-motion prediction is essentially similar in these two fields. Because the temperature of clouds is lower than that of the ground, clouds can be identified from infrared (IR) satellite images (with wavelengths of 10.5 to 12.5 µm) in which the intensity of IR radiation is correlated with focused on high-level deep CNN "visual percepts", the novel convolutional long short-term memory (ConvLSTM) network proposed by Shi et al. [41] has convolutional structures in both the input-to-state and state-to-state transitions to extract "visual percepts" for precipitation now-casting. Ballas et al. [42] extended this work and proposed a variant form of the gated recurrent unit (GRU). They captured spatial information using an RNN with convolutional operation and empirically validated their GRU-RCN model on a video classification task. GRU-RCN has fewer parameters than ConvLSTM.
Since both the input and output of a cloud-motion forecast are spatiotemporal sequences, cloud-motion prediction is a spatiotemporal-sequence forecast problem for which GRU-RCN would seem well suited. However, Ballas et al. [42] focused on video classification, which is quite different from our forecast problem. Given the input video data, the output of their model is a number that depends on the class of the video; in our problem, the output should have a spatial domain as well. We need to modify the structure of the GRU-RCN model and apply it directly on the pixel level.
Moreover, there exists another challenge in the cloud motion prediction problem: new clouds often appear suddenly, at the boundary. To overcome this challenge, our model includes information about the surrounding context in which each small portion of the cloud is embedded; this was not considered in previous methods.
In this paper, we suggest the use of deep-learning methods to capture nonstationary information regarding cloud motion and deal with nonrigid processes. We propose a multiscale-input end-to-end model with a GRU-RCN layer. The model takes the surrounding context into account, achieves precise localization, and extracts information from multiple scales of resolution. Using a database of FengYun-2G IR satellite images, we compare our model's intra-hour predictions to those of the state-of-the-art variational optical-flow (VOF) method and three deep-learning models (ConvLSTM, LSTM, and GRU); our model performs better than the other methods.
The remainder of this paper is organized as follows: Section 2 introduces the GRU-RCN model. Section 3 describes the data we used and the experiments we conducted. Section 4 presents the results and briefly describes the other methods with which the GRU-RCN model was compared. Section 5 discusses the advantages and disadvantages of our model and our plans for future work. Section 6 provides our concluding remarks.

Deep CNN
Deep CNNs [28] have been proven to extract and utilize image features effectively and have achieved great success in visual recognition tasks. Regular neural networks do not scale well to full images because, in the case of large images, the number of model parameters increases drastically, leading to low efficiency and rapid overfitting. The deep-CNN architecture avoids this drawback. The deep CNN contains a sequence of layers, typically a convolutional layer, a pooling layer, and a fully connected layer. In a deep CNN, the neurons in a given layer are not connected to all the neurons in the preceding layer but only to those in a kernel-size region of it. This architecture provides a certain amount of shift and distortion invariance.

GRU
A GRU [43] is a type of RNN. An RNN is implemented to process sequential data; it defines a recurrent hidden state, the activation of which depends on the previous state. Given a variable-length sequence $X = (x_1, x_2, \ldots, x_t)$, the hidden state $h_t$ of the RNN at each time step $t$ is updated by:
$$h_t = \phi(h_{t-1}, x_t),$$
where $\phi$ is a nonlinear activation function. An RNN can be trained to learn the probability distribution of sequences and thus to predict the next element in the sequence. At each time step $t$, the output can be represented as a probability distribution.
However, because of the vanishing-gradient and exploding-gradient problems, training an RNN becomes difficult when input/output sequences span long intervals [44]. Variant RNNs with complex activation functions, such as LSTMs and GRUs, have been proposed to overcome this problem. LSTMs and GRUs both perform well on machine-translation and video-captioning tasks, but a GRU has a simpler structure and lower memory requirement [45].
A GRU compels each recurrent unit to capture the dependencies of different timescales adaptively. The GRU model is defined by the following equations:
$$z_t = \sigma(W_z x_t + U_z h_{t-1}),$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1}),$$
$$\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1})),$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t,$$
where $\odot$ denotes element-wise multiplication; $W$, $W_z$, $W_r$ and $U$, $U_z$, $U_r$ are weight matrices; $x_t$ is the current input; $h_{t-1}$ is the previous hidden state; $z_t$ is an update gate; $r_t$ is a reset gate; $\sigma$ is the sigmoid function; $\tilde{h}_t$ is a candidate activation, which is computed similarly to that of the traditional recurrent unit in an RNN; and $h_t$ is the hidden state at time step $t$. The update gate determines the extent to which the hidden state is updated when the unit updates its contents, and the reset gate determines whether the previous hidden state is preserved. More specifically, when the value of the reset gate of a unit is close to zero, the information from the previous hidden state is discarded and the update is based exclusively on the current input of the sequence. By such a mechanism, the model can effectively ignore irrelevant information for future states. When the value of the reset gate is close to one, on the other hand, the unit remembers long-term information.
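As an illustration of the gating mechanism described above, the following minimal NumPy sketch performs one GRU update and runs it over a short random sequence. It assumes the common formulation with a tanh candidate activation and the convention h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t; the function name `gru_step`, the parameter dictionary, and the dimensions are hypothetical choices made for this example only.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU update. p is a dict of weight matrices (illustrative container)."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev)            # update gate
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev)            # reset gate
    h_cand = np.tanh(p["W"] @ x_t + p["U"] @ (r * h_prev))   # candidate activation
    return (1.0 - z) * h_prev + z * h_cand                   # new hidden state

# Illustrative sizes and random weights (assumptions, not values from the paper).
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
params = {k: rng.normal(scale=0.1, size=(d_h, d_in)) for k in ("W", "Wz", "Wr")}
params.update({k: rng.normal(scale=0.1, size=(d_h, d_h)) for k in ("U", "Uz", "Ur")})

h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):   # a sequence of five input vectors
    h = gru_step(x, h, params)
```

Because the new state is a gate-weighted average of the previous state and a tanh output, every component of `h` stays strictly between −1 and 1 when the initial state is zero.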

GRU-RCN
In this section, we will introduce the GRU-RCN layer utilized in our model. A GRU converts input into a hidden state by fully connected units, but this can lead to an excessive number of parameters. In cloud imaging, the inputs of satellite images are 3-D tensors formed from the spatial dimensions and input channels. We regard the inputs as a sequence $X = (x_1, x_2, \ldots, x_t)$; the size of the hidden state should be the same as that of the input. Let $H$, $Wid$, and $C$ be the height, width, and number of channels of the input at every time step, respectively. If we apply a GRU to the inputs directly, the size of both the weight matrix $W$ and the weight matrix $U$ should be $H \times Wid \times C$.
Images are composed of patterns with strong local correlation that are repeated at different spatial locations. Moreover, satellite images vary smoothly over time: the position of a tracked cloud in successive images will be restricted to a local spatial neighborhood. Ballas et al. [42] embedded convolution operations into the GRU architecture and proposed the GRU-RCN model. In this way, recurrent units have sparse connectivity and can share their parameters across different input spatial locations. The structure of the GRU-RCN model is expressed in the following equations:
$$z^l_t = \sigma(W^l_z * x^l_t + U^l_z * h^l_{t-1}),$$
$$r^l_t = \sigma(W^l_r * x^l_t + U^l_r * h^l_{t-1}),$$
$$\tilde{h}^l_t = \tanh(W^l * x^l_t + U^l * (r^l_t \odot h^l_{t-1})),$$
$$h^l_t = (1 - z^l_t) \odot h^l_{t-1} + z^l_t \odot \tilde{h}^l_t,$$
where $*$ denotes convolution and the superscript $l$ denotes the layer of the GRU-RCN; the weight matrices $W^l$, $W^l_z$, $W^l_r$ and $U^l$, $U^l_z$, $U^l_r$ are 2-D convolutional kernels; and $h^l_t = h^l_t(i, j)$, where $h^l_t(i, j)$ is a feature vector defined at the location $(i, j)$.
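To make the convolutional recurrence concrete, the sketch below implements one GRU-RCN-style update for a single layer in plain NumPy. It is only an illustration under stated assumptions: the helper `conv2d_same`, the function `conv_gru_step`, the channel counts, and the random kernels are hypothetical, the gates follow the standard GRU form with every matrix product replaced by a zero-padded 3 × 3 convolution, and biases are omitted.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conv2d_same(x, k):
    """Zero-padded 'same' 2-D convolution (cross-correlation form, odd kernels).
    x: (H, W, C_in), k: (K1, K2, C_in, C_out) -> (H, W, C_out)."""
    K1, K2, _, c_out = k.shape
    H, W, _ = x.shape
    p1, p2 = K1 // 2, K2 // 2
    xp = np.pad(x, ((p1, p1), (p2, p2), (0, 0)))
    out = np.zeros((H, W, c_out))
    for a in range(K1):
        for b in range(K2):
            # Shifted view of the padded input times the kernel slice at (a, b).
            out += xp[a:a + H, b:b + W, :] @ k[a, b]
    return out

def conv_gru_step(x_t, h_prev, p):
    """One convolutional GRU update; the spatial size of the state is preserved."""
    z = sigmoid(conv2d_same(x_t, p["Wz"]) + conv2d_same(h_prev, p["Uz"]))
    r = sigmoid(conv2d_same(x_t, p["Wr"]) + conv2d_same(h_prev, p["Ur"]))
    h_cand = np.tanh(conv2d_same(x_t, p["W"]) + conv2d_same(r * h_prev, p["U"]))
    return (1.0 - z) * h_prev + z * h_cand

# Illustrative sizes: an 8 x 8 single-channel input and a 2-channel hidden state.
rng = np.random.default_rng(0)
c_in, c_h = 1, 2
params = {k: rng.normal(scale=0.1, size=(3, 3, c_in, c_h)) for k in ("W", "Wz", "Wr")}
params.update({k: rng.normal(scale=0.1, size=(3, 3, c_h, c_h)) for k in ("U", "Uz", "Ur")})
x = rng.random((8, 8, c_in))
h_prev = rng.normal(scale=0.1, size=(8, 8, c_h))
h_next = conv_gru_step(x, h_prev, params)
```

Each kernel here holds 3 × 3 × C_in × C_out weights regardless of the frame size, which is the parameter sharing that the convolutional formulation is meant to provide.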
With convolution, the sizes of $W^l$, $W^l_z$, $W^l_r$ and $U^l$, $U^l_z$, $U^l_r$ are all $K_1 \times K_2 \times C$, where $K_1 \times K_2$ is the convolutional-kernel spatial size (chosen in this paper to be 3 × 3), significantly lower than that of the input frame, $H \times Wid$. Furthermore, this method preserves spatial information, and we use zero padding in the convolution operation to ensure that the spatial size of the hidden state remains constant over time. The candidate hidden representation $\tilde{h}^l_t(i, j)$, the update gate $z^l_t(i, j)$, and the reset gate $r^l_t(i, j)$ are defined based on a local neighborhood of size $K_1 \times K_2$ at the location $(i, j)$ in both the input data $x^l_t$ and the previous hidden state $h^l_{t-1}$. In addition, the size of the receptive field associated with $h^l_t(i, j)$ increases with every previous time step $h^l_{t-1}, h^l_{t-2}, \ldots$ as we go back in time. The model implemented in this paper is, therefore, capable of characterizing the spatiotemporal pattern of cloud motion with high spatial variation over time.
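The growth of the receptive field can be illustrated with a short calculation. The sketch below assumes a single recurrent layer with square k × k kernels and zero padding; the function name `receptive_field` is hypothetical. Under these assumptions, the hidden state at a location sees a k × k window of the current input, and each step back in time widens that window by k − 1 pixels per dimension.

```python
def receptive_field(steps_back, k=3):
    """Side length of the input window at time t - steps_back that can
    influence h_t(i, j) in a single convolutional recurrent layer."""
    return k + steps_back * (k - 1)

# Window sizes for the current frame and the five preceding frames (k = 3).
sizes = [receptive_field(s) for s in range(6)]
```

With 3 × 3 kernels, the window therefore grows from 3 × 3 for the current frame to 13 × 13 for the frame five steps earlier.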

Multi-GRU-RCN Model
In this section, we will introduce the model structure of Multi-GRU-RCN. Ballas et al. [42] focused on the problem of video classification and therefore implemented a VGG16 model structure in their paper. However, this does not fit our problem well: we need to operate on the pixel level directly. The model structure of Shi et al. [41] consists of an encoding network as well as a forecasting network, and both networks are formed by stacking several ConvLSTM layers. In their model, there is a single input, and the input and output data have the same dimension. We modified this model structure and proposed a new one, which can extract information from the surrounding context. The model structure is presented in Figure 1. There are multiple inputs in this model, and the input from each small region has the same center as the input from a larger region. The input from the small region has the same dimension as the output, while the input from the large region has four times as much area. We consider the region that is included in the large region but excluded in the small region as the surrounding context. The purpose of utilizing multiple inputs from different regions is to enrich the information with the surrounding context. Like the model of Shi et al. [41], our model consists of an encoding part and a forecasting part. In addition to stacked GRU-RCN layers, batch normalization [46] was introduced into both the encoding and forecasting parts to accelerate the training process and avoid overfitting. When utilizing input from a large region, we used a max pooling layer to reduce the dimension and improve the ability of the model to capture invariance information of the object in the image. The initial states of the forecasting part are copied from the final state of the encoding part. We concatenate the output states of the two inputs and subsequently feed this into a 1 × 1 convolutional layer with ReLU activation to obtain the final prediction results.
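The final fusion step described above can be sketched as follows. This is a simplified illustration, not the authors' network: it assumes 2 × 2 max pooling (which maps a 128 × 128 state to 64 × 64), treats the two branch outputs as plain arrays, and writes the 1 × 1 convolution as a per-pixel linear map over channels. The names `max_pool_2x2` and `fuse`, the channel counts, and the weights are hypothetical, and where exactly the pooling sits in the real model is not specified in the text.

```python
import numpy as np

def max_pool_2x2(x):
    """2 x 2 max pooling with stride 2 on an (H, W, C) array (H and W even)."""
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

def fuse(small_state, large_state, w, b):
    """Concatenate the two branches along channels, then apply a
    1 x 1 convolution (per-pixel linear map) followed by ReLU."""
    pooled = max_pool_2x2(large_state)                    # (128,128,C) -> (64,64,C)
    stacked = np.concatenate([small_state, pooled], axis=-1)
    return np.maximum(stacked @ w + b, 0.0)

# Illustrative shapes: two-channel states from each branch, one output channel.
rng = np.random.default_rng(0)
small_state = rng.random((64, 64, 2))
large_state = rng.random((128, 128, 2))
w = rng.normal(scale=0.1, size=(4, 1))
b = np.zeros(1)
prediction = fuse(small_state, large_state, w, b)         # shape (64, 64, 1)
```

The output has the same spatial size as the small-region input, matching the statement that the small input and the output share the same dimension.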

Experimental Setup
In this paper, we used a set of satellite data from 2018. For computational convenience, we first normalized the data to the range 0 to 1. We randomly selected data from 200 days as training data, data from 20 days as validation data, and data from another 20 days as test data. Because each image was too large (512 × 512 pixels) for training, we divided it into small patches, and set the patch sizes at 64 × 64 pixels and 128 × 128 pixels. Every 64 × 64 patch was paired with a 128 × 128 patch (the 64 × 64 patch being in the center region of the 128 × 128 patch). Thus, each 512 × 512 frame was divided into 16 pairs of patches. The patches were at the same location but across consecutive time steps. Because the average velocity of cloud motion in the training dataset is 14.35 m/s according to the FY-2G Atmospheric Motion Vector data, about 81.05% of the pixels in each patch can be tracked in the next time step's patch at the same location.
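The patch pairing can be sketched as follows. The sketch assumes that the 128 × 128 patches tile the frame on a nonoverlapping 4 × 4 grid, which is one layout consistent with the 16 pairs per frame stated above; the function name `patch_pairs` and the synthetic frame are hypothetical.

```python
import numpy as np

def patch_pairs(frame, large=128, small=64):
    """Split a frame into (small, large) patch pairs, each small patch being
    the central region of its large patch."""
    off = (large - small) // 2
    pairs = []
    for i in range(0, frame.shape[0] - large + 1, large):
        for j in range(0, frame.shape[1] - large + 1, large):
            big = frame[i:i + large, j:j + large]
            centre = big[off:off + small, off:off + small]
            pairs.append((centre, big))
    return pairs

# Synthetic 512 x 512 frame standing in for one normalized satellite image.
frame = np.arange(512 * 512, dtype=np.float32).reshape(512, 512)
pairs = patch_pairs(frame)   # 16 pairs of (64 x 64, 128 x 128) patches
```

Applying the same function to every frame of a sequence yields patches at the same location across consecutive time steps.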
The data instances were partitioned into nonoverlapping sequences, each of which is n frames long. For each sequence, we used the first n − 1 frames as input to predict the last frame. For example, the first sequence uses frames 1 to n − 1 to predict frame n, and the second sequence uses frames n + 1 to 2n − 1 to predict frame 2n. To select the exact value of n, we designed a pretraining process. For each value of n between 2 and 10, we set the batch size to 32 and randomly selected 100 batches from the training dataset. We trained the model using these batches of data and computed the average mean squared error (MSE) among all batches and the running time. The results are presented in Figure 2.
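The partitioning scheme can be written as a short function. This is a minimal sketch in which frames are represented by their 1-based indices; the name `split_sequences` is hypothetical, and leftover frames that do not fill a complete sequence are assumed to be dropped.

```python
def split_sequences(frames, n):
    """Cut a list of frames into nonoverlapping length-n sequences and return
    (inputs, target) pairs: the first n - 1 frames predict the last one."""
    seqs = []
    for start in range(0, len(frames) - n + 1, n):
        chunk = frames[start:start + n]
        seqs.append((chunk[:-1], chunk[-1]))
    return seqs

# Frame indices for one day of hourly images, split with n = 6.
day = list(range(1, 25))
sequences = split_sequences(day, n=6)
```

For n = 6, the first pair uses frames 1 to 5 to predict frame 6, and the second uses frames 7 to 11 to predict frame 12.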
Atmosphere 2020, 11, x FOR PEER REVIEW
As seen in Figure 2, the running time increases almost linearly with the increase in n. The average MSE evidently falls as n increases from 2 to 6 but thereafter remains almost constant. The minimum MSE (i.e., the minimum batch MSE among all batches), however, is not very sensitive to the value of n. This indicates that when n is larger than 6, the running time increases as n increases, but this does not lead to a reduction in the MSE. Therefore, the value n = 6 was chosen for the experiments reported in this paper. The satellite collected data every hour; therefore, there were 12,800 cases in the training dataset, 1280 cases in the validation dataset, and 1280 cases in the test dataset.
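The stated dataset sizes can be checked with simple arithmetic. The calculation below assumes that every day contributes 24 hourly frames, that sequences do not span days, and that each frame yields 16 patch pairs; under these assumptions it reproduces the reported case counts. The function name `n_cases` is hypothetical.

```python
FRAMES_PER_DAY = 24      # hourly images (assumes no missing frames)
SEQUENCE_LENGTH = 6      # n chosen from the pretraining experiment
PAIRS_PER_FRAME = 16     # patch pairs per 512 x 512 frame

def n_cases(days):
    """Number of training cases contributed by a given number of days."""
    sequences_per_day = FRAMES_PER_DAY // SEQUENCE_LENGTH
    return days * sequences_per_day * PAIRS_PER_FRAME

train_cases = n_cases(200)   # 200 * 4 * 16 = 12800
val_cases = n_cases(20)      # 20 * 4 * 16 = 1280
```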
The outline of the GRU-RCN layer is shown in Figure 3. The output of the encoder is passed to the decoder to produce the final output. We trained the GRU-RCN model by minimizing the MSE loss using backpropagation through time (BPTT) with a learning rate of $10^{-3}$. The kernel size of the convolutional structure in our GRU-RCN layer was set to 3 × 3.

Test Benchmark
The performance of the proposed method was determined by two metrics: the peak signal-to-noise ratio (PSNR) [47] and structural similarity (SSIM) index [48]. PSNR is a widely used metric for evaluating the accuracy of algorithms. This metric indicates the reconstruction quality of the method used. The observed value at the prediction time step is not of practical relevance. Information regarding future events is not involved in the generation of the forecast satellite image; however, it still serves as a useful benchmark. The signal here was taken to be the observed value, whereas the noise was the error of the forecast image. The PSNR between the observed image $f$ and predicted image $g$ was defined as
$$\mathrm{PSNR}(f, g) = 10 \log_{10}\left(\frac{I_{\max}^2}{\mathrm{MSE}(f, g)}\right),$$
where $I_{\max} = 2^8 - 1$ is the maximum pixel intensity. $\mathrm{MSE}(f, g)$ is the mean squared error between the observed and predicted image, defined as:
$$\mathrm{MSE}(f, g) = \frac{1}{N} \sum_{i=1}^{N} (f_i - g_i)^2,$$
where $N$ is the number of pixels in the satellite image. For a smaller MSE, the PSNR will be larger, and, therefore, the algorithm accuracy will be higher. SSIM is a quality assessment method used to measure the similarity between two images. It was proposed under the assumption that the quality perception of the human visual system (HVS) is correlated with structural information of the scene. Therefore, it considers image degradations as perceived changes in structural information, while PSNR estimates image degradations based on error sensitivity. The structural information is decomposed into three components: luminance, contrast, and structure. The SSIM between $f$ and $g$ is defined as:
$$\mathrm{SSIM}(f, g) = l(f, g)\, c(f, g)\, s(f, g),$$
where $l(f, g)$, $c(f, g)$, and $s(f, g)$ are the luminance comparison, contrast comparison, and structure comparison between $f$ and $g$, respectively:
$$l(f, g) = \frac{2 \mu_f \mu_g + c_1}{\mu_f^2 + \mu_g^2 + c_1}, \quad c(f, g) = \frac{2 \sigma_f \sigma_g + c_2}{\sigma_f^2 + \sigma_g^2 + c_2}, \quad s(f, g) = \frac{\sigma_{fg} + c_3}{\sigma_f \sigma_g + c_3},$$
where $\mu_f$ and $\mu_g$ are the averages of $f$ and $g$, $\sigma_f^2$ and $\sigma_g^2$ are the variances of $f$ and $g$, $\sigma_{fg}$ is the covariance of $f$ and $g$, and $c_1$, $c_2$, and $c_3$ are positive constants that stabilize the division when a denominator is close to zero. In addition, we considered the mean bias error (MBE) as a supplementary metric.
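The two main metrics can be computed as in the sketch below. It is an illustration under stated assumptions rather than the evaluation code used in this paper: SSIM is computed from global image statistics instead of local windows, and the constants follow commonly used defaults (c1 = (0.01 I_max)², c2 = (0.03 I_max)², c3 = c2 / 2), which are not given in the text. The function names are hypothetical.

```python
import numpy as np

def mse(f, g):
    f, g = np.asarray(f, dtype=np.float64), np.asarray(g, dtype=np.float64)
    return np.mean((f - g) ** 2)

def psnr(f, g, i_max=255.0):
    """Peak signal-to-noise ratio in dB (undefined when the images are identical)."""
    return 10.0 * np.log10(i_max ** 2 / mse(f, g))

def ssim_global(f, g, i_max=255.0, k1=0.01, k2=0.03):
    """SSIM as the product of luminance, contrast, and structure terms,
    using whole-image statistics (assumed simplification)."""
    f, g = np.asarray(f, dtype=np.float64), np.asarray(g, dtype=np.float64)
    c1, c2 = (k1 * i_max) ** 2, (k2 * i_max) ** 2
    c3 = c2 / 2.0
    mu_f, mu_g = f.mean(), g.mean()
    sd_f, sd_g = f.std(), g.std()
    cov = np.mean((f - mu_f) * (g - mu_g))
    lum = (2 * mu_f * mu_g + c1) / (mu_f ** 2 + mu_g ** 2 + c1)
    con = (2 * sd_f * sd_g + c2) / (sd_f ** 2 + sd_g ** 2 + c2)
    struct = (cov + c3) / (sd_f * sd_g + c3)
    return lum * con * struct

# Synthetic example: a random "observed" patch and a noisy "prediction".
rng = np.random.default_rng(0)
observed = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
predicted = observed + rng.normal(scale=5.0, size=observed.shape)
scores = (psnr(observed, predicted), ssim_global(observed, predicted))
```

Windowed SSIM implementations, which average the index over local neighborhoods, generally give different values from this global variant.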
Although the MBE cannot indicate the reliability of a model, because errors often compensate for each other, it shows the degree to which the method underestimates or overestimates the results. To exhibit this degree more intuitively, the MBE was calculated as a percentage, and the MBE between f and g was defined as

Data
FY-2, the first Chinese geostationary meteorological satellite, was launched on 31 December 2014, and positioned above the equator at 105° E. With the Stretched Visible and Infrared Spin Scan Radiometer (S-VISSR) on board, FY-2 can observe the Earth's atmosphere with high temporal and spatial resolutions. The IR1 channel (10.3~11.3 µm) China land-area images are obtained hourly for the spatial range 50° E~160° E, 4° N~65° N. The size of each image is 512 × 512 pixels, and the spatial resolution of each pixel is 13 km in both the north-south and east-west directions. The intensity of the pixels is 0~255. The relationship between the intensity count and brightness temperature is negative but not linear. An image instance of FY-2 is depicted in Figure 4.

Results and Analysis
One epoch is one training cycle through the entire training dataset. The models described in the previous sections were trained on the training dataset for 50 epochs and evaluated on the validation dataset after every epoch. The MSE loss is presented in Figure 5. In Figure 5, it is apparent that the MSE loss declined dramatically in the first 10 epochs; thereafter, the rate of decline gradually decreased, and the MSE loss eventually converged to a lower level. After more than 40 epochs of training, the loss was small relative to that within the first 10 epochs. Despite the fluctuation of the validation loss, its overall trend continued to decline, which indicates that the model was not overfitting. Thus, the training procedure was effective and converged to a satisfactory result.
We then randomly selected 20 days from 2018 as the test dataset, on which we compared our method (Multi-GRU-RCN) with the VOF technique, ConvLSTM, LSTM, and GRU. For the VOF algorithm, we used the method of Chow et al. [26], which minimizes the objective function by using brightness constancy and global smoothness as model assumptions to realize VOF. We set the size of the input patches as 128 × 128 and the size of the output patches as 64 × 64 to produce results comparable with those of Multi-GRU-RCN. For ConvLSTM, we adopted the model structure of Shi et al. [41], setting the kernel size at 3 × 3 for convolution. The input frame had the same size as the output frame. For LSTM and GRU, we used five frames to predict the next frame. Because LSTM and GRU cannot extract spatial information, we treated every pixel in a frame as an independent sample; thus, there were 4096 samples in a frame. All the experiments were carried out on an NVIDIA Tesla T4 GPU. It takes 7.78 h to train the GRU model, 9.44 …
According to Figures 6 and 7, the forecast results of these methods were consistent in terms of each metric. For instance, the PSNRs and SSIMs of the five methods were the highest on 2 February, which means that all five methods performed the best on the data of that day. Based on the forecast results, the Multi-GRU-RCN method consistently outperformed the other four methods throughout the test period.
The VOF was the worst-performing method on the test data. Multi-GRU-RCN and ConvLSTM had quite similar performance in terms of both MSE and PSNR values, but Multi-GRU-RCN performed slightly better. The MSE of Multi-GRU-RCN forecasts on the test dataset was 72.93, which means that the root-mean-square intensity difference per pixel between ground truth and prediction was 8.54, the square root of the MSE (a satisfactory result, given that the intensity range was 0~255).
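The relation between the reported MSE and the per-pixel intensity difference is a square root, as the short check below shows. Expressing the result as a fraction of the 0~255 intensity range is an addition made here for illustration only.

```python
import math

mse_test = 72.93                    # reported test-set MSE of Multi-GRU-RCN
rmse = math.sqrt(mse_test)          # root-mean-square error, about 8.54
fraction_of_range = rmse / 255.0    # roughly 3% of the 0~255 intensity range
```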
To show the results more intuitively, we randomly picked three input sequences from the test dataset: May 2 between 0 and 5 am, January 31 between 6 and 11 pm, and July 7 between 6 and 11 am. The PSNR values of the Multi-GRU-RCN predictions are consistently larger than those of the other methods, which indicates that its predictions are more accurate. This result also agrees with the difference between the ground truth and prediction. Even though the predictions by VOF have sharper outlines, the predictions by Multi-GRU-RCN have better accuracy. When a cloud appears at the edge of the prediction domain, Multi-GRU-RCN can predict it better than VOF. This indicates that some of the complex spatiotemporal patterns in the dataset can be learned by the nonlinear and convolutional structure of the network. The model also performs well at predicting nonstationary processes, such as inversion and deformation, whereas VOF does not: in the VOF prediction for such situations, an abrupt change of intensity between adjacent pixels occurs at the bottom of the image. Multi-GRU-RCN gives a better prediction result without a blocky appearance.

Figure: Prediction examples for the three test sequences, shown in panels (a), (b), and (c), respectively. In each panel, the first horizontal row shows the input sequence and the ground truth; the second horizontal row displays the predictions of five different methods; and the third horizontal row shows the difference between the prediction by each of the five methods and the ground truth.

Discussion
The relationship of GRU, LSTM, ConvLSTM, GRU-RCN, and Multi-GRU-RCN is illustrated in Figure 9. GRU simplifies LSTM by replacing the forget gate and input gate with a single update gate and by combining the cell state and hidden state. ConvLSTM and GRU-RCN embed convolutional operations in the recurrent unit to handle spatiotemporal data, and GRU-RCN has fewer parameters than ConvLSTM. As the GRU-RCN model structure proposed by Ballas et al. [42] focused on the video classification problem, we changed the model structure to adapt it to the pixelwise cloud-motion prediction problem. We took the ConvLSTM model structure proposed by Shi et al. [41] as a reference and replaced the ConvLSTM layer with the GRU-RCN layer. In addition, the surrounding context was introduced into our model to enrich the input information. In predicting cloud motion, both temporal and spatial information provide important clues. Temporally, the current frame correlates with the previous frame; spatially, the intensity of a given pixel correlates with those of the surrounding pixels. A GRU captures temporal information but ignores spatial information; therefore, it underperforms when compared to the ConvLSTM model, which captures both.
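To make the idea of a GRU with convolutional transitions concrete, the following is a minimal single-channel sketch of one recurrent step. It is not the implementation used in this work: the kernel names (`W_z`, `U_z`, and so on), the 3×3 kernel size, the omission of biases, and the toy 8×8 frames are all assumptions made for illustration.

```python
import numpy as np

def conv2d_same(image, kernel):
    """Single-channel 2-D cross-correlation with zero 'same' padding (odd kernels)."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(image.shape, dtype=np.float64)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + image.shape[0], j:j + image.shape[1]]
    return out

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_rcn_step(x, h_prev, params):
    """One recurrent step; params holds six 2-D kernels (hypothetical names)."""
    # Update gate: how much of the hidden state is overwritten.
    z = sigmoid(conv2d_same(x, params["W_z"]) + conv2d_same(h_prev, params["U_z"]))
    # Reset gate: how much of the previous state feeds the candidate.
    r = sigmoid(conv2d_same(x, params["W_r"]) + conv2d_same(h_prev, params["U_r"]))
    # Candidate hidden state computed from the input and the gated previous state.
    h_candidate = np.tanh(
        conv2d_same(x, params["W_h"]) + conv2d_same(r * h_prev, params["U_h"])
    )
    return (1.0 - z) * h_prev + z * h_candidate

rng = np.random.default_rng(0)
kernel_names = ("W_z", "U_z", "W_r", "U_r", "W_h", "U_h")
params = {name: 0.1 * rng.standard_normal((3, 3)) for name in kernel_names}
frames = rng.random((5, 8, 8))   # toy sequence: 5 frames of 8x8 pixels
h = np.zeros((8, 8))             # initial hidden state
for frame in frames:
    h = gru_rcn_step(frame, h, params)
```

The sketch uses three pairs of kernels (update gate, reset gate, candidate), whereas an analogous convolutional LSTM cell needs four pairs (input, forget, and output gates plus the candidate), which is consistent with GRU-RCN having fewer parameters.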
However, in the ConvLSTM model, the input frame has the same shape as the output frame: as a result of the convolutional operation and the same-padding method, boundary information is lost. In addition, the movement of clouds is very complicated and cannot be determined by looking at the current region exclusively; more information must be brought into the model. To improve prediction accuracy, especially in the boundary region, we incorporated the surrounding context into our new end-to-end model.
The performance improvement of Multi-GRU-RCN is also attributable to the model structure. For instance, in the main experiment, we set the large region as the input and the small region as the output for the VOF algorithm, and we also conducted a control experiment with the small region as both the input and output. The average PSNR and SSIM on the test dataset in the control experiment are 22.69 and 0.41, respectively, which indicates that introducing the large region achieves a performance gain of only 1.28% and 1.22%, respectively, with the VOF algorithm. Regarding the model structure, we used max-pooling layers for dimension reduction, which improves the model's ability to capture features that remain invariant as the cloud moves and to fuse features from different scales. In addition, the activation functions introduce nonlinearity into the model [49]; stacking them allows the model to learn sophisticated patterns. The essential advantage of the end-to-end structure is that all the parameters of the model can be trained simultaneously, making the training process more effective. The predictions of our model have consistently higher PSNR and SSIM than those of the other methods. The spatial and temporal patterns learned by the model from the region of interest provide the fundamentals of cloud motion, and the use of external information from outside the region of interest enriches the model's understanding of the surrounding environment. This illustrates that utilizing information from both the internal and external regions reveals a more accurate pattern of cloud motion.
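The boundary effect of same padding can be illustrated with a small numerical sketch. It is not the Multi-GRU-RCN architecture: the 12×12 and 10×10 region sizes, the 3×3 averaging kernel, and the helper `correlate_valid` are assumptions chosen only to show that zero padding replaces real neighbours at the border, while a larger input region supplies them.

```python
import numpy as np

def correlate_valid(image, kernel):
    """Single-channel 2-D cross-correlation without padding ('valid')."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * image[i:i + out_h, j:j + out_w]
    return out

rng = np.random.default_rng(0)
large = rng.random((12, 12))                    # region of interest plus surrounding context
margin = 1                                      # context width needed by a 3x3 kernel
small = large[margin:-margin, margin:-margin]   # 10x10 region of interest only
kernel = np.full((3, 3), 1.0 / 9.0)             # simple averaging kernel

# (a) Small region only: zero 'same' padding invents values outside the border.
same_padded = correlate_valid(np.pad(small, margin), kernel)
# (b) Large region: every output pixel is computed from real neighbours.
with_context = correlate_valid(large, kernel)

difference = np.abs(same_padded - with_context)
interior_error = difference[1:-1, 1:-1].max()   # pixels away from the border
edge_error = difference[0, :].max()             # pixels on the top border
```

Both variants produce a 10×10 output and agree in the interior; they differ only along the border, where the padded version has no access to what lies outside the region of interest.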
There are three possible explanations for the better performance of Multi-GRU-RCN over the VOF algorithm. First, Multi-GRU-RCN can learn complex patterns during the training process. Clouds often seem to appear instantaneously, indicating that they either enter from outside the prediction domain or form suddenly. If similar situations occur in the training dataset, Multi-GRU-RCN can learn these patterns during training and subsequently provide reasonable predictions on the test dataset, whereas the VOF algorithm cannot detect them. Second, Multi-GRU-RCN is trained end-to-end for this task. The VOF algorithm is not an end-to-end model, and it is difficult to find a reasonable way to update the future flow fields. Finally, Multi-GRU-RCN produces smooth predictions, whereas the predictions of VOF have a blocky appearance whenever there are abrupt changes in the motion vectors and therefore in the intensity between adjacent pixels.
Although the proposed Multi-GRU-RCN achieves promising intra-hour cloud-motion prediction, the model still has limitations. Compared with the VOF algorithm, Multi-GRU-RCN produces blurry predictions. This property is associated with the MSE loss used to train the model. The future of a satellite cloud image is uncertain and by nature multimodal. When there are multiple valid outcomes with equal probability, the MSE loss accommodates the uncertainty by averaging all the possible outcomes, which results in a blurry prediction. Generative adversarial networks (GANs) have emerged as a powerful alternative for enhancing prediction sharpness. In future work, we will combine the MSE loss with an adversarial training loss to improve the visual quality of the predictions. In addition, limited by the number of layers in the architecture, the model cannot completely eliminate the influence of interference, such as complex surface conditions. Li et al. [50] proposed a multi-scale convolutional feature fusion method for cloud detection. Their research confirmed that the use of dilated convolutional layers and the fusion of shallow appearance information with deep semantic information help to improve interference tolerance.
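The averaging behaviour of the MSE loss can be seen in a toy one-dimensional example. The two intensity profiles below are invented for illustration (they are not satellite data), and the helper `expected_mse` is a hypothetical name; the sketch only shows that, with two equally likely sharp outcomes, the pixelwise mean attains a lower expected MSE than either sharp outcome.

```python
import numpy as np

# Two equally likely futures for a 1-D intensity profile: a sharp cloud edge
# located at pixel 3 or at pixel 5 (toy values, assumed for illustration).
future_a = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
future_b = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0])

def expected_mse(prediction):
    """Expected MSE when each future occurs with probability 0.5."""
    error_a = np.mean((prediction - future_a) ** 2)
    error_b = np.mean((prediction - future_b) ** 2)
    return 0.5 * error_a + 0.5 * error_b

# Pixelwise mean of the two futures: intermediate values where they disagree.
blurred = 0.5 * (future_a + future_b)

loss_sharp_a = expected_mse(future_a)   # commit to the first sharp outcome
loss_sharp_b = expected_mse(future_b)   # commit to the second sharp outcome
loss_blurred = expected_mse(blurred)    # hedge between the two
```

Here the blurred profile halves the expected loss relative to either sharp prediction, so a model trained with MSE is pushed toward the smeared edge rather than toward one of the plausible sharp ones.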
In this paper, the forecasting range was one hour. Extending the forecast time will change the output from one frame to a sequence of frames. A weakness of the encoder-decoder architecture is that it lacks alignment between the input and output sequences. Bahdanau et al. [51] proposed an attention mechanism that uses a context vector to align the source and target sequences. The context vector preserves information from all hidden states in the encoder cells and aligns them with the current target output. The attention mechanism allows the decoder to "attend" to different parts of the source sentence at each step of output generation; this concept has revolutionized the field. Introducing the attention mechanism could address the issues that arise in longer-horizon prediction. Furthermore, we plan to incorporate more data sources to enrich the information in the dataset and to introduce data fusion techniques into the model to improve its accuracy. Combining our current research with the precipitation forecast problem also merits further research.
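The context-vector idea can be sketched as additive attention for a single decoder step. This is an illustrative sketch, not part of the present model: hidden states are treated as flat vectors rather than feature maps, and the weight names (`W_s`, `W_h`, `v`) and all dimensions are assumptions.

```python
import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum()

def additive_attention(decoder_state, encoder_states, W_s, W_h, v):
    """Return (context, weights) for one decoder step.

    decoder_state: (d_s,), encoder_states: (T, d_h),
    W_s: (d_a, d_s), W_h: (d_a, d_h), v: (d_a,).
    """
    # Alignment score e_i = v^T tanh(W_s s + W_h h_i) for each encoder step i.
    scores = np.tanh(decoder_state @ W_s.T + encoder_states @ W_h.T) @ v
    weights = softmax(scores)            # attention weights, sum to 1
    context = weights @ encoder_states   # weighted sum of encoder hidden states
    return context, weights

rng = np.random.default_rng(0)
T, d_h, d_s, d_a = 6, 4, 4, 5            # toy sizes (hypothetical)
encoder_states = rng.standard_normal((T, d_h))
decoder_state = rng.standard_normal(d_s)
W_s = rng.standard_normal((d_a, d_s))
W_h = rng.standard_normal((d_a, d_h))
v = rng.standard_normal(d_a)
context, weights = additive_attention(decoder_state, encoder_states, W_s, W_h, v)
```

Because the weights are recomputed at every output step, each predicted frame could draw on a different mixture of the input frames instead of relying on a single fixed summary of the input sequence.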

Conclusions
In this paper, we introduced deep-learning methods into the field of cloud-motion prediction. This work is innovative, since traditional methods for cloud-motion prediction are mostly based on block matching and linear extrapolation, neglecting the nonstationary processes that occur during cloud movement. By formulating cloud-motion prediction as a spatiotemporal data prediction problem, we developed an end-to-end model based on GRU-RCN. Inclusion of the surrounding context enriched the information used. We tested the model's applicability on cloud images from the FY-2G dataset for intra-hour prediction. Despite the relatively simple structure of our model, it can learn complex patterns through the nonlinear and convolutional structure of the network and works well when predicting the movement of clouds on a short timescale. This provides another example of the applicability of the GRU-RCN method in dealing with spatiotemporal data and learning complex patterns in images.