CostNet: A Concise Overpass Spatiotemporal Network for Predictive Learning

Predicting the future from previous spatiotemporal data remains a challenging topic. There have been many previous works on predictive learning. However, mainstream models suffer from huge memory usage or the vanishing gradient problem. Inspired by ResNet, we propose CostNet, a novel recurrent neural network (RNN)-based network with horizontal and vertical cross-connections. The core of this network is a concise unit, named Horizon LSTM, with a fast gradient transmission channel; it extracts spatial and temporal representations effectively and alleviates the gradient propagation difficulty. In the vertical direction outside of the unit, we add overpass connections from the unit output to the bottom layer, which capture the short-term dynamics to generate precise predictions. Our model achieves better prediction results on the Moving MNIST and radar datasets than the state-of-the-art models.

The vanishing gradient problem is a persistent difficulty when training artificial neural networks for predictive learning. It causes poor long-term prediction because, as error signals shrink during back-propagation, gradient-based learning takes much longer to capture long-range dependencies. Addressing the vanishing gradient problem is therefore a key issue in spatiotemporal predictive learning.
There have been many previous studies on predictive learning, including recurrent neural network (RNN) models, convolutional neural network (CNN) models, and generative adversarial network (GAN) methods. However, mainstream models suffer from huge memory usage or vanishing gradient problems [21][22][23]. Because predictive learning for spatiotemporal data, especially in precipitation nowcasting, always deals with entangled objects, changing shapes, and varying directions, it is a more challenging task than traditional temporal sequence regression, and it is a research direction worth exploring. A predictive-learning-based framework can solve the issue well, but the internal unit structure of the LSTM is complicated. Our motivation is to explore a more straightforward unit structure and to solve or mitigate the vanishing gradient problem.
ISPRS Int. J. Geo-Inf. 2020, 9, 209
To address the vanishing gradient problem with a simpler cell structure, we present CostNet, a novel RNN-based network. It is well known that ResNet [1,24] excelled in the ImageNet competition; its skip connections greatly increase the depth of convolutional networks without causing vanishing or exploding gradients. Inspired by the idea of skip connections, CostNet has horizontal and vertical cross-connections. The core of this network is a concise unit, named Horizon LSTM, with a fast gradient transmission channel, which provides a quick route from future predictions back to distant previous inputs to alleviate the gradient propagation difficulty. In the vertical direction outside of the unit, we add overpass connections from the unit output to the bottom layer, which capture the short-term dynamics to generate clear predictions. Our model achieves better prediction results on the Moving MNIST and radar datasets than the state-of-the-art models, showing a strong modeling capability for spatiotemporal data.
In this study, we propose a novel RNN-based network called CostNet. The paper is organized as follows. Related work is reviewed in Section 2. Section 3 introduces preliminaries. Section 4 presents the Horizon LSTM and the vertical structure of CostNet. Experiments and results are given in Section 5, followed by the conclusion in Section 6.

Related Work
In recent years, a growing number of predictive learning models have been proposed, which are mainly based on convolutional neural networks (CNN) [25], recurrent neural networks (RNN) [26,27] or generative adversarial networks (GAN) [28,29].
Owing to their powerful ability to extract spatial correlations, CNNs have achieved great success in the computer vision field, for example in image classification and object detection [20]. Some researchers have attempted to model spatiotemporal data based on CNNs. Oh et al. introduced a CNN-based action autoencoder model for predicting Atari game video [30], but its performance on real-world video is poor. De Brabandere et al. constructed dynamic filter networks, in which filters are generated conditioned on the input samples [31]. Zhang et al. designed deep spatiotemporal networks for citywide crowd flow prediction using residual learning and a fusion mechanism [12]. However, the model is only applied to very short-term prediction. Villegas et al. built a three-stage framework with additional annotated human joint data for long-term prediction [32]. However, it works in a supervised manner and requires landmarks as ground truth.
Owing to their powerful ability to model temporal dependencies, RNNs have achieved great success in the natural language processing field, for example in machine translation and intelligent conversational systems. Some researchers have attempted to model spatiotemporal data based on RNNs. Ranzato et al. introduced the first RNN framework inspired by language modeling and built a baseline for video prediction [14]. However, it has been shown that the model can only predict one frame ahead. Srivastava et al. employed the sequence-to-sequence LSTM network from language modeling to make multi-step video predictions [33]. In that model, the temporal characteristics are captured by the fully connected LSTM (FC-LSTM) layer, which cannot extract spatial correlations. To learn spatial and temporal characteristics simultaneously, Shi et al. introduced the convolution operator into the input-to-state and state-to-state transitions and presented the convolutional LSTM (ConvLSTM) [7]. However, the stacked encoder-decoder architecture tends to produce fuzzy results. Because of its artful design, ConvLSTM has become an important reference for later research on spatiotemporal data. Finn et al. extended the convolutional LSTM model to robotics planning and constructed an action-conditioned video prediction network [34]. Patraucean et al. built a spatiotemporal video autoencoder with differentiable memory for action recognition [35], which models short-term temporal dynamics and predicts only one future frame, drawing in part on optical flow and the convolutional LSTM. Villegas et al. also presented recurrent models based on the convolutional LSTM, using optical flow as guiding features to help capture short-term dynamics for video prediction, and built an encoder-decoder network that separates motion and content into different encoder pathways for pixel-level future prediction [36]. Lotter et al. proposed a deep predictive coding network built upon ConvLSTM and designed specifically for one-frame video prediction [15]. Shi et al. went on to explore a new model that addresses the location-invariance problem and proposed a benchmark for precipitation nowcasting [8]. Combining gated CNNs and ConvLSTM, Kalchbrenner et al. designed a sophisticated probabilistic video model, named Video Pixel Network (VPN) [16], which encodes a four-dimensional dependency chain from raw videos and estimates the discrete joint distribution of pixel values one by one. This model gives sharp prediction frames but also brings high computational complexity and low prediction efficiency. Unlike the stacked ConvLSTMs, Wang et al. proposed a novel encoder-decoder architecture (PredRNN) for spatiotemporal predictive learning [9]. It adds zigzag memory flows from the top layer to the bottom layer, which is beneficial for modeling short-term video dynamics, and uses a complex unit named ST-LSTM, with dual (temporal and spatiotemporal) memory flows, as its building block. Wang et al. later developed PredRNN++ [37], which uses a Gradient Highway Unit (GHU) to alleviate the deep-in-time dilemma, and proposed a more reasonable but still complex unit named Causal LSTM.
Owing to their powerful ability to generate similar patterns, GANs have become a hot research topic in machine learning, with applications such as image style transfer and video generation. Some researchers have attempted to model spatiotemporal data based on GANs. Mathieu et al. introduced generative adversarial networks to video prediction [17]: a generator produces prediction frames and a discriminator then distinguishes real frames from fake ones. More adversarial learning methods have since been presented for video prediction [38][39][40][41]. These methods can generate sharper frames than traditional CNN or RNN methods. However, they need careful training because adversarial networks are unstable.
In summary, each approach has its own disadvantages. GAN-based approaches can generate sharp frames but do not capture the temporal dynamics in long-term prediction. Generally speaking, CNN-based approaches are also poor at long-term prediction because convolutional structures can extract spatial correlations but do not model temporal dynamics effectively. By contrast, RNN-based approaches are good at modeling temporal dependencies in long-term prediction but tend to generate blurry predictions because of the well-known vanishing gradient problem. In this study, we propose a concise overpass spatiotemporal network, which can model spatial and temporal characteristics simultaneously.

Preliminaries
The goal of predictive learning for spatiotemporal data is to forecast future frames from previous observation sequences. From a mathematical view, this task can be regarded as a probability estimation problem. We take a video clip (a common format of spatiotemporal data) as the research object. In general, it is a temporal sequence that spans from $t-J+1$ to $t+K$. Given a time stamp $t$, $x_{t-J+1}, \ldots, x_t$ (length $J$) represents the previous observations and $x_{t+1}, \ldots, x_{t+K}$ (length $K$) represents the ground truth values of the future status. Each observation $x$ is a spatial representation that can be written as a tensor in $\mathbb{R}^{C \times M \times N}$, where $\mathbb{R}$ denotes the domain of the features and $C$, $M$ and $N$ denote the number of channels, the height and the width of a frame, respectively. The essence of prediction is to predict the future length-$K$ sequence based on the known length-$J$ sequence so as to maximize the prediction probability $p$. The predictions $\hat{x}_{t+1}, \ldots, \hat{x}_{t+K}$ are used as estimates of the ground truth $x_{t+1}, \ldots, x_{t+K}$. This process can be implemented by an encoder-decoder architecture. Many models for predictive learning use the encoder-decoder architecture, including FC-LSTM, ConvLSTM, ST-LSTM, Causal LSTM and our model. First, the encoder encodes the previous observations into intermediate states, and then the decoder generates prediction results based on these intermediate states. The formulas are given in Formula (1).
LSTM, a recurrent cell unit with four gate structures inside, is suitable for processing temporal sequences. According to [27], the main formulas of LSTM are those of Formula (2), where $\sigma$ is the sigmoid activation function and the two product operators denote the matmul product and the Hadamard product, respectively. However, for spatial data, the matmul product generates too many redundant (full) connections to extract spatial correlations efficiently.
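For reference, the two ingredients described in this paragraph can be written out as follows. Both are reconstructions in the notation defined above and are not verbatim copies of Formula (1) or Formula (2): the first line is the maximizer of the conditional probability $p$ of the future length-$K$ sequence given the length-$J$ observations, and the remaining lines are the widely used peephole-free form of the LSTM cell, which may differ from the variant in [27]. The weight matrices $W$, the biases $b$ and the gate names $i_t$, $f_t$, $g_t$, $o_t$ are notation assumed here; $\circ$ is the Hadamard product.

```latex
\begin{aligned}
\hat{x}_{t+1}, \ldots, \hat{x}_{t+K}
  &= \underset{x_{t+1}, \ldots, x_{t+K}}{\arg\max}\;
     p\left(x_{t+1}, \ldots, x_{t+K} \mid x_{t-J+1}, \ldots, x_{t}\right) \\[6pt]
i_t &= \sigma\left(W_{xi}\, x_t + W_{hi}\, h_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf}\, x_t + W_{hf}\, h_{t-1} + b_f\right) \\
g_t &= \tanh\left(W_{xg}\, x_t + W_{hg}\, h_{t-1} + b_g\right) \\
o_t &= \sigma\left(W_{xo}\, x_t + W_{ho}\, h_{t-1} + b_o\right) \\
c_t &= f_t \circ c_{t-1} + i_t \circ g_t \\
h_t &= o_t \circ \tanh\left(c_t\right)
\end{aligned}
```

Every product $W x_t$ or $W h_{t-1}$ above is a full matrix product, which is the source of the redundant connections mentioned for spatial data.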
By combining the convolution layer and the recurrent layer, Shi et al. proposed ConvLSTM [7], which is widely used for spatiotemporal data because it extracts spatial correlations and temporal dynamics simultaneously. ConvLSTM replaces the matmul product in the fully connected LSTM cell with convolution. The main formulas of ConvLSTM are given in Formula (3), where $\sigma$ is the sigmoid activation function, $*$ is the convolution operator and $\circ$ indicates the Hadamard product. However, the network only stacks four layers of ConvLSTM units vertically, and the layers are independent of each other from one time step to the next, so the bottom layer ignores the characteristics extracted by the top layer at the previous time step. Predictions therefore cannot capture short-term trends and tend to be fuzzy.
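As a point of reference, the ConvLSTM cell can be written in the form published by Shi et al. [7], shown below. The notation of Formula (3) in this paper may differ; here $*$ is convolution, $\circ$ is the Hadamard product, $X_t$ is the input, $H_t$ the hidden state and $C_t$ the cell memory, and the terms $W_{c\cdot} \circ C$ are the peephole connections of that original formulation.

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f\right) \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o\right) \\
H_t &= o_t \circ \tanh\left(C_t\right)
\end{aligned}
```

Compared with the fully connected cell, the only structural change is that the input-to-state and state-to-state products become convolutions, so the states keep their spatial layout.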
To overcome the drawback of the layer-independent architecture in ConvLSTM, Wang et al. proposed a novel encoder-decoder architecture (PredRNN) [9] with zigzag memory flows from the top layer to the bottom layer and designed a dual-memory unit named ST-LSTM, which uses complicated nonlinear transition functions. PredRNN has a strong capability for modeling short-term video dynamics and generates clearer predictions than ConvLSTM. The key equations of ST-LSTM are given in Formula (4), where $\sigma$ is the sigmoid activation function, $*$ is the convolution operator and $\circ$ indicates the Hadamard product. The square brackets represent concatenation and the round brackets group a whole expression. Unfortunately, the gradient values fall exponentially during back-propagation, so the complicated ST-LSTM still suffers from the vanishing gradient problem [23].
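The exponential decay of the gradient can be illustrated with a deliberately tiny example. The sketch below is not ST-LSTM; it assumes a scalar recurrence $h_t = \tanh(w\,h_{t-1})$ with a hypothetical weight `w = 0.5` and multiplies the per-step derivatives along the chain, which is what back-propagation through time does.

```python
import math


def gradient_through_time(w, h0, steps):
    """Return |d h_T / d h_0| for T = 1..steps of h_t = tanh(w * h_{t-1})."""
    grads = []
    h = h0
    grad = 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t ** 2)
        grad *= w * (1.0 - h * h)
        grads.append(abs(grad))
    return grads


# Hypothetical settings chosen only for illustration.
grads = gradient_through_time(w=0.5, h0=0.8, steps=20)
for step in (1, 5, 10, 20):
    print(f"after {step:2d} steps: {grads[step - 1]:.3e}")
```

Each factor is at most `w` in magnitude, so with `w = 0.5` the gradient after 20 steps is bounded by $0.5^{20}$. Shortening the path that the gradient has to travel is the motivation for the fast transmission channel discussed in the next section.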

Methodology
We present a novel method based on ST-LSTM to explore a straightforward unit structure and to mitigate the gradient vanishing problem. In this section, we will describe details of CostNet, a concise overpass spatiotemporal network. We adopt an encoder-decoder architecture with four layers and employ the Horizon LSTM as our backbone block. Our approach has two key insights: First, the core of this network is the Horizon LSTM, a concise unit with a fast gradient transmission channel, which can extract spatial and temporal representations effectively to alleviate the gradient propagation difficulty. Second, in the vertical direction outside of the unit, we add overpass connections from unit output to the bottom layer, which can capture the short-term dynamics to generate clear predictions.

Horizon LSTM
Similar to the ST-LSTM, our Horizon LSTM has a dual-memory structure: the temporal memory $C$ and the spatiotemporal memory $M$. The memory $C$, which flows horizontally from the previous step to the next step, captures the temporal dependencies. The memory $M$, which moves vertically from the bottom layer to the top layer, extracts the spatial correlations. The Horizon LSTM unit is inspired by the skip connections of ResNet rather than the complex gate structures of ST-LSTM. The structure of Horizon LSTM is shown in Figure 1. There are four inputs to Horizon LSTM: $X_t$, $H^l_{t-1}$, $C^l_{t-1}$ and $M^{l-1}_t$. $X_t$ is the input frame in the first layer at the current time stamp. $H^l_{t-1}$ is the output hidden state of the current layer at the previous time stamp. $C^l_{t-1}$ is the temporal memory output state of the current layer at the previous time stamp. $M^{l-1}_t$ is the spatiotemporal memory output state of the layer below at the current time stamp. In the first layer, this input is instead $M_{t-1}$, the spatiotemporal memory output state of the top layer at the previous time stamp. There are three outputs for Horizon LSTM: the hidden state $H^l_t$, the temporal memory $C^l_t$ and the spatiotemporal memory $M^l_t$.
The Horizon LSTM consists of an input gate, an input modulation gate, a forget gate and an output gate. The forget gate controls the spatiotemporal information flow $M$. The temporal memory flow $C$ depends on the input gate, the input modulation gate and the forget gate in our Horizon LSTM block. The output hidden state $H^l_t$ of the current layer at the current time stamp $t$ is determined by the temporal memory $C$ as well as the output gate. As shown in Figure 1, the spatiotemporal memory $M$ passes through the gate structures of the Horizon LSTM in an overpass way, just like the temporal memory $C$. Since there are only a few blocks in the memory route, the Horizon LSTM provides a fast gradient transmission channel for both the temporal memory $C$ and the spatiotemporal memory $M$ from near predictions back to distant previous inputs, easing the gradient propagation difficulty. The key equations of the Horizon LSTM unit are given in Formula (5).
In the vertical direction, CostNet is inspired by the skip connections of ResNet rather than the direct connection between the top layer and the bottom layer in PredRNN. The network topology of CostNet is illustrated in Figure 2.
The key equations of the entire CostNet are presented in Formula (6), where $\mathrm{HorizonLSTM}_1$ denotes the Horizon LSTM unit in the first (bottom) layer. The left-hand side of each equation lists the outputs of the Horizon LSTM and the terms in round brackets are its inputs. The square brackets denote concatenation of the spatiotemporal memory $M_{t-1}$ at the previous time stamp.
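The wiring described above (hidden state and temporal memory passed along time, spatiotemporal memory passed up the layers, and memories from the previous time stamp concatenated into the bottom layer) can be traced with a small data-flow sketch. Everything inside `toy_cell` is a hypothetical placeholder and is not Formula (5); which memories are concatenated is also an assumption here (all layers' $M$ at $t-1$). Only the four-input, three-output interface and the four-layer stack come from the text.

```python
import numpy as np


def toy_cell(x, h_prev, c_prev, m_in):
    """Placeholder for a Horizon LSTM unit (NOT the paper's Formula (5)).

    x, h_prev, c_prev share one shape; m_in stacks one or more memories
    along axis 0. Returns the three outputs (H, C, M).
    """
    m_merged = m_in.mean(axis=0)  # stand-in for processing the concatenation
    c = 0.5 * c_prev + 0.5 * np.tanh(x + h_prev)
    m = 0.5 * m_merged + 0.5 * np.tanh(x)
    h = np.tanh(c + m)
    return h, c, m


def run_costnet_wiring(frames, num_layers=4):
    """Trace the data flow over a sequence of frames of shape (T, C, H, W)."""
    shape = frames[0].shape
    hidden = [np.zeros(shape) for _ in range(num_layers)]
    temporal = [np.zeros(shape) for _ in range(num_layers)]
    m_previous = [np.zeros(shape) for _ in range(num_layers)]
    outputs = []
    for x in frames:
        m_current = []
        layer_input = x
        for layer in range(num_layers):
            if layer == 0:
                # Assumed overpass: bottom layer sees every layer's M at t-1.
                m_in = np.stack(m_previous)
            else:
                # Upper layers take M from the layer below at the current time.
                m_in = m_current[layer - 1][None]
            hidden[layer], temporal[layer], m_out = toy_cell(
                layer_input, hidden[layer], temporal[layer], m_in
            )
            m_current.append(m_out)
            layer_input = hidden[layer]
        m_previous = m_current
        outputs.append(hidden[-1])
    return np.stack(outputs)


frames = np.random.default_rng(0).random((5, 1, 8, 8))
top_layer_states = run_costnet_wiring(frames)
print(top_layer_states.shape)
```

The sketch is meant only to show which state goes where at each time stamp and layer; a real implementation would replace `toy_cell` with the gated convolutional unit.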

Experiments
In this section, we evaluate our model through comparative experiments on two datasets to demonstrate its effectiveness. We first describe the general configuration of our experiments. For each evaluation dataset, we then introduce the dataset and the implementation procedure and report the experimental results of our model and the baseline models. Finally, we analyze performance quantitatively and visualize prediction examples qualitatively.
Our model was developed in Python and implemented in Keras [42] with TensorFlow [43] as the backend. All the experiments were run on an Ubuntu server with a single NVIDIA GTX 1080 Ti GPU. The general configuration is as follows. The ADAM [44] optimizer is adopted with a starting learning rate of $10^{-3}$. The technique of [45] is adopted to avoid internal covariate shift problems. In addition, we employ the scheduled sampling strategy [46] to reduce the differences between inference and training. To improve training efficiency, we use Keras callback functions such as EarlyStopping, ModelCheckpoint and ReduceLROnPlateau. The source code and data are available with a DOI at https://doi.org/10.6084/m9.figshare.11917914.v1.
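The idea behind scheduled sampling can be sketched in a few lines: during training, each decoder input is the true frame with some probability and the model's own prediction otherwise, and that probability is lowered as training proceeds so that training gradually resembles inference. The linear decay, the function names and the string stand-ins for frames below are assumptions made for illustration; the schedule actually used in the experiments is not stated in the text.

```python
import random


def truth_probability(step, total_steps, start=1.0, end=0.0):
    """Linearly decay the probability of feeding the ground truth (assumed schedule)."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * fraction


def choose_decoder_inputs(ground_truth, predictions, prob_truth, rng):
    """Pick, per decoding step, the true frame with probability prob_truth,
    otherwise the model's own prediction for that step."""
    return [
        truth if rng.random() < prob_truth else predicted
        for truth, predicted in zip(ground_truth, predictions)
    ]


rng = random.Random(0)
truth_frames = [f"x{t}" for t in range(11, 21)]          # stand-ins for true frames
predicted_frames = [f"x_hat{t}" for t in range(11, 21)]  # stand-ins for predictions

early = choose_decoder_inputs(truth_frames, predicted_frames, truth_probability(0, 1000), rng)
late = choose_decoder_inputs(truth_frames, predicted_frames, truth_probability(1000, 1000), rng)
print(early[:3], late[:3])
```

At the start of training the decoder sees only true frames, and at the end it sees only its own predictions, which matches the conditions at inference time.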

Implementation
Moving MNIST is a synthetic dataset constructed by moving digits from the MNIST dataset. It contains many data records, each of which is a sequence of length 20 (10 input frames followed by 10 prediction frames). Each frame is a 64 × 64 × 1 grayscale image containing two handwritten digits bouncing inside. Because the digit selection, initial position, velocity direction and velocity magnitude are random, it is difficult to predict future frames. We generate the sequences in the way introduced by Srivastava et al. [33]. We split the dataset into a training set with 10,000 sequences, a validation set with 3000 sequences and a test set with 5000 sequences.
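A minimal sketch of how such a sequence can be generated is given below. It is not the generator of Srivastava et al. [33]: random 28 × 28 patches stand in for real MNIST digits, and the speed range and the pixel-wise maximum used to merge overlapping digits are assumptions. The frame size, sequence length and the 10/10 split follow the description above.

```python
import numpy as np


def make_sequence(digits, rng, seq_len=20, frame_size=64):
    """Bounce each square patch in `digits` inside a square frame.

    Returns an array of shape (seq_len, frame_size, frame_size).
    """
    frames = np.zeros((seq_len, frame_size, frame_size), dtype=np.float32)
    for digit in digits:
        size = digit.shape[0]
        limit = frame_size - size                 # largest valid top-left coordinate
        pos = rng.uniform(0, limit, size=2)       # random initial (row, col)
        angle = rng.uniform(0, 2 * np.pi)         # random direction
        speed = rng.uniform(2.0, 5.0)             # pixels per frame (assumed range)
        vel = speed * np.array([np.sin(angle), np.cos(angle)])
        for t in range(seq_len):
            row, col = np.round(pos).astype(int)
            region = frames[t, row:row + size, col:col + size]
            np.maximum(region, digit, out=region)  # merge overlapping digits in place
            pos = pos + vel
            for axis in range(2):
                # Reflect position and velocity at the frame borders.
                if pos[axis] < 0:
                    pos[axis] = -pos[axis]
                    vel[axis] = -vel[axis]
                elif pos[axis] > limit:
                    pos[axis] = 2 * limit - pos[axis]
                    vel[axis] = -vel[axis]
    return frames


rng = np.random.default_rng(0)
# Stand-ins for two MNIST digits; real digits would be loaded from the dataset.
digits = [rng.random((28, 28)).astype(np.float32) for _ in range(2)]
sequence = make_sequence(digits, rng)
inputs, targets = sequence[:10], sequence[10:]
print(inputs.shape, targets.shape)
```

The first 10 frames of each generated sequence serve as the model input and the last 10 as the prediction target.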

Results
The intuitive way to measure the uncertainty of predictive learning is variance. We adopted two quantitative metrics to evaluate the performance of all models. One metric is the mean square error (MSE), an objective indicator that represents the distance between the true frames and the predictions. A better model should have a lower MSE; in the ideal case, the minimum value is zero. The other metric is the per-frame structural similarity index measure (SSIM) [47], an indicator designed to reflect the perceived similarity between two images. A better model should have a higher SSIM; in the ideal case, the maximum value is 1. Table 1 shows the performance of the different models.
We plot frame-wise curves of the different models for predicting 10 frames. As illustrated in Figure 3, CostNet is evaluated against state-of-the-art methods including FC-LSTM, ConvLSTM and Causal LSTM. The performance of all models declines over time. Nevertheless, our model outperforms the state-of-the-art methods, with a lower MSE curve and a higher SSIM curve. Compared with Causal LSTM, a recent state-of-the-art method, our model works slightly better, especially for the last four frames. The results indicate that our model is well suited to capturing long-term video dependencies.
Finally, we visualize some examples from the Moving MNIST test set to compare the different models qualitatively. All models predict 10 future frames given 10 previous frames. As illustrated in Figure 4, the first row shows the previous input frames, the second row shows the ground truth, the third to eleventh rows show the predictions of FC-LSTM, ConvLSTM, TrajGRU, CDNA, DFN, FRNN, VPN, ST-LSTM and Causal LSTM, respectively, and the last row shows the predictions of our model. We observe that the predictions of our model are sufficiently sharp.
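The two metrics used for Moving MNIST can be sketched as follows. The MSE here averages over the pixels of each frame, which is an assumption about the normalization. The SSIM function is a simplified single-window variant that uses the standard constants $(0.01L)^2$ and $(0.03L)^2$ but computes the statistics over the whole image; the measure of [47] averages the same expression over local sliding windows, so the numbers will not match a full implementation.

```python
import numpy as np


def per_frame_mse(truth, pred):
    """Mean squared error of each frame; inputs have shape (T, H, W)."""
    diff = truth.astype(np.float64) - pred.astype(np.float64)
    return (diff ** 2).mean(axis=(1, 2))


def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM of two images (simplification: no sliding window)."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


rng = np.random.default_rng(0)
truth = rng.random((10, 64, 64))  # stand-in for 10 ground-truth frames in [0, 1]
noisy = np.clip(truth + 0.1 * rng.standard_normal(truth.shape), 0.0, 1.0)

mse_curve = per_frame_mse(truth, noisy)
ssim_curve = np.array([global_ssim(t, p) for t, p in zip(truth, noisy)])
print(mse_curve.shape, ssim_curve.shape)
```

Plotting `mse_curve` and `ssim_curve` against the frame index gives frame-wise curves of the kind compared in Figure 3.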

Implementation
To verify the effectiveness of our model on practical data, we adopt the standard radar dataset 2018 (SRAD2018), which comes from the IEEE ICDM 2018 global weather AI challenge. The radar dataset spans four months, from 00:00 UTC on March 15 to 23:54 UTC on July 15, of each year from 2010 to 2017. There are 320,000 records in this dataset, including 300,000 records in the training set and 20,000 records in the test set. Each record has a length of 61 frames, with an interval of 6 min. The radar covers one vertical level at an altitude of 3 km. After quality control, the radar echo data are limited to the range 0-80 (unit: dBZ), and missing values are marked as 255. The radar data at each time are stored in grayscale PNG format with a resolution of 501 × 501. Our goal in this experiment is to predict the future 10 frames based on the previous 10 consecutive frames. We performed some data preprocessing, such as reshaping the images to 200 × 200. In addition, we undersampled the original dataset, taking one image at every three intervals. After preprocessing, the training set has 80,000 sequences, the validation set has 10,000 sequences, and the test set has 1000 sequences.
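The preprocessing steps listed above can be sketched as follows. Several details are assumptions, since the text does not specify them: missing values (255) are mapped to zero echo, intensities are scaled to [0, 1] by dividing by 80 dBZ, resizing uses nearest-neighbour index sampling, and temporal undersampling keeps every third frame of a record. The function name `preprocess_record` is hypothetical.

```python
import numpy as np

MISSING = 255    # missing-value marker stated in the text
MAX_DBZ = 80.0   # upper bound of valid echoes stated in the text


def preprocess_record(record, out_size=200, stride=3):
    """Turn one radar record of shape (61, 501, 501) into model-ready frames.

    Returns a float32 array of shape (21, out_size, out_size) scaled to [0, 1].
    """
    frames = record[::stride].astype(np.float32)   # keep every third frame
    frames[frames == MISSING] = 0.0                # assumption: missing -> no echo
    frames = np.clip(frames, 0.0, MAX_DBZ) / MAX_DBZ
    # Nearest-neighbour resize by index sampling (assumed; square frames only).
    idx = np.round(np.linspace(0, frames.shape[1] - 1, out_size)).astype(int)
    return frames[:, idx][:, :, idx]


rng = np.random.default_rng(0)
# Synthetic stand-in for one record; real records are read from PNG files.
record = rng.integers(0, 81, size=(61, 501, 501), dtype=np.uint8)
record[0, :10, :10] = MISSING

processed = preprocess_record(record)
inputs, targets = processed[:10], processed[10:20]
print(processed.shape, inputs.shape, targets.shape)
```

With a stride of three, a 61-frame record yields 21 frames spaced 18 min apart, enough for the 10 input frames and 10 target frames used in the experiment.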

Results
We adopted three quantitative metrics to evaluate the performance of all models: the mean square error (MSE), the per-frame structural similarity index measure (SSIM) and the per-frame peak signal-to-noise ratio (PSNR) [47]. SSIM focuses on the difference in sharpness, while PSNR emphasizes pixel-level correctness. A better model should have higher SSIM and PSNR values. In the ideal case, SSIM reaches its maximum value of 1, while PSNR, which for 8-bit images is computed with a peak signal value of 255, grows without bound as the error approaches zero. Table 2 shows the performance of the different models for predicting 10 frames given the previous 10 frames on the radar dataset. As shown in the table, our model is evaluated against state-of-the-art methods including ConvLSTM, TrajGRU, ST-LSTM and Causal LSTM. Our model outperforms all of them in both SSIM and PSNR. It reduces the per-frame MSE from 3580.31 to 888.81, increases the per-frame SSIM from 0.62 to 0.79 and increases the per-frame PSNR from 12.13 to 17.48. Compared with Causal LSTM, a recent state-of-the-art method, our model achieves competitive predictions, with a PSNR that is 0.14 higher and an SSIM that is 0.01 higher. The results show that CostNet can model radar data effectively.
We plot frame-wise curves of the different models for predicting 10 frames. Better predictions should have higher frame-wise SSIM and PSNR curves. As illustrated in Figure 5, our model is evaluated against state-of-the-art methods including ConvLSTM, TrajGRU, ST-LSTM and Causal LSTM. The performance of all models declines over time. Nevertheless, our model outperforms the state-of-the-art methods, with higher SSIM and PSNR curves. Compared with Causal LSTM, a recent state-of-the-art method, our model works slightly better, especially for the last four frames. The results indicate that our model is well suited to capturing long-term video dependencies.
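PSNR follows directly from the MSE through $\mathrm{PSNR} = 10 \log_{10}(\mathrm{peak}^2 / \mathrm{MSE})$. The sketch below assumes 8-bit frames with a peak value of 255; the function name and the convention of returning infinity for identical frames are choices made here, not taken from the paper.

```python
import numpy as np


def psnr(truth, pred, peak=255.0):
    """Peak signal-to-noise ratio in dB between two frames of equal shape."""
    mse = np.mean((truth.astype(np.float64) - pred.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames: no noise at all
    return 10.0 * np.log10(peak ** 2 / mse)


frame = np.full((200, 200), 40.0)
print(psnr(frame, frame))        # identical frames
print(psnr(frame, frame + 1.0))  # every pixel off by one grey level
```

Because the MSE appears in the denominator, halving the error raises PSNR by about 3 dB regardless of the image content.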
The SSIM of CostNet improves by 0.1 over that of Causal LSTM. The significant improvement is mainly in the final 5 frames. This result shows that CostNet has a stronger ability to predict long-term temporal scenarios.
We visualize examples from the radar test set to compare the different models qualitatively. All models predict 10 future frames given 10 previous frames. As illustrated in Figure 6, the first row shows the previous input frames, the second row shows the ground truth, the third to sixth rows show the predictions of ConvLSTM, TrajGRU, ST-LSTM and Causal LSTM, respectively, and the last row shows the predictions of our model. We observe that the predictions of our model are sufficiently sharp.

Conclusions and Future Work
To address the vanishing gradient problem with a simpler cell structure, we propose CostNet, a concise overpass spatiotemporal network with horizontal and vertical cross-connections. The core of CostNet is a concise unit, named Horizon LSTM, with a fast gradient transmission channel, which provides a quick route from near predictions back to distant previous inputs to ease the gradient propagation difficulty. In the vertical direction outside of the unit, we add overpass connections from the unit output to the bottom layer, which capture the short-term dynamics to generate clearer predictions. CostNet can extract spatiotemporal feature information at each time step. CostNet achieves better prediction results on the Moving MNIST and radar datasets than the state-of-the-art models. The results show that CostNet has a strong modeling capability for spatiotemporal data. However, the uncertainty of the predictive learning is currently evaluated only with MSE, SSIM and PSNR. The critical success index (CSI) can be used as a skill score to evaluate the uncertainty of dynamic predictive learning [48]. A higher CSI denotes a better prediction. We will evaluate the uncertainty of the predictive learning results using CSI in the future. Predictive learning for spatiotemporal data is still an extremely challenging topic for many predictive learning methods. In the near future, we will explore different network structures and predictive learning that works in an unsupervised manner over cyberinfrastructure.
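For readers unfamiliar with the critical success index, it is commonly defined as hits / (hits + misses + false alarms) after both the observation and the forecast have been converted to binary events by a threshold. The sketch below follows that common definition; the threshold of 20 dBZ and the function name are hypothetical choices for illustration, and the exact protocol of [48] may differ.

```python
import numpy as np


def critical_success_index(truth, pred, threshold):
    """CSI = hits / (hits + misses + false alarms) for events >= threshold."""
    observed = truth >= threshold
    forecast = pred >= threshold
    hits = np.sum(observed & forecast)
    misses = np.sum(observed & ~forecast)
    false_alarms = np.sum(~observed & forecast)
    total = hits + misses + false_alarms
    return float(hits) / total if total > 0 else float("nan")


# Tiny hand-made example in dBZ: one hit, one miss, one false alarm.
truth = np.array([[30.0, 10.0], [40.0, 5.0]])
pred = np.array([[35.0, 25.0], [10.0, 0.0]])
score = critical_success_index(truth, pred, threshold=20.0)
print(score)
```

Unlike MSE or PSNR, the score ignores correct non-events, so it is not inflated by the large echo-free areas that are typical of radar frames.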

Conflicts of Interest:
The authors declare no conflict of interest.