1. Introduction
Nowcasting commonly refers to the prediction of weather conditions over the next 0–2 h, with a specific emphasis on small-scale weather patterns, particularly those associated with intense convective weather events. Compared with large-scale weather systems, meso- and micro-scale systems are abrupt, short-lived, and governed by intricate and varied mechanisms, which makes near-term precipitation prediction exceedingly demanding. Radar echoes, which provide precise and detailed information in both time and space, are widely used as powerful instruments for nowcasting. Currently, the two prevalent methods for real-time precipitation prediction are numerical weather prediction and radar echo extrapolation. Of these, radar echo extrapolation has emerged as the major strategy due to its superior accuracy and reliability [1].
Traditional radar echo extrapolation and remote sensing image extrapolation methods based on the semi-Lagrangian (RPM-SL) [2] advection scheme, such as centroid tracking, tracking radar echoes by correlation (TREC), and the optical flow (OF) method, achieve probabilistic or deterministic precipitation forecasting by calculating the optical flow of radar or satellite echo motion fields and adding random perturbations to those fields. These methods consume few computational resources and run quickly [3], but they rely on the advection equation and struggle to adequately capture the development and decay of precipitation clouds. As the extrapolation time increases, the accuracy of echo extrapolation decreases rapidly. Short-term precipitation forecasting is a typical spatiotemporal prediction problem. With the rapid development of AI technology, especially the breakthroughs of deep learning (DL) in computer vision and time-series processing, deep learning methods can mine spatial and temporal variation features from large amounts of historical meteorological satellite and radar data [4,5], yielding higher-precision short-term forecasting models that are better suited to local severe weather systems.
The approaches used for spatiotemporal predictive learning can be divided into two main groups: recurrent-based models and recurrent-free models. Recently, researchers have increasingly applied spatiotemporal predictive learning models to near-term precipitation prediction. Recurrent-based models commonly adopt a hybrid design that integrates convolutional neural networks with the recurrent units of recurrent neural networks (RNNs) [6]. This hybrid technique enables the models to independently learn spatial correlations and temporal evolution characteristics in the data. The following models are typical examples of recurrent-based models and are widely applied in spatiotemporal sequence prediction. ConvLSTM [7] enhances conventional LSTM architectures by incorporating convolutions to effectively process visual input. PredNet [8] utilizes a deep recurrent convolutional neural network with both bottom-up connections (extracting features from raw data and gradually building higher-level scene representations) and top-down connections (using higher-level contextual information to influence and correct the interpretation of lower-level data) to continually forecast upcoming video frames. PredRNN [9] introduces a spatiotemporal LSTM unit that extracts and retains spatiotemporal representations, capturing spatial and temporal information simultaneously. PredRNN++ [10] incorporates gradient highway units to mitigate gradient vanishing and a Causal-LSTM module to sequentially connect spatial and temporal memories. Highway layers can effectively propagate gradients in very deep feedforward networks; in PredRNN++, to prevent the rapid vanishing of long-term gradients, a highway approach is used to learn frame skipping, leading to a new spatiotemporal recurrent structure, the gradient highway unit. PredRNN-V2 [11] improves performance with curriculum-learning techniques and a memory decoupling loss. The memory decoupling loss enables the dual memory cells to perform their respective duties, separately focusing on long-term temporal dependencies propagated from left to right and short-term spatiotemporal dynamics transmitted vertically. In addition, a new curriculum-learning strategy is proposed that uses reverse scheduled sampling (RSS) for encoding and scheduled sampling (SS) for decoding, allowing PredRNN to learn long-term dynamics from the contextual frames. MIM [12] incorporates a non-stationary learning mechanism within the LSTM module to effectively process intricate time series. PhyDNet [13] explicitly disentangles partial differential equation (PDE) dynamics from unknown complementary information in recurrent physical units. MAU [14] introduces a motion-aware unit that specifically captures motion information.
Although recurrent-based models are effective at handling spatiotemporal data, they frequently suffer from significant computational complexity, resulting in prolonged training and inference. In addition, they are susceptible to gradient vanishing or gradient explosion during training, which can impair the convergence and stability of the model. Recurrent-free models, on the other hand, use a structure built from convolutional neural networks, possibly combined with attention mechanisms. Currently, advances in recurrent-free models mostly build on SimVP (simpler yet better video prediction). SimVP does not use advanced modules such as RNN, LSTM, or Transformer, nor does it introduce complex training strategies such as adversarial training or sampling-based learning; all of its modules are composed of CNNs. The following models are typical examples of recurrent-free models and are widely applied in spatiotemporal sequence prediction.
SimVP [15] is an innovative work that merges the Inception module with the UNet architecture to learn temporal changes; its temporal module is specifically built to run in parallel. The IncepU module in SimVP is composed of a UNet-like framework that utilizes multi-scale elements from an Inception-like architecture. TAU [16] substitutes an attention module for the core Inception-UNet, allowing the temporal module to be parallelized while also capturing long-term temporal evolution. TAU and gSTA [17] enhance the IncepU module with a streamlined and highly efficient design that does not depend on InceptionNet or a UNet-like architecture. Recurrent-based models use recurrent modules to predict the next frame autoregressively, whereas recurrent-free models predict all future frames simultaneously, using all available frames as input. OpenSTL [18] introduces MetaFormers [19] to enhance the temporal modeling of recurrent-free models, facilitating recurrent-free spatiotemporal predictive learning. In this design, OpenSTL uses MetaFormers as the temporal module, which converts the input channels from the original C to inter-frame channels T × C; recurrent-free models are thereby improved by extending the recurrent-free architecture and exploiting the benefits of MetaFormers. OpenSTL incorporates ViT [20], UniFormer [21], HorNet [22], and MogaNet [23] into recurrent-free models via MetaFormers, replacing the intermediate temporal modules of the original recurrent-free design.
While deep learning-based nowcasting systems have shown promising outcomes, different algorithms display distinct predictive abilities. The occurrence, development, and movement of radar echoes are, however, quite intricate. Hence, evaluating and comparing the predictive capabilities of deep learning algorithms helps to select methods suitable for specific regions and to construct the most effective deep learning models.
This paper utilizes a radar echo extrapolation method that combines the attributes of convolutional neural networks [24] and the self-attention mechanism [25]. The method excels at extracting both temporal and spatial features, making it well suited to radar echo images that exhibit strong correlations in both temporal and spatial variation. The results indicate that the SimVP-GMA model predicts echo contours with higher accuracy, and its forecast outcomes closely align with the observed values in terms of intensity.
2. Data and Processing Methods
This study utilizes C-band weather radar data and ground precipitation data collected from the Helan Mountain region between 2017 and 2020. We perform a thorough analysis of numerous advanced models using diverse assessment measures, and we enhance the SimVP model, which demonstrates outstanding performance, to create our own model, SimVP-GMA. These experiments focus on short-term weather forecasting within a time frame of 0–1 h. This work enhances existing precipitation nowcasting in three specific aspects. (1) It examines the practical use of state-of-the-art spatiotemporal predictive learning models for real-time radar echo prediction in the Yinchuan region. (2) It proposes a novel model that efficiently combines the SimVP architecture and the group-mix attention (GMA) [26] module without using recurrent connections; this enables the temporal module to be parallelized while accurately capturing long-term temporal evolution, and the proposed technique exhibits exceptional performance in near-term precipitation prediction. (3) By integrating the SCConv module [27] into the encoder section of SimVP, we restrict feature duplication, decreasing the number of model parameters and FLOPs while simultaneously improving the capability of feature representation.
2.1. Data Source
Yinchuan is located in the northwest of the Loess Plateau, in the upper and middle reaches of the Yellow River. Its climate is a semi-humid to semi-arid continental climate. The area is not only arid with little rainfall but also exhibits an uneven spatial and temporal distribution of precipitation. Yinchuan straddles two climatic zones, namely, the warm-summer Mediterranean climate (Csb) and the cool semi-arid climate (BSk). The main climatic features are scarce rain and snow and strong evaporation.
The research area is located on the east side of Helan Mountain, with geographical coordinates of 104° E–107° E and 37° N–40° N and an altitude of 1000–3600 m. There is one C-band Doppler weather radar at Yinchuan station in the research area, together with 254 surface rainfall stations. The radar has a wavelength of 1.5–3.75 cm and a frequency of 4000–8000 MHz and adopts the VCP21 volume scan mode, completing scans at nine elevation angles within 5–6 min. The drainage map of the study area is depicted in Figure 1.
This study employs radar reflectivity factor (Z) data obtained during the rainy season (June to September) in Yinchuan City from 2017 to 2020. The mixed scanning data are acquired after encoding conversion, clutter suppression, attenuation correction, ground clutter correction, and coordinate transformation. The data have a resolution of 1 km × 1 km and are collected at 6-minute intervals. Because most meteorological periods are relatively calm and rain-free, which is not conducive to the network learning the spatial and temporal characteristics of precipitation, the rain-free data are eliminated to ensure data validity. Based on data quality control and statistical analysis, a dataset is constructed from 28 precipitation processes with complete records and obvious precipitation.
We obtain precipitation fields from radar data, utilizing the Z/R relationship (the relationship between radar reflectivity factor and rain rate) and echo levels to determine the intensity and extent of the precipitation fields. The 28 precipitation processes span the period from 2017 to 2020 and represent different seasonal and typological precipitation events. During network training, we account for the diversity of seasons and precipitation types (low-intensity, high-intensity, convective, stratiform, etc.) to ensure that the network can adapt to precipitation prediction tasks under various conditions. A continuous, uninterrupted 2 h span of radar echo images within a precipitation process is selected as one set of sample data, and the data are sliced. The models employed in this work take a sequence of ten consecutive radar echo images as input and generate predictions for the subsequent ten consecutive images, forecasting the weather conditions for the upcoming 0 to 1 h.
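As a concrete illustration of the slicing step, the following Python sketch cuts a continuous radar sequence into 10-frame-input/10-frame-target pairs. The function name and the `stride` parameter are illustrative assumptions; the paper does not specify the overlap between samples (with 6-minute intervals, a 2 h span yields exactly 10 input and 10 target frames).

```python
import numpy as np

def make_samples(frames, t_in=10, t_out=10, stride=1):
    """Slice a continuous radar echo sequence of shape (T, H, W) into
    (10-frame input, 10-frame target) training pairs, as described above."""
    samples = []
    for s in range(0, len(frames) - t_in - t_out + 1, stride):
        x = frames[s : s + t_in]                    # past frames (model input)
        y = frames[s + t_in : s + t_in + t_out]     # future frames (target)
        samples.append((x, y))
    return samples
```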
2.2. Noise Reduction and Anomaly Handling
Radar echo intensities below 10 dBZ are mostly caused by clutter from ground dust, and these data are treated as noise for network learning. Therefore, all pixel grid points below 10 dBZ in an image are set to 0. The relatively isolated interference echoes remaining in the dataset mostly appear as isolated points or thin lines. The filtering method adopted in this paper establishes an N × N rectangular window on the radar echo data; if the number of valid pixels around the center point within the window is less than a specified threshold, the center point is eliminated, that is, considered invalid data. The calculation formula is as follows:
$$P = \frac{M}{N \times N} \times 100\%$$

where $N$ is the side length of the rectangular window, so that $N \times N$ is the total number of pixels within the rectangular region segmented on the image; $M$ is the total number of valid pixels within the rectangular window; and $P$ is the percentage of valid reflectivity factors among all pixels in the window. When $P$ is less than the given threshold, the center pixel is considered an isolated data point and is eliminated.
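The filtering rule above can be implemented as a sliding-window validity count. The sketch below is a minimal Python version assuming `scipy` is available; the window size, threshold value, and function name are illustrative choices, not the paper's exact configuration, and the window fraction here includes the center pixel.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def remove_isolated_echoes(refl, window=5, threshold=0.3, noise_floor=10.0):
    """Suppress noise and isolated clutter in a 2D reflectivity field (dBZ)."""
    # Set pixels below the 10 dBZ noise floor to 0, as described in the text.
    refl = np.where(refl < noise_floor, 0.0, refl)
    valid = (refl > 0).astype(np.float32)
    # Fraction of valid pixels in each N x N neighborhood, i.e. M / (N * N).
    frac = uniform_filter(valid, size=window, mode="constant")
    # Eliminate center points whose neighborhood validity is below the threshold.
    refl[frac < threshold] = 0.0
    return refl
```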
2.3. Normalization Processing
For deep learning neural networks, input data are typically normalized to address issues such as gradient vanishing and gradient explosion in deeper network layers and to improve computational efficiency. Therefore, the radar echo data must be processed before model training so that their numerical range falls between 0 and 1. Firstly, the original radar reflectivity factor is linearly transformed into the pixel value range commonly used in image processing:

$$\text{pixel} = \frac{Z - Z_{\min}}{Z_{\max} - Z_{\min}} \times 255$$

where "pixel" represents the transformed pixel value, falling within the range of 0 to 255, and $Z_{\min}$ and $Z_{\max}$ are the lower and upper reflectivity bounds of the linear mapping. Secondly, normalization is applied. Here, we adopt the direct pixel normalization method commonly used in image processing, with the formula as follows:

$$x = \frac{\text{pixel}}{255}$$
Normalization allows the data to better fit within the gradient descent interval of the activation function, effectively mitigating the issue of gradient vanishing and accelerating the convergence speed of model training.
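A minimal sketch of the two-step transformation follows. The reflectivity bounds `z_min` and `z_max` are illustrative assumptions, since the paper does not state the exact constants of the linear mapping.

```python
import numpy as np

def normalize_reflectivity(z_dbz, z_min=0.0, z_max=70.0):
    """Map reflectivity (dBZ) linearly to [0, 255] pixel values,
    then normalize to [0, 1] by direct pixel normalization."""
    pixel = np.clip((z_dbz - z_min) / (z_max - z_min), 0.0, 1.0) * 255.0
    return pixel / 255.0
```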
2.4. Data Augmentation
In the field of deep learning, a lack of training samples easily leads to overfitting, where the model fits the training set excessively and performs poorly on the test set. The meteorological dataset constructed in this study still provides a limited number of training samples for deep learning. Therefore, all subsequent experiments in this paper employ data augmentation during the training phase. Data augmentation helps the neural network learn more patterns and variations during training, giving it stronger generalization in the face of uncertain changes in future data. During model training, augmentation operations are applied randomly at a certain proportion. The methods used in this paper are as follows, with a sketch given after this list. (1) Random flip transformation: horizontally or vertically flip the image sequence with a certain probability. (2) Random rotation transformation: randomly rotate the selected image sequence by a certain angle. (3) Random reverse transformation: reverse the selected image sequence with a certain probability.
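The sketch below applies the three augmentations to a (T, H, W) radar sequence. The probabilities and the restriction to 90° rotation multiples are illustrative assumptions, since the paper only states that the operations occur randomly at a certain proportion.

```python
import random
import numpy as np

def augment_sequence(seq, p_flip=0.5, p_reverse=0.5, angles=(0, 90, 180, 270)):
    """Randomly augment a radar echo sequence of shape (T, H, W)."""
    if random.random() < p_flip:        # (1) random horizontal/vertical flip
        seq = np.flip(seq, axis=random.choice((1, 2)))
    k = random.choice(angles) // 90     # (2) random rotation (90-degree steps)
    if k:
        seq = np.rot90(seq, k=k, axes=(1, 2))
    if random.random() < p_reverse:     # (3) random temporal reversal
        seq = seq[::-1]
    return np.ascontiguousarray(seq)
```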
3. SimVP-GMA
The purpose of this section is to examine the development of the SimVP-GMA model. Upon assessing the predictive accuracy of different deep learning algorithms, we noted that distinct models performed differently across metrics. For example, although the SimVP model exhibited better performance on the mean squared error (MSE) metric, it performed slightly worse on the learned perceptual image patch similarity (LPIPS) metric. To tackle this issue, we propose the SimVP-GMA model, which combines the advantages of SimVP and the self-attention mechanism. The model effectively captures the interconnections between different points in the input sequence, enabling it to comprehend intricate dependencies within the sequence. The SimVP-GMA model demonstrates noteworthy performance on both the MSE and LPIPS metrics.
3.1. Model Introduction
Spatiotemporal predictive learning aims to infer future frames based on previous ones. Given a sequence $\mathcal{X}_{t,T} = \{x_i\}_{t-T+1}^{t}$ at time $t$ with the past $T$ frames, our purpose is to predict the future sequence $\mathcal{Y}_{t+1,T'} = \{x_i\}_{t+1}^{t+T'}$ at time $t+1$ that contains the next $T'$ frames, where $x_i \in \mathbb{R}^{C \times H \times W}$ is an image with channel $C$, height $H$, and width $W$. Formally, the predicting model is a mapping $\mathcal{F}_{\Theta}: \mathcal{X}_{t,T} \mapsto \mathcal{Y}_{t+1,T'}$ with learnable parameters $\Theta$, optimized by the following:

$$\Theta^{*} = \arg\min_{\Theta} \mathcal{L}\big(\mathcal{F}_{\Theta}(\mathcal{X}_{t,T}),\, \mathcal{Y}_{t+1,T'}\big)$$

where $\mathcal{L}$ can be various loss functions, and we simply employ the MSE loss in our setting.
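As a sketch, one optimization step for $\mathcal{F}_{\Theta}$ under the MSE objective above might look as follows in PyTorch; the function and tensor shapes are illustrative.

```python
import torch.nn.functional as F

def training_step(model, optimizer, x, y):
    """One gradient step minimizing MSE between F_Theta(x) and y.
    `model` maps (B, T, C, H, W) inputs to (B, T', C, H, W) predictions."""
    optimizer.zero_grad()
    loss = F.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```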
The model framework suggested in this study, SimVP-GMA, is depicted in Figure 2. More precisely, the spatial encoder is composed of two traditional 2D convolutional layers and an SCConv layer, while the spatial decoder consists of an SCConv layer followed by two 2D transposed convolutional layers. We streamlined the initial four traditional convolutional layers into two standard 2D convolutional layers and an SCConv layer, in order to minimize the computational expense of redundant feature extraction in visual tasks while encouraging the CNN to produce more distinctive feature representations that carry more comprehensive information. The GMA module stack, positioned between the spatial encoder and decoder, is tasked with extracting temporal information. Although our model has a relatively uncomplicated structure, it is proficient at acquiring spatial and temporal characteristics without depending on recurrent architectures.
During the encoding step, the encoder stacks $N_s$ convolutional modules to extract spatial characteristics in a layered manner. Nevertheless, this stacking incurs a substantial computational expense, since conventional convolutional layers tend to extract a significant amount of duplicate features. To address this problem, we introduce the SCConv module (spatial and channel reconstruction convolution), which is designed to minimize unnecessary computation and enhance the acquisition of more meaningful features. SCConv is a plug-and-play architectural unit that can directly substitute for the usual convolutional layers in different convolutional neural networks. More precisely, the encoding phase, which utilizes LayerNorm for downsampling, can be represented as follows:

$$z_i = \sigma\big(\mathrm{LayerNorm}(\mathrm{Conv2d}(z_{i-1}))\big), \quad i = 1, \dots, N_s$$

where $z_{i-1}$ represents the input at step $i$, and the input $z_0$ and output $z_{N_s}$ shapes are $(T, C, H, W)$ and $(T, \hat{C}, \hat{H}, \hat{W})$, respectively.
The Translator stage utilizes the GMA (group-mix attention) module, which serves as a sophisticated substitute for conventional self-attention. GMA can capture relationships between individual tokens, between tokens and groups, and between multiple groups, regardless of their sizes. The SimVP-GMA model employs GMA to form an encoder–decoder structure in the Translator, which aids the extraction of temporal information and achieves the temporal evolution effect. The formula is as follows:

$$z_j = \mathrm{GMA}(z_{j-1}), \quad j = 1, \dots, N_t$$

where the input and output shapes are $(T \times \hat{C}, \hat{H}, \hat{W})$ and $(T' \times \hat{C}, \hat{H}, \hat{W})$, respectively.
The decoder phase employs two deconvolution layers to reconstruct the ground-truth frame by convolving $C$ channels on $(H, W)$, thereby decoding the Translator information; the decoder replaces convolution operations with deconvolutions. The formula below illustrates the stacking employed by the decoder, which utilizes GroupNorm for upsampling:

$$z_k = \sigma\big(\mathrm{GroupNorm}(\mathrm{unConv2d}(z_{k-1}))\big), \quad k = 1, \dots, N_s$$

where the shapes of the input and output are $(T', \hat{C}, \hat{H}, \hat{W})$ and $(T', C, H, W)$, respectively. We use ConvTranspose2d to serve as the unConv2d operator.
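To make the encoder–Translator–decoder flow concrete, here is a hedged PyTorch skeleton. `SCConv` and `GMABlock` are simple placeholders standing in for the real modules from [27] and [26], and the channel sizes, activations, and layer counts are illustrative rather than the paper's exact configuration (inputs assume H and W divisible by 4).

```python
import torch
import torch.nn as nn

class SCConv(nn.Module):
    """Placeholder for the spatial/channel reconstruction convolution of [27]."""
    def __init__(self, c):
        super().__init__()
        self.body = nn.Conv2d(c, c, 3, padding=1)
    def forward(self, x):
        return self.body(x)

class GMABlock(nn.Module):
    """Placeholder for a group-mix attention block of [26]."""
    def __init__(self, c):
        super().__init__()
        self.body = nn.Conv2d(c, c, 3, padding=1)
    def forward(self, x):
        return self.body(x)

class SimVPGMASketch(nn.Module):
    """Encoder (2x Conv2d + SCConv) -> GMA Translator -> decoder (SCConv + 2x deconv)."""
    def __init__(self, T, C, hid=64, n_gma=4):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(C, hid, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(hid, hid, 3, stride=2, padding=1), nn.SiLU(),
            SCConv(hid),
        )
        # Frames are stacked along channels before temporal modeling.
        self.translator = nn.Sequential(*[GMABlock(T * hid) for _ in range(n_gma)])
        self.dec = nn.Sequential(
            SCConv(hid),
            nn.ConvTranspose2d(hid, hid, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(hid, C, 4, stride=2, padding=1),
        )

    def forward(self, x):                                # x: (B, T, C, H, W)
        B, T, C, H, W = x.shape
        z = self.enc(x.reshape(B * T, C, H, W))          # per-frame spatial encoding
        _, Ch, Hh, Wh = z.shape
        z = self.translator(z.reshape(B, T * Ch, Hh, Wh))  # temporal evolution
        y = self.dec(z.reshape(B * T, Ch, Hh, Wh))       # per-frame spatial decoding
        return y.reshape(B, T, C, H, W)
```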
3.2. Spatial and Channel Reconstruction Convolution
The encoder in the spatial module is composed of two traditional 2D convolutional layers and one SCConv layer. The SCConv module offers a novel approach to the feature extraction process of CNNs. It includes a strategy to optimize the utilization of spatial and channel redundancy information, with the goal of reducing redundant features and enhancing model performance. The SCConv module combines two essential components: the spatial reconstruction unit (SRU) and the channel reconstruction unit (CRU).
Figure 3 depicts the SCConv design, which integrates the SRU and CRU components, and shows the location of the SCConv module within the ResBlock.
SRU reduces spatial redundancy through a separate-and-reconstruct approach, while CRU employs a split–transform–merge strategy to minimize channel redundancy; the two units work together to reduce redundant information in CNN features. As a plug-in convolution module, SCConv can directly replace standard convolution operations in various convolutional neural networks, thus reducing redundant features and computational complexity.
As shown in Figure 3, SCConv consists of two units, the spatial reconstruction unit (SRU) and the channel reconstruction unit (CRU), arranged in sequence. The input feature X first passes through the spatial reconstruction unit to obtain the spatially refined feature X^w; this then goes through the channel reconstruction unit to obtain the channel-refined feature Y as the output.
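The SRU's separate-and-reconstruct step can be sketched as follows. This is a simplified reading of the unit described above, using the GroupNorm scale as a per-channel informativeness weight; the official SCConv implementation differs in details, and the group count and gate threshold here are illustrative.

```python
import torch
import torch.nn as nn

class SRU(nn.Module):
    """Minimal sketch of the spatial reconstruction unit (separate-and-reconstruct)."""
    def __init__(self, channels, groups=4, gate_threshold=0.5):
        super().__init__()
        assert channels % 2 == 0, "cross-reconstruction needs an even channel count"
        self.gn = nn.GroupNorm(groups, channels)
        self.gate_threshold = gate_threshold

    def forward(self, x):
        gn_x = self.gn(x)
        # Learned GN scales act as informativeness weights for each channel.
        w = (self.gn.weight / self.gn.weight.sum()).view(1, -1, 1, 1)
        gate = torch.sigmoid(gn_x * w)
        # Separate informative from redundant responses via the gate.
        informative = torch.where(gate > self.gate_threshold, gate, torch.zeros_like(gate))
        redundant = torch.where(gate <= self.gate_threshold, gate, torch.zeros_like(gate))
        x1, x2 = informative * x, redundant * x
        # Cross-reconstruct: mix the two parts along the channel dimension.
        a1, a2 = torch.chunk(x1, 2, dim=1)
        b1, b2 = torch.chunk(x2, 2, dim=1)
        return torch.cat([a1 + b2, a2 + b1], dim=1)
```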
3.3. Group-Mix Attention
Vision transformers (ViTs) have demonstrated their ability to improve visual recognition through multi-head self-attention (MHSA), which models long-range relationships and is commonly implemented as a query–key–value computation. Nevertheless, the attention maps produced from the query and key only capture relationships between tokens at a single level of detail. The GMA module uses a self-attention mechanism to effectively capture the relationships between individual tokens and groups of adjacent tokens, allowing for enhanced representational capability. Group-mix attention (GMA) can therefore be considered a sophisticated substitute for conventional self-attention: it captures correlations between tokens, between tokens and groups, and between groups themselves, all at varying group sizes.
Figure 4 depicts the computation of GMA using five 3D tokens. To determine the correlation between two distinct groups, each consisting of three tokens, the groups are combined into two proxies, which are then multiplied together. Group aggregation can be effectively accomplished using a sliding-window-based operator, as depicted in the figure.
Within every GMA block, the components Q, K, and V are partitioned into five segments. To create group proxies, aggregators with varying kernel sizes are applied to four of these segments, enabling the computation of attention scores over mixtures of individual tokens and group proxies at various levels of detail. The branch that feeds its output into the attention computation is known as the pre-attention branch. The non-attention branch, located on the rightmost side, uses aggregation to establish additional connections without attending over them. A linear mapping layer combines the outputs of the attention branch and the non-attention branch. In summary, GMA divides the query, key, and value into several segments, applies different group aggregations to produce group proxies in a consistent manner, computes the attention map over the mixture of tokens and group proxies, and uses it to recombine the tokens and groups in the value.
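To make the grouping concrete, the following is a minimal, illustrative group-mix attention for a flattened token sequence: four of the five Q/K/V segments are aggregated into group proxies by depthwise convolutions over the token axis with different kernel sizes, while one segment keeps individual tokens, and attention then runs on the mixture. The non-attention branch and multi-head splitting are omitted for brevity; this is a sketch, not the reference implementation of [26], and the kernel sizes are assumptions.

```python
import math
import torch
import torch.nn as nn

class GroupMixAttention(nn.Module):
    """Simplified group-mix attention over a token sequence of shape (B, N, dim)."""
    def __init__(self, dim, kernels=(3, 5, 7, 9)):
        super().__init__()
        assert dim % 5 == 0, "dim must split into 5 equal segments"
        self.seg = dim // 5
        self.qkv = nn.Linear(dim, 3 * dim)
        # One depthwise token-axis aggregator (group proxy) per proxy segment.
        self.aggs = nn.ModuleList(
            nn.Conv1d(self.seg, self.seg, k, padding=k // 2, groups=self.seg)
            for k in kernels
        )
        self.proj = nn.Linear(dim, dim)

    def _mix(self, x):                        # x: (B, N, dim)
        parts = x.split(self.seg, dim=-1)     # five segments of width seg
        mixed = [parts[0]]                    # segment 0: individual tokens
        for agg, p in zip(self.aggs, parts[1:]):
            # (B, N, seg) -> (B, seg, N) so the conv slides over the token axis.
            mixed.append(agg(p.transpose(1, 2)).transpose(1, 2))
        return torch.cat(mixed, dim=-1)

    def forward(self, x):                     # x: (B, N, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = self._mix(q), self._mix(k), self._mix(v)
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(q.shape[-1])
        out = attn.softmax(dim=-1) @ v        # reassemble tokens and groups in V
        return self.proj(out)
```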
Figure 5 depicts the structure of the group-mix attention block.
Following data preprocessing, LayerNorm is first utilized for downsampling, followed by SRU to decrease spatial redundancy and CRU to decrease channel redundancy; these two units cooperate to minimize repetitive information in the CNN features. The Translator then utilizes group-mix attention to extract temporal features and achieve the temporal evolution effect. The output is finally upsampled using GroupNorm to produce the predicted results.
5. Discussion
To make a meaningful comparison of the performance of each model, a sequence was selected at random from the test set, with a time interval of 6 min. The visualization was created by translating pixel data back into radar echo values, assigning colors to each value range, and showing the results alongside the predictions of the models. The first row of 10 images, captured at 6-minute intervals, represents the ground truth, while the subsequent rows display the prediction outcomes of each model for the same time periods.
Figure 6 presents the visualization results of the recurrent-based models, which often exhibit image blurring as the forecasting time increases. The main reasons are as follows. Firstly, spatiotemporal information is lost, and the loss increases with longer forecasting periods. Secondly, radar echo movement and evolution are influenced by factors such as wind speed and direction, temperature, and terrain; using only radar echo images as model input cannot fully capture changes in radar echoes, especially echoes generated from scratch, so the forecasting effectiveness of the three convolutional recurrent neural network models based on a single data source is relatively weak. Thirdly, deep learning tends to average the error over the entire image during training, which drives the radar echo values toward homogenization in the later forecast stages, resulting in visual "blurring". Among the recurrent-based models, PredRNN and PredRNN++ demonstrate good predictive capability for changes in echo location and intensity. PredRNN++ retains some forecasting ability for the echo center over longer forecast periods (1 h), though it underestimates the intensity; however, both models remain relatively weak at forecasting the echo generation process. The PredNet model, originally proposed for video prediction, performs exceptionally well on the perceptual loss metric, but due to its bottom-up mechanism, its predictions exhibit repetition, with important features of the previous image recurring every 12 min, making it difficult to accurately predict changes in echo location.
Figure 7 presents the visualization results of the recurrent-free models. Their performance is generally better than that of the recurrent-based models, and among them, our SimVP-GMA model performs excellently. SimVP-GMA excels at producing smooth prediction results, particularly within the initial 24 min, where its performance surpasses that of the other models. The addition of the GMA module, which captures long-term temporal evolution, yields a noticeable enhancement in predicted detail compared with SimVP. In summary, the SimVP-GMA model exhibits outstanding performance when applied to radar echo extrapolation with deep learning techniques.
The SimVP-GMA model excels on various metrics, including mean squared error (MSE), mean absolute error (MAE), probability of detection (POD), critical success index (CSI), and false alarm rate (FAR), collectively underscoring its high accuracy in predicting the spatiotemporal evolution of radar echoes.
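For reference, POD, FAR, and CSI follow the standard contingency-table definitions, computed at a reflectivity threshold; the sketch below uses the 30 dBZ threshold that mirrors the discussion that follows, with the function name being illustrative.

```python
import numpy as np

def forecast_scores(pred_dbz, true_dbz, threshold=30.0):
    """POD, FAR, and CSI at a reflectivity threshold (standard definitions)."""
    pred = pred_dbz >= threshold
    true = true_dbz >= threshold
    hits = np.sum(pred & true)
    misses = np.sum(~pred & true)
    false_alarms = np.sum(pred & ~true)
    pod = hits / (hits + misses) if hits + misses else np.nan
    far = false_alarms / (hits + false_alarms) if hits + false_alarms else np.nan
    csi = hits / (hits + misses + false_alarms) if hits + misses + false_alarms else np.nan
    return pod, far, csi
```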
Two pivotal modules integrated into the SimVP-GMA model, SCConv and GMA, significantly contribute to its enhanced performance. They effectively capture long-term spatiotemporal dependencies, resulting in more precise and detailed predictions. Notably, the model generates smooth predictions within the initial 24 min, a notable advantage over other models. This characteristic is particularly valuable for applications requiring short-term radar echo extrapolation, such as weather forecasting and disaster management. Furthermore, SimVP-GMA demonstrates robust predictive capability for both echo position and intensity variations up to 60 min, indicating its strength in handling radar echoes with high intensities and rapid changes.
Despite its remarkable performance, the SimVP-GMA model also faces limitations. The added modules increase the number of model parameters and the computational resources required, posing challenges for model deployment. Additionally, when predicting radar echoes with dBZ values exceeding 30, the model's performance diminishes slightly, with a somewhat lower CSI than the SimVP model. In conclusion, the SimVP-GMA model proposed in this study represents a significant improvement over existing models for radar echo extrapolation tasks. Future research will focus on reducing the number of parameters and the computational complexity of SimVP-GMA while maintaining its performance, and on integrating multi-source data (such as wind speed, wind direction, temperature, and terrain information) for more accurate predictions.