ST-DualNet: A Spatiotemporal Dual-Branch Neural Network Model for Short-Term Precipitation Forecasting

Dang, Yuan; Yin, Bo; Cui, Haipeng; Bi, Tao; Guo, Yiyun

doi:10.3390/rs18101567


by Yuan Dang ¹, Bo Yin ¹,²,*, Haipeng Cui ², Tao Bi ³ and Yiyun Guo ¹

¹ College of Information Science and Engineering, Ocean University of China, Qingdao 266100, China
² Qingdao Jari Industry Control Technology Co., Ltd., Qingdao 266100, China
³ Qingdao Port International Co., Ltd., Qingdao 266000, China
* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(10), 1567; https://doi.org/10.3390/rs18101567

Submission received: 27 March 2026 / Revised: 7 May 2026 / Accepted: 11 May 2026 / Published: 14 May 2026


Highlights

What are the main findings?

A dual-branch network architecture named ST-DualNet is proposed. By constructing independent temporal and spatial branches, this architecture explicitly separates the tasks of modeling the dynamic evolution and static structure of precipitation radar echoes, thereby alleviating the issue of feature ambiguity in precipitation forecasting.
On the KNMI radar dataset, this model significantly outperforms mainstream models such as SmaAt-UNet and PredRNN++ on key metrics including CSI and HSS, demonstrating excellent forecasting robustness particularly under challenging extreme precipitation scenarios.

What is the implication of the main finding?

With only 5.12 million parameters and a computational cost of 14.04 GFLOPs, the model is feasible for real-time meteorological operations on standard consumer-grade GPUs.
By adaptively modeling the non-rigid deformation of complex radar echoes, ST-DualNet provides an efficient and scalable framework for short-range forecasting. In the future, the reliability of forecasts for complex weather could be further enhanced by incorporating wind field or satellite data.

Abstract

Short-term precipitation forecasting is an important research direction in meteorology, with significant implications for disaster prevention and mitigation, urban flood drainage, and agricultural meteorological management. Existing deep learning models have achieved favourable results in modeling local features, yet they generally suffer from insufficient sensitivity to heavy precipitation areas, limited ability to model temporal dependencies, and gradient instability. To address these limitations, we propose a novel spatiotemporal dual-branch neural network (ST-DualNet) for short-term precipitation forecasting based on radar echo maps. The network comprises a temporal branch (based on an enhanced ST-DConvLSTM) and a spatial branch (based on dilated convolutions and a Transformer), which capture the dynamic evolution and the spatial structural features of precipitation, respectively. The two branches are integrated through a Convolutional Block Attention Module (CBAM) and a 3D convolution layer to achieve cross-branch feature fusion and produce the prediction output. Experimental results demonstrate that ST-DualNet outperforms multiple mainstream models on the KNMI radar precipitation dataset, especially in heavy precipitation forecasting, providing an effective new framework for short-term precipitation forecasting.

Keywords:

precipitation forecasting; spatiotemporal prediction; ConvLSTM; attention mechanism

1. Introduction

Short-term precipitation forecasting, also known as precipitation nowcasting, is a significant research topic within meteorology [1]. Since ancient times, precipitation has played a substantial role in daily life, with our ancestors observing natural phenomena to predict the arrival of rain and prepare accordingly. In the modern era, precipitation forecasting retains considerable research value for both societal life and production activities. By temporal scale, precipitation forecasting can be divided into short-term, medium-term [2], and long-term [3] forecasting. Among these, short-term precipitation exerts the most severe impact on human society and production, while also presenting the greatest forecasting challenges [4]. Short-term precipitation forecasting aims to predict the intensity and distribution of rainfall over the next 0–2 h using radar image sequences [5]. Its outcomes hold immense practical value in areas such as heavy rainfall warnings, urban flood prevention, and meteorological support for shipping operations [6].

Existing short-term precipitation forecasting methods can be broadly categorized into numerical weather prediction (NWP) [7] methods and radar echo reflectivity extrapolation [8] methods. NWP methods rely primarily on mathematical models, requiring complex atmospheric equations and massive amounts of observational data. Although they are generally accurate in weather forecasting, they have limitations in short-term precipitation forecasting because of their high computational cost, poor real-time performance, and difficulty in capturing complex atmospheric processes at short time scales and small spatial scales [9]. In contrast, deep learning methods based on radar echo reflectivity extrapolation can predict future echo distributions using only historical radar image sequences [10,11]. They then use the Z-R relationship [12] (where Z is radar reflectivity and R is precipitation intensity) to convert reflectivity into precipitation, enabling rapid forecasting. This approach excels in timeliness and scalability and has gradually become the mainstream choice for short-term precipitation forecasting [13].

In recent years, deep learning techniques have been extensively applied to various meteorological forecasting tasks and have demonstrated remarkable success [14,15]. Especially in the field of short-term precipitation forecasting, existing methods based on deep learning can be roughly divided into two categories according to their core architecture: one is mainly based on convolutional neural networks (CNNs) [16] for spatial feature extraction, and the other is based on recurrent neural networks (RNNs) for temporal dynamic modeling [17,18].

CNN-based models have attracted considerable attention in short-term precipitation forecasting due to their robust spatial feature extraction capabilities. A series of variants centered on the classic U-Net encoder–decoder architecture have achieved notable progress in this domain. SmaAt-UNet [19], proposed by Kevin Trebing et al., stands as a foundational and highly efficient representative work in this direction. This model integrates the Convolutional Block Attention Module (CBAM) within the U-Net architecture and employs depthwise-separable convolutions (DSC) [20] to replace standard convolutions. This approach reduces the number of parameters to approximately one-quarter of the original U-Net while maintaining comparable or even superior forecasting performance. The model owes its advantage to two factors: CBAM sharpens its focus on key meteorological features, and DSC substantially improves its efficiency and deployment feasibility.

To more effectively extract multi-scale spatial features, Jesús García Fernández et al. proposed Broad-UNet [21]. This model introduces asymmetric parallel convolutions and Atrous Spatial Pyramid Pooling (ASPP) [22] modules within the U-Net encoder. Asymmetric parallel convolutions simultaneously extract multi-scale features using convolutional kernels of varying sizes, while the ASPP module expands the receptive field through convolutions with different dilation rates to fuse more global contextual information. Experiments demonstrate that Broad-UNet achieves superior accuracy to baseline models in both precipitation and cloud cover forecasting tasks.

In recent years, the Transformer architecture has been introduced into visual tasks to address the limitations of CNNs due to its robust global modeling capabilities. The representative models include UTrans-Net [23] proposed by Hao Cao et al. and AA-TransUNet [24] proposed by Yimin Yang et al. UTrans-Net attempts to integrate Transformer modules into U-Net and uses a self-attention mechanism to assign weights to different meteorological elements in order to improve the effectiveness of feature extraction.

By contrast, AA-TransUNet [24] proposes a more systematic and higher-performing fusion architecture. Centred on TransUNet (a hybrid CNN-Transformer encoder and U-Net decoder), it further integrates the CBAM and depthwise-separable convolutions. This model creatively incorporates CBAM into the CNN part of the encoder and each layer of the decoder, achieving dual attention enhancement in both channel and spatial dimensions, while using DSC in the decoder to control the number of parameters. Comprehensive experiments demonstrate that AA-TransUNet outperforms prior models such as SmaAt-UNet [19] and Broad-UNet [21] across multiple evaluation metrics, proving the superiority of hybrid architectures combining global attention with local convolutions for short-term forecasting tasks.

Recurrent neural networks (RNNs) and their variants possess inherent advantages in sequence prediction tasks due to their robust temporal modeling capabilities. In the field of short-term precipitation forecasting, researchers have focused on integrating RNNs with spatial feature extraction capabilities to construct predictive models capable of jointly modeling spatiotemporal dependencies.

A pioneering work in this area is the ConvLSTM [25] proposed by Shi et al. This model was the first to replace the matrix multiplications in the fully connected LSTM (FC-LSTM) with convolution operations, introducing the convolutional LSTM unit. This innovation allows the state transitions within the model to retain spatial structure, thereby enabling direct processing of spatiotemporal sequence data. By stacking layers and constructing an encoding–forecasting architecture, ConvLSTM achieves end-to-end training and models spatiotemporal correlations in a unified way. It significantly outperforms traditional optical flow methods (such as the ROVER algorithm) and FC-LSTM in precipitation forecasting tasks.

To enhance the modeling capability for complex spatiotemporal dynamics, subsequent studies have substantially improved the structure of ConvLSTM. PredRNN [26] proposes a recurrent neural network architecture for spatiotemporal predictive learning, whose core is a novel Spatiotemporal Long Short-Term Memory unit (ST-LSTM). This method introduces a unified memory pool, allowing memory states to be transferred vertically between stacked RNN layers and to flow horizontally across time steps, thus simultaneously modeling spatial appearance and temporal dynamics. PredRNN achieved state-of-the-art performance on multiple video prediction datasets at the time. Its advantages lie in its ability to effectively capture long-term motion trajectories and detailed spatial deformations, generate clearer and more accurate prediction frames, and offer a flexible framework that is easily extendable to other forecasting tasks.

Based on ConvLSTM, to overcome the limitation posed by the positional invariance of its convolutional structure in modeling complex motions such as rotation and scaling, Shi et al. further proposed the Trajectory GRU (TrajGRU) [27] model. The core improvement of this model lies in transforming the recurrent connection structure from static convolution to dynamic learning. TrajGRU retains the encoding–forecasting framework of ConvLSTM while significantly enhancing the model’s ability to model location-variant motions like rotation and scaling. Furthermore, this work concurrently established the first large-scale precipitation forecasting benchmark, encompassing data, balanced loss functions, and online/offline evaluation protocols. It has laid a crucial foundation for subsequent research.

U-Net-based CNN models, together with their variants that introduce attention mechanisms and Transformers, have continuously advanced short-term precipitation forecasting by improving the efficiency of feature extraction. Nevertheless, the key challenge remains how to more effectively model irregular spatial patterns and non-stationary long-term temporal dependencies in tandem. RNN-based methods have progressively enhanced the capacity to represent complex spatiotemporal dynamic evolution by designing more sophisticated memory mechanisms (e.g., PredRNN) and more flexible connection schemes (e.g., TrajGRU). Although these methods perform well in modeling temporal dependencies, they usually rely on sequential recursive computations, resulting in low training parallelism and long training time. Concurrently, the efficient integration of multi-scale spatial features with refined temporal modeling remains a challenge for this class of methods. The shortcomings of both CNN-based and RNN-based models in short-term precipitation forecasting form the impetus for this work.

To address the challenges of current short-term precipitation forecasting in handling irregular spatial morphology and complex spatiotemporal evolution, we propose a novel spatiotemporal dual-branch neural network—ST-DualNet. This model achieves explicit decoupled learning and deep fusion of the dynamic evolution process and spatial hierarchical features of the precipitation field through parallel temporal and spatial branches.

The temporal branch of the ST-DualNet model is constructed around our newly designed Spatiotemporal Deformable Convolutional Long Short-Term Memory (ST-DConvLSTM) module. This module is one of the core innovations of ST-DualNet and makes two key improvements to the traditional ConvLSTM. First, it enhances the ability to remember long-term weather patterns by introducing an independent memory state M. Second, it replaces standard convolutions with deformable convolutions. This substitution endows the network with an adaptive receptive field, enabling it to concentrate its sampling on the regions most valuable for prediction. The model can thus capture the key dynamics of the non-rigid and irregular morphology of precipitation systems, improving performance without excessively increasing computational cost. Meanwhile, the spatial branch of the model integrates dilated convolution [28] and Transformer [29] modules in parallel, collaboratively capturing multi-scale local textures and global spatial dependencies in precipitation radar echo maps. Finally, we feed the heterogeneous features from the two branches into the CBAM [30] attention module to achieve adaptive weighting of features and fusion of multidimensional information, thereby generating prediction results with high spatiotemporal consistency.

The main contributions of this paper are summarised as follows:

We propose a novel dual-branch network architecture named ST-DualNet, which provides an efficient and logically clear modeling framework for complex spatiotemporal sequence forecasting tasks by explicitly decoupling the learning process of temporal dynamics and spatial features.
The ST-DConvLSTM module is designed as the core component of the temporal branch. By integrating deformable convolution and an independent memory state M, it enhances the capability of the model in modeling irregular spatial patterns and long-term temporal dependencies.
A hybrid spatial feature extraction branch is constructed, which integrates dilated convolutions and Transformers. This branch effectively captures multi-scale and global spatial information of precipitation fields. Additionally, the CBAM module is adopted to achieve intelligent fusion of the dual-branch features.
Comprehensive comparative experiments were conducted on the publicly available KNMI radar precipitation dataset [19]. The results show that our proposed ST-DualNet model significantly outperforms existing baseline methods on several key evaluation metrics.

The structure of this paper is as follows. In Section 2, we propose a spatiotemporal dual-branch neural network model for short-term precipitation forecasting. In Section 3, we present comparative and ablation experiments of the ST-DualNet model. Section 4 and Section 5 are the discussion and conclusion sections of this paper.

2. Materials and Methods

2.1. Problem Definition

The short-term precipitation forecasting problem aims to utilise observed historical radar echo sequences to predict the spatial distribution of radar echoes over a fixed future time period. Given the established Z-R relationship between the radar reflectivity factor Z and precipitation intensity R, this problem can be transformed into a task of predicting future radar echo sequences.

To better describe this problem, we discretise the target area into a regular grid with spatial resolution $H \times W$, where each grid point possesses $P$ associated observation variables at time $t$. In this study, we employ a single radar reflectivity factor; hence $P = 1$. The observed state of the entire region at time $t$ can be represented as a third-order tensor $X_{t} \in \mathbb{R}^{P \times H \times W}$. Let the historical observation sequence be denoted as $X = \{X_{t-m+1}, X_{t-m+2}, \dots, X_{t}\}$, comprising $m$ frames; the forecasted future sequence is denoted as $\hat{Y} = \{\hat{Y}_{t+1}, \hat{Y}_{t+2}, \dots, \hat{Y}_{t+n}\}$, comprising $n$ frames [31]. The short-term precipitation forecasting problem can be formulated as:

$$
\hat{Y} = \Phi(X), \tag{1}
$$

where $\Phi$ denotes the proposed ST-DualNet model in this paper. The objective of this problem is to make the predicted sequence $\hat{Y}$ as close as possible to the true future sequence $Y$ in terms of both spatial structure and temporal evolution.

In this study, the historical frame length is set to $m = 12$, and the prediction frame length to $n = 6$, with a temporal resolution of 5 min per frame. The spatial dimensions of each input and output image are $288 \times 288$ pixels. The model directly outputs the reflectivity field, which can subsequently be converted to precipitation intensity via a standard Z-R relationship, thereby enabling end-to-end short-term precipitation forecasting.
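The conversion step mentioned above can be illustrated with a minimal sketch. The text does not state which Z-R coefficients are used, so the widely used Marshall-Palmer values (a = 200, b = 1.6 in Z = a·R^b) are assumed here purely for illustration, as is the assumption that reflectivity is expressed in dBZ; the function names are hypothetical.

```python
import math

# Assumed Marshall-Palmer coefficients for Z = a * R**b.
# The paper only refers to "a standard Z-R relationship", so these are illustrative.
A_COEFF = 200.0
B_COEFF = 1.6


def dbz_to_rain_rate(dbz, a=A_COEFF, b=B_COEFF):
    """Convert reflectivity in dBZ to a rain rate in mm/h by inverting Z = a * R**b."""
    z_linear = 10.0 ** (dbz / 10.0)      # dBZ -> linear reflectivity factor
    return (z_linear / a) ** (1.0 / b)   # solve Z = a * R**b for R


def rain_rate_to_dbz(rate, a=A_COEFF, b=B_COEFF):
    """Convert a positive rain rate in mm/h to reflectivity in dBZ."""
    if rate <= 0.0:
        raise ValueError("rain rate must be positive")
    return 10.0 * math.log10(a * rate ** b)


if __name__ == "__main__":
    for dbz in (20.0, 35.0, 50.0):
        print(dbz, "dBZ ->", dbz_to_rain_rate(dbz), "mm/h")
```

Because the relationship is a power law, small errors at high reflectivity translate into large errors in rain rate, which is one reason heavy-precipitation skill is evaluated separately.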

2.2. Base Model: ConvLSTM

ConvLSTM [25], serving as the foundational architecture for the improved model presented herein, is a recurrent neural network capable of simultaneously extracting temporal dependencies and spatial features. Unlike traditional fully connected LSTM, ConvLSTM models input data, hidden states, and cell states as three-dimensional tensors, creatively replacing matrix multiplication operations within the gating mechanism with convolutional operations. During temporal iteration, this model employs convolutional layers to jointly extract features from the current input and the previous hidden state, thereby explicitly preserving the spatial topological structure of the data during state transitions. This intrinsic mechanism enables ConvLSTM to effectively overcome the loss of spatial information caused by one-dimensional vectorisation, providing a robust feature representation framework for constructing predictive models for high-dimensional spatiotemporal sequences. The formula for ConvLSTM is as follows:

$$
\begin{aligned}
i_{t} &= \sigma(W_{xi} * X_{t} + W_{hi} * H_{t-1} + W_{ci} \odot C_{t-1} + b_{i}) \\
f_{t} &= \sigma(W_{xf} * X_{t} + W_{hf} * H_{t-1} + W_{cf} \odot C_{t-1} + b_{f}) \\
\tilde{C}_{t} &= \tanh(W_{xc} * X_{t} + W_{hc} * H_{t-1} + b_{c}) \\
C_{t} &= f_{t} \odot C_{t-1} + i_{t} \odot \tilde{C}_{t} \\
o_{t} &= \sigma(W_{xo} * X_{t} + W_{ho} * H_{t-1} + W_{co} \odot C_{t} + b_{o}) \\
H_{t} &= o_{t} \odot \tanh(C_{t}).
\end{aligned}
\tag{2}
$$

Here, $X_{t}$ denotes the input tensor at time step $t$, while $H_{t}$ and $C_{t}$ represent the hidden state and memory cell state at that moment respectively, all being three-dimensional tensors retaining spatial dimensions. $i_{t}$, $f_{t}$, and $o_{t}$ correspond to the input gate, forget gate, and output gate respectively. These gates dynamically regulate the retention and transmission of information flows through the Sigmoid activation function $\sigma$. $\tilde{C}_{t}$ denotes the candidate memory state generated by the tanh activation function. $W$ and $b$ represent the learnable convolutional kernels and bias terms associated with each gate. The operator $*$ denotes the convolution operation, while $\odot$ denotes the Hadamard product. The structural diagram of ConvLSTM is shown in Figure 1.
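To make the gate arithmetic of Equation (2) concrete, the following sketch evaluates one update at a single pixel, under the simplifying assumption that every convolution collapses to a 1 × 1 kernel on one channel, so each term becomes a scalar product. The dictionary keys mirror the weight subscripts in Equation (2); `uniform_params` is a hypothetical helper used only to build example weights.

```python
import math

W_KEYS = ("xi", "hi", "ci", "xf", "hf", "cf", "xc", "hc", "xo", "ho", "co")
B_KEYS = ("i", "f", "c", "o")


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def uniform_params(weight, bias=0.0):
    """Hypothetical helper: set every kernel to `weight` and every bias to `bias`."""
    return {k: weight for k in W_KEYS}, {k: bias for k in B_KEYS}


def convlstm_pixel_step(x, h_prev, c_prev, w, b):
    """One ConvLSTM update at a single pixel with scalar (1x1) kernels."""
    i = sigmoid(w["xi"] * x + w["hi"] * h_prev + w["ci"] * c_prev + b["i"])
    f = sigmoid(w["xf"] * x + w["hf"] * h_prev + w["cf"] * c_prev + b["f"])
    c_tilde = math.tanh(w["xc"] * x + w["hc"] * h_prev + b["c"])
    c = f * c_prev + i * c_tilde                       # blend old memory and candidate
    o = sigmoid(w["xo"] * x + w["ho"] * h_prev + w["co"] * c + b["o"])
    h = o * math.tanh(c)                               # gated hidden state
    return h, c


if __name__ == "__main__":
    w, b = uniform_params(0.5)
    h, c = 0.0, 0.0
    for x in (0.1, 0.4, 0.9):                          # a toy three-frame sequence
        h, c = convlstm_pixel_step(x, h, c, w, b)
    print(h, c)
```

In the full model each scalar product is a spatial convolution over feature maps, which is what preserves the spatial topology during state transitions.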

ConvLSTM integrates spatial feature extraction with sequence modeling through its convolutional structure, significantly enhancing the ability to model the spatiotemporal evolution of precipitation systems. This represents a landmark achievement in the field of short-term precipitation forecasting. However, over extended forecast horizons, the model struggles to maintain stable information propagation, frequently exhibiting cumulative prediction errors and a loss of detail.

2.3. Network Structure

2.3.1. Whole Network

This model constructs a parallel dual-branch network architecture with differentiated focuses. The temporal branch dominates short-term dynamic capture, and the spatial branch dominates global static modeling. The overall architecture of the model is shown in Figure 2.

The model receives a series of 12 consecutive radar echo images as input. The temporal branch utilises the complete historical sequence $X$ as input, employing a deep recurrent encoder–decoder architecture to capture the dynamic temporal variations and non-rigid motion trajectories of precipitation processes. The spatial branch does not simply select a single frame image, but performs temporal average pooling on the input sequence, aggregating the input features from T time steps along the temporal dimension to generate a representative static spatial feature map. This process eliminates short-term fluctuation noise and extracts the global spatial structure and average intensity distribution of the precipitation system within that time period. Subsequently, the dynamic temporal features from the temporal branch and the global spatial features from the spatial branch that have undergone temporal replication are concatenated in the channel dimension to form a fused feature tensor. To further optimize feature representation, the fused features are 3D normalized before being fed into the CBAM module. This allows the model to adaptively select key temporal and spatial information from the fused features, strengthening the complementarity of the two branches. The features enhanced by attention are further input into a 3D convolutional layer for spatiotemporal consistency modeling, ultimately outputting predictions for the next 6 frames of radar echo images.
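The replication and channel-wise concatenation described above can be followed through a small piece of shape bookkeeping. This is only a sketch: the channel counts (16 per branch) and the assumption that the temporal branch emits one feature map per predicted frame are illustrative and are not stated in the text; `fused_shape` is a hypothetical name.

```python
def fused_shape(temporal_shape, spatial_shape):
    """Return the shapes after temporal replication and channel concatenation.

    temporal_shape: (B, C_t, T, H, W) features from the temporal branch.
    spatial_shape:  (B, C_s, H, W) static features from the spatial branch.
    """
    b, c_t, t, h, w = temporal_shape
    b_s, c_s, h_s, w_s = spatial_shape
    if (b, h, w) != (b_s, h_s, w_s):
        raise ValueError("batch and spatial sizes of the two branches must match")
    replicated = (b, c_s, t, h, w)       # copy the static map along the time axis
    fused = (b, c_t + c_s, t, h, w)      # concatenate along the channel axis
    return replicated, fused


if __name__ == "__main__":
    # Assumed sizes: batch 2, 16 channels per branch, 6 time steps, 288 x 288 grid.
    replicated, fused = fused_shape((2, 16, 6, 288, 288), (2, 16, 288, 288))
    print("replicated spatial features:", replicated)
    print("fused tensor:", fused)
```

Concatenating along channels rather than summing keeps the two feature sources separable, so the subsequent attention module can weight them independently.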

2.3.2. ST-DConvLSTM

Within the temporal branch, we propose an improved core unit, ST-DConvLSTM, to address two problems in deep networks: the difficulty of adaptively modeling the complex non-rigid deformation of precipitation radar echoes, and the decay and forgetting of long-term spatiotemporal information. The overall structure is shown in Figure 3 and incorporates two key technical improvements.

The first improvement involves integrating a deformable convolution mechanism [32] within the ST-DConvLSTM module. Traditional ConvLSTM modules rely on standard convolution operators, whose receptive fields are constrained by fixed geometric grid structures, such as 3 × 3 rectangles. This design fundamentally corresponds to the Eulerian View in fluid dynamics, observing changes in fluid properties from fixed spatial positions. However, radar echoes reveal precipitation systems as highly dynamic fluids exhibiting pronounced non-rigid deformation characteristics such as rotation and diffusion. A fixed receptive field struggles to adapt to such complex geometric transformations, easily leading to loss of texture and deviation in motion trajectories in the predicted image.

To address this, ST-DConvLSTM innovatively introduces deformable convolutions (DCN) to replace standard convolutions within units. This mechanism endows the network with the capability to dynamically adjust the shape of its receptive field by incorporating data-driven learnable two-dimensional offsets at the sampling points of the convolutional kernel, as illustrated in Figure 4. From a mathematical perspective, this enhancement facilitates a shift towards a Lagrangian View—the model is no longer constrained by a fixed grid but can actively track feature points within precipitation echoes. Regardless of how complex the non-rigid deformations of precipitation systems become, the deformable convolution kernel adaptively locks onto the same physical features by dynamically adjusting sampling positions, thereby significantly enhancing the model’s accuracy and spatiotemporal consistency in modeling complex atmospheric fluid motions, as shown in Figure 5.
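The sampling mechanism can be sketched for a single output pixel as follows. Each of the nine taps of a 3 × 3 kernel is read at its regular grid position plus a fractional offset, using bilinear interpolation. In an actual deformable convolution the offsets are predicted by a separate convolution; here they are simply passed in, and all function names are hypothetical.

```python
import math


def bilinear_sample(img, y, x):
    """Bilinearly interpolate a 2D list `img` at fractional (y, x); zero outside."""
    h, w = len(img), len(img[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1.0 - abs(y - yy)) * (1.0 - abs(x - xx))
                value += weight * img[yy][xx]
    return value


def deformable_conv_at(img, kernel, offsets, cy, cx):
    """Response of a 3x3 deformable convolution at output pixel (cy, cx).

    kernel:  3x3 list of weights.
    offsets: 3x3 list of (dy, dx) pairs, one learned offset per kernel tap.
    """
    total = 0.0
    for ky in range(3):
        for kx in range(3):
            dy, dx = offsets[ky][kx]
            sy = cy + (ky - 1) + dy      # regular grid position plus offset
            sx = cx + (kx - 1) + dx
            total += kernel[ky][kx] * bilinear_sample(img, sy, sx)
    return total


if __name__ == "__main__":
    img = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
    kernel = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
    offsets = [[(0.0, 0.0)] * 3 for _ in range(3)]
    offsets[1][1] = (0.5, 0.5)           # shift only the centre tap
    print(deformable_conv_at(img, kernel, offsets, 1, 1))
```

With all offsets set to zero the function reduces to an ordinary 3 × 3 convolution, which is why the standard operator can be seen as the fixed-grid special case.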

It is important to clarify the differences between our ST-DConvLSTM and two related but distinct approaches. First, TrajGRU [27] learns a flow field to warp the hidden state from sparse, non-grid positions, focusing on trajectory modeling. In contrast, our ST-DConvLSTM integrates Deformable Convolution (DCN) which directly adapts the sampling grid of the convolution kernel. This provides a more direct mechanism to capture local non-rigid deformations without requiring an additional subnetwork to predict a dense flow field. Second, while SA-DConvLSTM [33] also incorporates deformable convolution into a ConvLSTM variant, the integration strategy is fundamentally different. SA-DConvLSTM only replaces the standard convolutions in the input-to-state transitions with DCN, leaving the state-to-state connections as fixed-grid standard convolutions. In contrast, our ST-DConvLSTM replaces the standard convolutions within all ConvLSTM gates. This internal integration allows the network to adaptively adjust its receptive field during the state transition process, offering a more unified framework for modeling non-rigid deformations in precipitation radar echoes.

The second technical enhancement involves introducing an explicit spatiotemporal memory state M into the ST-DConvLSTM module. Specifically, the model extends the state transfer mechanism of traditional ConvLSTM by introducing a third independent memory state M, as shown in Figure 6. At each time step, the network concurrently maintains the hidden state H, cell state C, and memory state M, with these three states exchanging information through a multi-gate convolutional structure during updates. The cell state C primarily handles the storage of short-term local features, whilst the memory state M records the long-term evolutionary trends of the precipitation system. Through the joint regulation of the gating mechanisms, the model can adaptively fuse information from these two states across different temporal scales. This enables the network to respond rapidly to short-term precipitation changes whilst robustly capturing the sustained evolution of large-scale weather systems. The principal mathematical formula for ST-DConvLSTM is as follows:

$$
\begin{aligned}
\Delta P_{hc} &= W_{\mathrm{off\_hc}} * [X_{t}, H_{t-1}] \\
\Delta P_{m} &= W_{\mathrm{off\_m}} * [X_{t}, M_{t-1}] \\
i_{t} &= \sigma(W_{xi} \circledast_{\Delta P_{hc}} X_{t} + W_{hi} \circledast_{\Delta P_{hc}} H_{t-1} + b_{i}) \\
f_{t} &= \sigma(W_{xf} \circledast_{\Delta P_{hc}} X_{t} + W_{hf} \circledast_{\Delta P_{hc}} H_{t-1} + b_{f}) \\
\tilde{C}_{t} &= \tanh(W_{xc} \circledast_{\Delta P_{hc}} X_{t} + W_{hc} \circledast_{\Delta P_{hc}} H_{t-1} + b_{c}) \\
C_{t} &= f_{t} \odot C_{t-1} + i_{t} \odot \tilde{C}_{t} \\
o_{t} &= \sigma(W_{xo} \circledast_{\Delta P_{hc}} X_{t} + W_{ho} \circledast_{\Delta P_{hc}} H_{t-1} + b_{o}) \\
i'_{t} &= \sigma(W'_{xi} \circledast_{\Delta P_{m}} X_{t} + W'_{mi} \circledast_{\Delta P_{m}} M_{t-1} + b'_{i}) \\
f'_{t} &= \sigma(W'_{xf} \circledast_{\Delta P_{m}} X_{t} + W'_{mf} \circledast_{\Delta P_{m}} M_{t-1} + b'_{f}) \\
\tilde{M}_{t} &= \tanh(W'_{xm} \circledast_{\Delta P_{m}} X_{t} + W'_{mm} \circledast_{\Delta P_{m}} M_{t-1} + b'_{m}) \\
M_{t} &= f'_{t} \odot M_{t-1} + i'_{t} \odot \tilde{M}_{t} \\
H_{t} &= o_{t} \odot \tanh(C_{t} + M_{t}).
\end{aligned}
\tag{3}
$$

Here, $M_{t}$ is the memory state responsible for recording long-term evolutionary trends. $\Delta P_{hc}$ and $\Delta P_{m}$ represent the learned offsets. The operator $\circledast_{\Delta P}$ denotes a deformable convolution guided by the offset $\Delta P$, enhancing the model’s capability to handle geometric transformations. $i_{t}$, $f_{t}$, and $o_{t}$ correspond respectively to the input gate, forget gate, and output gate of the standard flow. $i'_{t}$ and $f'_{t}$ correspond respectively to the input gate and forget gate of the memory flow. $\tilde{C}_{t}$ and $\tilde{M}_{t}$ denote the candidate cell state and candidate memory state, recording short-term memory and long-term memory respectively.
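The interaction of the two memory flows in Equation (3) can be traced with a single-pixel sketch. As a deliberate simplification, every deformable convolution is reduced to a scalar product (a 1 × 1 kernel with zero offset), so the offset terms are omitted and only the gating logic is shown. Primed quantities in Equation (3) are written with an `_m` suffix; the helper names are hypothetical.

```python
import math

W_KEYS = ("xi", "hi", "xf", "hf", "xc", "hc", "xo", "ho",
          "xi_m", "mi_m", "xf_m", "mf_m", "xm_m", "mm_m")
B_KEYS = ("i", "f", "c", "o", "i_m", "f_m", "m_m")


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def uniform_params(weight, bias=0.0):
    """Hypothetical helper: set every kernel to `weight` and every bias to `bias`."""
    return {k: weight for k in W_KEYS}, {k: bias for k in B_KEYS}


def st_dconvlstm_pixel_step(x, h_prev, c_prev, m_prev, w, b):
    """One dual-memory update at a single pixel (deformable sampling omitted)."""
    # Standard flow: gates read X_t and H_{t-1} and update the cell state C.
    i = sigmoid(w["xi"] * x + w["hi"] * h_prev + b["i"])
    f = sigmoid(w["xf"] * x + w["hf"] * h_prev + b["f"])
    c_tilde = math.tanh(w["xc"] * x + w["hc"] * h_prev + b["c"])
    c = f * c_prev + i * c_tilde
    o = sigmoid(w["xo"] * x + w["ho"] * h_prev + b["o"])
    # Memory flow: gates read X_t and M_{t-1} and update the memory state M.
    i_m = sigmoid(w["xi_m"] * x + w["mi_m"] * m_prev + b["i_m"])
    f_m = sigmoid(w["xf_m"] * x + w["mf_m"] * m_prev + b["f_m"])
    m_tilde = math.tanh(w["xm_m"] * x + w["mm_m"] * m_prev + b["m_m"])
    m = f_m * m_prev + i_m * m_tilde
    # The hidden state combines both memories under the output gate.
    h = o * math.tanh(c + m)
    return h, c, m


if __name__ == "__main__":
    w, b = uniform_params(0.5)
    h, c, m = 0.0, 0.0, 0.0
    for x in (0.1, 0.4, 0.9):
        h, c, m = st_dconvlstm_pixel_step(x, h, c, m, w, b)
    print(h, c, m)
```

Note that C and M are updated by separate gate sets and only meet in the final line, which is what allows the two states to evolve on different temporal scales.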

Unlike models built on a single standard ConvLSTM, the temporal branch uses the improved ST-DConvLSTM module. This enhances the network’s ability to represent complex precipitation dynamics, mitigating the long-term information forgetting and fixed receptive field issues inherent in deep recurrent networks. The module inherits ConvLSTM’s ability to preserve spatial structure and, through the added deformable convolutions and memory state, also achieves decoupled modeling of long-term and short-term features and dynamic capture of radar echo variations. Consequently, it enhances the spatiotemporal consistency and stability of forecasts.

2.3.3. Temporal Branch

The temporal branch primarily undertakes the modeling of temporal dynamic evolution characteristics in precipitation radar echo sequences. Addressing the non-rigid deformation and complex trajectory features exhibited by precipitation radar echoes during motion, this branch constructs a symmetric encoder–decoder architecture based on a sequence-to-sequence paradigm, employing an enhanced ST-DConvLSTM module as its core computational unit.

The encoder comprises two stacked layers of ST-DConvLSTM modules. The first encoder layer primarily captures local information and short-term motion trends. Subsequently, the feature maps enter the second encoder layer, where the doubling of channel count enables the network to extract higher-order semantic information, such as the overall movement direction and intensity evolution patterns of precipitation echoes. Throughout this process, the internally integrated deformable convolution mechanism within the module adaptively adjusts the receptive field according to changes in echo morphology, significantly enhancing the model’s ability to extract features from non-stationary meteorological processes. The decoder likewise comprises two layers of ST-DConvLSTM, with the number of channels decreasing sequentially across layers. While gradually restoring feature dimensions, it recursively generates temporal representations for future time steps using state information passed from preceding layers. Finally, the feature tensor output from the decoder undergoes three-dimensional batch normalisation, standardising the temporal feature distribution and enhancing the model’s convergence stability during deep network training.

To address information decay in long sequence prediction, the temporal branch incorporates a symmetric reverse-order state transfer mechanism. Unlike conventional concatenation, this mechanism uses the hidden state, cell state, and memory state output by the encoder at the final time step as prior knowledge, and initializes the corresponding layers of the decoder in reverse order. The deep high-order semantic states of the encoder are directly passed to the starting layer of the decoder, laying the macroscopic evolutionary foundation for the predicted sequence. Meanwhile, the shallow fine-grained states of the encoder are passed to the output layer of the decoder to guide the reconstruction of local temporal details. Specifically, the deep encoder states initialize the shallow decoder, and the shallow encoder states initialize the deep decoder.
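The reverse-order initialisation can be expressed in a few lines. The sketch below assumes that the encoder's final states are collected per layer, ordered from shallow to deep, each as an (H, C, M) triple; the function name and the string placeholders are illustrative only.

```python
def init_decoder_states(encoder_final_states):
    """Map final encoder states onto decoder layers in reverse order.

    encoder_final_states: list ordered shallow -> deep, each an (H, C, M) triple
                          taken at the last encoder time step.
    Returns a list ordered by decoder layer (first -> last).
    """
    return list(reversed(encoder_final_states))


if __name__ == "__main__":
    # Placeholders stand in for the actual state tensors.
    encoder_states = [("H_enc1", "C_enc1", "M_enc1"),   # shallow encoder layer
                      ("H_enc2", "C_enc2", "M_enc2")]   # deep encoder layer
    decoder_init = init_decoder_states(encoder_states)
    for layer, states in enumerate(decoder_init, start=1):
        print("decoder layer", layer, "<-", states)
```

This ordering is also what makes the channel counts line up: the encoder doubles its channels from the first to the second layer, while the decoder reduces them, so the deep encoder layer and the first decoder layer share the wider state.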

2.3.4. Spatial Branch

The spatial branch aims to extract static features with a global perspective and multi-scale spatial structural information from radar echo sequences, thereby compensating for spatial details that the temporal branch may overlook when focusing on dynamic evolution.

To construct a representative spatial input representation, the model first performs temporal average pooling on the input sequence

X \in R^{B \times C \times T \times H \times W}

. By calculating the mean value of the input sequence across the temporal dimension, this operation smooths out short-term random fluctuations, thereby generating a feature map that reflects the average intensity distribution and overall spatial morphology of the precipitation system during that time period.
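As a minimal sketch of this pooling step, the snippet below averages a 5D array over its temporal axis with NumPy. The function name `temporal_average_pool` and the toy tensor sizes are assumptions made for illustration; only the (B, C, T, H, W) layout and the 12 input frames come from the text.

```python
import numpy as np


def temporal_average_pool(x: np.ndarray) -> np.ndarray:
    """Average a (B, C, T, H, W) sequence over T, giving (B, C, H, W)."""
    if x.ndim != 5:
        raise ValueError("expected a 5D array shaped (B, C, T, H, W)")
    return x.mean(axis=2)


rng = np.random.default_rng(0)
x = rng.random((2, 1, 12, 8, 8))  # toy sizes; 12 input frames as in the paper
pooled = temporal_average_pool(x)  # shape (2, 1, 8, 8)
```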

The input spatial feature map first passes through a layer of dilated convolution modules. By employing

3 \times 3

convolution kernels with a dilation rate of 2, this module significantly expands the receptive field without reducing the feature map resolution. This design enables the network to simultaneously capture both the local core of intense echoes and the background structure within its neighbourhood, thereby enhancing its ability to analyse multi-scale precipitation patterns. Features extracted via dilated convolution subsequently undergo batch normalisation and ReLU activation to enhance nonlinear representation and stabilise feature distribution.
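The effect of the dilation rate on the receptive field and on the output resolution can be checked with a few lines of arithmetic. The helper names below are hypothetical, and the padding of 2 pixels per side is an assumption: the text states only that the resolution is preserved, which is what this padding achieves at stride 1.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by a dilated kernel: k + (k - 1)(d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def same_padding(kernel_size: int, dilation: int) -> int:
    """Per-side padding that preserves H and W at stride 1 (odd kernels)."""
    return dilation * (kernel_size - 1) // 2


def output_size(size: int, kernel_size: int, dilation: int,
                padding: int, stride: int = 1) -> int:
    """Standard convolution output-size formula."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


extent = effective_kernel_size(3, 2)   # a 3 x 3 kernel with dilation 2 spans 5 x 5
pad = same_padding(3, 2)               # 2 pixels per side
out = output_size(288, 3, 2, pad)      # a 288-pixel side stays at 288
```

A single layer therefore covers the same 5 × 5 neighbourhood as a dense 5 × 5 kernel while keeping only 9 weights per channel pair.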

Subsequently, the feature stream enters the Transformer module. Recognising the limitations of convolutional operations in modeling long-range dependencies, we introduce a Transformer architecture within the spatial branch. To preserve fine-grained spatial details crucial for locating intense precipitation cells, we treat each pixel’s feature vector as an individual token instead of adopting the patch-wise partitioning common in Vision Transformers (ViT). Specifically, the feature map of shape

(B, C, H, W)

is flattened and transposed into a sequence of

(B, L, D)

, where the sequence length

L = H \times W

and the feature dimension

D = 64

. The Transformer is configured as a standard encoder with 2 layers, 8 attention heads, and a feed-forward network (FFN) dimension of 256. This architecture enables the model to capture global spatial dependencies by calculating pairwise self-attention scores across all

H \times W

positions, effectively modeling the holistic perception of large-scale precipitation systems.
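To give a sense of scale for this configuration, the sketch below counts the parameters of a standard Transformer encoder layer and the number of pairwise attention scores for a pixel-level token sequence. It assumes the usual layer layout (four d × d attention projections, a two-layer FFN, and two LayerNorms, all with biases) and ignores any positional embeddings or input projections, which the text does not specify; the function names and the 16 × 16 toy grid are hypothetical.

```python
def encoder_layer_params(d_model: int, d_ff: int) -> int:
    """Parameter count of one standard Transformer encoder layer (with biases)."""
    attention = 4 * (d_model * d_model + d_model)  # Q, K, V and output projections
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)                # two LayerNorms: scale and shift
    return attention + feed_forward + layer_norms


def attention_scores(height: int, width: int, heads: int) -> int:
    """Pairwise attention scores per sample when every pixel is a token."""
    tokens = height * width
    return heads * tokens * tokens


per_layer = encoder_layer_params(d_model=64, d_ff=256)  # 49,984
total = 2 * per_layer                                   # 99,968 for the 2-layer encoder
toy_scores = attention_scores(16, 16, heads=8)          # grows as (H * W) ** 2
```

Under these assumptions the encoder holds roughly 0.1 M parameters. The number of attention scores, in contrast, grows quadratically with H × W, so the spatial size at which the branch operates dominates its memory cost.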

The feature tensor shape of the spatial branch output is

S = (B, C_{s}, H, W)

. To fuse with the temporal branch output tensor

T = (B, C_{t}, T, H, W)

, we employ a temporal replication strategy. The spatial features are replicated T times along the temporal dimension, expanding them to

S^{'} = (B, C_{s}, T, H, W)

. This ensures consistent and rich global spatial context information is available for dynamic prediction in each frame during subsequent fusion stages.

2.3.5. Feature Fusion

The feature fusion module serves as the pivotal hub connecting the dual-branch architecture with the final prediction output. Its core task is to achieve deep interaction and adaptive fusion between temporal dynamic features and spatial static features. Given the significant heterogeneity in physical meaning and distribution patterns of the features extracted by the two branches, the fusion module avoids simple linear superposition. Instead, it employs a processing mechanism comprising temporal alignment, distribution normalisation, and attention recalibration. The tensor T output from the temporal branch is concatenated along the channel dimension with the spatially replicated static features

S^{'}

from the spatial branch, generating the initial fusion tensor

F_{merge} = Concat (T, S^{'}) \in R^{B \times (C_{t} + C_{s}) \times T \times H \times W}

.
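The replication and concatenation steps can be sketched with NumPy as follows. The function name `replicate_and_concat`, the argument names `temporal` and `spatial`, and the toy channel counts are assumptions for illustration; only the tensor layouts follow the text.

```python
import numpy as np


def replicate_and_concat(temporal: np.ndarray, spatial: np.ndarray) -> np.ndarray:
    """Fuse (B, C_t, T, H, W) with (B, C_s, H, W) into (B, C_t + C_s, T, H, W)."""
    steps = temporal.shape[2]
    # Insert a temporal axis and copy the static map once per time step.
    replicated = np.repeat(spatial[:, :, np.newaxis, :, :], steps, axis=2)
    # Concatenate along the channel axis.
    return np.concatenate([temporal, replicated], axis=1)


rng = np.random.default_rng(0)
temporal = rng.random((2, 8, 6, 5, 5))  # B=2, C_t=8, T=6, H=W=5 (toy sizes)
spatial = rng.random((2, 4, 5, 5))      # B=2, C_s=4
merged = replicate_and_concat(temporal, spatial)
```

Every time step of the merged tensor sees the same static context in its last C_s channels, while the first C_t channels carry the frame-specific temporal features.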

As T and

S^{'}

originate from recurrent neural networks and convolutional neural networks respectively, their numerical distributions exhibit significant divergence. Direct processing may lead to training instability or the gradient being dominated by a particular branch feature. Therefore, the fusion module first introduces 3D batch normalisation to standardize

F_{merge}

, forcing a unified feature distribution, accelerating convergence, and balancing the contributions of the two branches. Subsequently, the normalised feature tensor is fed into the CBAM module, where the fused features are adaptively recalibrated through the dual mechanisms of channel attention and spatial attention. Channel attention identifies and enhances the feature channels most discriminative for precipitation prediction while suppressing redundant noise. Spatial attention focuses on high-probability precipitation regions, strengthening the model’s attention to core echo patterns. To accommodate the temporal nature of the task, the CBAM module is adapted for 5D tensors using a time-distributed mechanism. The fused tensor, defined above with layout (B, C, T, H, W), is first permuted to (B, T, C, H, W) and then temporarily reshaped into (B × T, C, H, W) so that attention is applied independently to each time step. This ensures that the importance of different meteorological features is evaluated frame-by-frame while maintaining the overall sequence structure. Following this, a 3D convolutional layer with a kernel size of

3 \times 3 \times 3

is employed to fuse features across the temporal dimension and generate final predictions. This further smooths spatiotemporal boundaries, ensuring the generated radar echo sequence of six frames exhibits high temporal continuity and physical consistency in its evolution.
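The time-distributed reshaping used around the attention module can be sketched as below. Starting from the channel-first layout of the fused tensor, the temporal axis is moved next to the batch axis before the two are merged, and the inverse is applied afterwards. CBAM itself is not implemented here: `attention_placeholder` is a hypothetical stand-in that only gates channels, and all function names are assumptions.

```python
import numpy as np


def to_frames(x: np.ndarray) -> np.ndarray:
    """(B, C, T, H, W) -> (B*T, C, H, W), keeping each frame's channels together."""
    b, c, t, h, w = x.shape
    return x.transpose(0, 2, 1, 3, 4).reshape(b * t, c, h, w)


def from_frames(frames: np.ndarray, batch: int) -> np.ndarray:
    """(B*T, C, H, W) -> (B, C, T, H, W), inverting to_frames."""
    bt, c, h, w = frames.shape
    return frames.reshape(batch, bt // batch, c, h, w).transpose(0, 2, 1, 3, 4)


def time_distributed(per_frame_fn, x: np.ndarray) -> np.ndarray:
    """Apply a 4D per-frame function to every time step of a 5D tensor."""
    return from_frames(per_frame_fn(to_frames(x)), x.shape[0])


def attention_placeholder(frames: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for CBAM: gate each channel by its squashed mean."""
    gate = 1.0 / (1.0 + np.exp(-frames.mean(axis=(2, 3), keepdims=True)))
    return frames * gate


x = np.arange(2 * 3 * 6 * 4 * 4, dtype=np.float64).reshape(2, 3, 6, 4, 4)
y = time_distributed(attention_placeholder, x)  # same shape as x
```

Reshaping a channel-first tensor directly, without the permutation, would mix channels and time steps within the merged axis, which is why the transpose precedes the reshape.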

3. Results

3.1. Dataset

The ST-DualNet network employs radar echo datasets released by the Royal Netherlands Meteorological Institute (Koninklijk Nederlands Meteorologisch Instituut, KNMI) for model training and evaluation. This dataset is collected by two C-band Doppler weather radars located at De Bilt and Den Helder, covering the entire territory of the Netherlands and neighbouring regions [19]. The raw data spans the period from 2016 to 2019, featuring a temporal resolution of 5 min and a spatial resolution of 1 km. Original radar images measure

765 \times 700

pixels, with each pixel value representing the cumulative rainfall over the preceding 5 min. This data not only contains precipitation intensity information but also documents the spatial distribution patterns of precipitation systems.

To enhance data quality and adapt to model inputs, we implemented a rigorous preprocessing workflow. First, because the edges of the raw radar images contain invalid regions beyond the detection range, and to reduce computational redundancy, we cropped the image centre to a size of 288 × 288 pixels. Second, precipitation data are inherently imbalanced: non-rainfall samples vastly outnumber rainfall samples, so direct training tends to bias models towards predicting zero values. Therefore, this study followed the strategy of Trebing et al. and constructed the NL-50 dataset [19]. This dataset employs a filtering mechanism to retain only samples where at least 50% of pixels within the target image exhibit non-zero precipitation values (intensity > 0 mm/min). This ensures the model focuses on capturing complex spatiotemporal evolution patterns under high precipitation probability. During sample construction, a sliding window method generates continuous sequence samples. Each sample comprises 18 frames of radar echo images, where the first 12 frames (corresponding to the past hour) serve as the input sequence

X

, and the subsequent 6 frames (representing the next 30 min) constitute the predicted ground truth

Y

[19]. Finally, the training and validation sets of the processed precipitation dataset together contain 5734 samples, whilst the test set contains 1557 samples.
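The sample construction can be sketched as a sliding window followed by the 50% rain-coverage filter. Two details are assumptions rather than statements from the text: the window advances one frame at a time, and the coverage criterion is evaluated on the last frame of each window. The function name `build_samples` and the toy arrays are likewise hypothetical.

```python
import numpy as np


def build_samples(frames: np.ndarray, n_in: int = 12, n_out: int = 6,
                  min_rain_fraction: float = 0.5):
    """Slice time-ordered (N, H, W) rain maps into filtered (X, Y) pairs."""
    window = n_in + n_out
    samples = []
    for start in range(frames.shape[0] - window + 1):
        clip = frames[start:start + window]
        last_target = clip[-1]  # assumption: the filter looks at the final frame
        rain_fraction = np.count_nonzero(last_target > 0) / last_target.size
        if rain_fraction >= min_rain_fraction:
            samples.append((clip[:n_in], clip[n_in:]))  # X: 12 frames, Y: 6 frames
    return samples


wet = np.ones((20, 4, 4))   # every pixel rainy: all windows pass the filter
dry = np.zeros((20, 4, 4))  # no rain anywhere: all windows are discarded
wet_samples = build_samples(wet)
dry_samples = build_samples(dry)
```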

To ensure the reliability and temporal independence of experimental results, the dataset was strictly partitioned by year to prevent leakage of future information. Data from 2016 to 2018 were selected for model training and validation, while the full year of 2019 was used for model testing [19]. The ratio of training, validation, and testing sets was approximately 7:1:2. Finally, to accelerate model convergence and eliminate the influence of numerical dimensions, all input data underwent maximum value normalisation. Pixel values were divided by the maximum rainfall intensity observed within the training set, mapping the data to the range [0, 1].
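A minimal sketch of the maximum-value normalisation is given below, with hypothetical function names and toy values. The scaling constant is taken from the training data only, in line with the year-based split.

```python
import numpy as np


def fit_max(train: np.ndarray) -> float:
    """Return the largest rainfall value observed in the training set."""
    peak = float(train.max())
    if peak <= 0:
        raise ValueError("training data contain no positive rainfall values")
    return peak


def normalise(x: np.ndarray, peak: float) -> np.ndarray:
    """Divide by the training-set maximum."""
    return x / peak


train = np.array([[0.0, 2.0], [4.0, 8.0]])
test = np.array([[1.0, 10.0]])
peak = fit_max(train)
train_n = normalise(train, peak)  # lies in [0, 1]
test_n = normalise(test, peak)    # may exceed 1 if a test value tops the training peak
```

Because the constant is fixed on the training years, the [0, 1] range is guaranteed for the training data only; a more intense event in the test year would map slightly above 1.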

3.2. Model Evaluation

To comprehensively evaluate the performance of the ST-DualNet model in short-term precipitation forecasting tasks, we adopt a standard quantitative evaluation system commonly used in meteorology, which covers both continuous and categorical metrics. The mean squared error (MSE) serves as a continuous metric to measure the overall deviation between predicted precipitation maps and actual radar echo maps at the pixel level. A lower MSE value indicates that the precipitation intensity distribution generated by the model is closer to the real observation, reflecting better fitting performance. Its calculation formula is given as follows:

MSE = \frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2},

(4)

where

y_{i}

denotes the true precipitation intensity,

{\hat{y}}_{i}

represents the precipitation intensity predicted by the model, and n is the total number of pixels in the image.

Since precipitation forecasting focuses not only on numerical accuracy but also on the spatial location capability of precipitation events, we also introduce a set of categorical evaluation metrics. Based on the intensity threshold r of meteorological radar echoes, continuous forecast maps and ground truth maps are converted into binary masks. Then the confusion matrix is statistically analysed to determine True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN). Based on this confusion matrix, the following key metrics are calculated: Precision measures the proportion of correctly predicted precipitation areas among all areas predicted as precipitation; Recall reflects the completeness of the model in detecting actual precipitation regions; the F1 score represents the harmonic mean of Precision and Recall, providing a comprehensive assessment of detection performance. Most crucially, the Critical Success Index (CSI) indicates the model’s capability to accurately identify precipitation events. Additionally, Accuracy reflects the proportion of correctly predicted pixels out of the total pixels, while the Heidke Skill Score (HSS) [27] measures the overall improvement of the model relative to random forecasting. Generally, higher Precision, Recall, F1, CSI, HSS and Accuracy values indicate superior precipitation detection performance. The calculation formulas are as follows:

\begin{matrix} Precision = \frac{T P}{T P + F P} \\ Recall = \frac{T P}{T P + F N} \\ F 1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} \\ CSI = \frac{T P}{T P + F P + F N} \\ Accuracy = \frac{T P + T N}{T P + T N + F P + F N} \\ HSS = \frac{T P \times T N - F P \times F N}{(T P + F N) (F N + T N) + (T P + F P) (F P + T N)} . \end{matrix}

(5)
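Equations (4) and (5) can be evaluated with the short sketch below. The function names are hypothetical, the binarisation uses a strict `>` comparison (an assumption, since the text does not state whether the threshold itself counts as rain), and undefined ratios are returned as NaN. The threshold must be expressed in the same units as the arrays.

```python
import numpy as np


def mse(y_true, y_pred) -> float:
    """Eq. (4): mean squared error over all pixels."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(diff ** 2))


def _ratio(num: float, den: float) -> float:
    return num / den if den else float("nan")


def categorical_scores(y_true, y_pred, threshold: float) -> dict:
    """Eq. (5): scores derived from the thresholded confusion matrix."""
    obs = np.asarray(y_true) > threshold   # assumption: strict inequality
    pred = np.asarray(y_pred) > threshold
    tp = int(np.sum(pred & obs))
    fp = int(np.sum(pred & ~obs))
    fn = int(np.sum(~pred & obs))
    tn = int(np.sum(~pred & ~obs))
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return {
        "Precision": precision,
        "Recall": recall,
        "F1": _ratio(2 * precision * recall, precision + recall),
        "CSI": _ratio(tp, tp + fp + fn),
        "Accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "HSS": _ratio(tp * tn - fp * fn,
                      (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)),
    }


y_true = np.array([[0.0, 1.0], [1.0, 1.0]])
y_pred = np.array([[0.0, 1.0], [0.0, 1.0]])
scores = categorical_scores(y_true, y_pred, threshold=0.5)
error = mse(y_true, y_pred)
```

In this toy case TP = 2, FP = 0, FN = 1 and TN = 1, so CSI = 2/3 and HSS = 2/8 = 0.25. Note that the HSS form in Eq. (5) carries no factor of 2 in the numerator, unlike the common textbook definition, so a perfect forecast scores 0.5 under this form.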

3.3. Experimental Setup

All experiments in this study were conducted on a high-performance computing server running a Linux system. The server is equipped with three NVIDIA GeForce RTX 4090 GPUs (24 GB memory each) with driver version 535.113.01. The experiments were implemented in Python 3.8 using the PyTorch 2.6.0 deep learning framework. To further clarify the model’s structural scale and computational complexity for practical deployment, we analyzed its total parameters and FLOPs. The ST-DualNet consists of approximately 5.12 M parameters, and its total computational cost for a single forward pass (predicting 6 frames from 12 input frames) is 14.04 GFLOPs.

The initial learning rate during the model training phase is set to 0.0005, with a batch size of 4 and a patch size of 4. The number of hidden units in the recurrent cells is configured as ‘8, 8, 8, 8’. The learning rate automatically decays to 0.1 times its original value when the validation set loss fails to decrease for 15 consecutive epochs. The loss function employs Mean Squared Error (MSE) to measure the error between predicted precipitation intensity and actual values. The training process typically converges within approximately 150 epochs, ensuring the model reaches a stable and optimal state.
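The plateau-based decay rule can be sketched in plain Python as follows. This is an illustrative re-statement of the rule in the text, not the scheduler used by the authors: the class name `PlateauDecay` is hypothetical, and the sketch assumes that the factor of 0.1 is applied to the current learning rate each time the validation loss has failed to improve for 15 consecutive epochs.

```python
class PlateauDecay:
    """Multiply the learning rate by `factor` after `patience` stagnant epochs."""

    def __init__(self, lr: float = 5e-4, factor: float = 0.1, patience: int = 15):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> float:
        """Record one epoch's validation loss and return the learning rate to use next."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


scheduler = PlateauDecay()
history = [scheduler.step(1.0) for _ in range(16)]  # one best epoch, then 15 stagnant ones
```

Library schedulers may count the patience window slightly differently, so the exact epoch of the first decay can shift by one depending on the implementation.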

3.4. Experimental Results and Comparison with Mainstream Models

In this subsection, to comprehensively validate the effectiveness and advancement of the proposed ST-DualNet model in the short-term precipitation forecasting tasks, we conduct extensive comparative evaluations against representative mainstream methods and advanced models within the current domain. The selected benchmark models are mainly divided into three categories. The first category comprises classical spatiotemporal sequence prediction models, such as ConvLSTM and its improved variants PredRNN [26] and PredRNN++ [34], which represent the mainstream direction based on recurrent neural networks. The second category comprises CNN-based segmentation models, including the standard UNet and its meteorologically optimized variant SmaAt-UNet. The third category encompasses recently proposed high-performance architectures such as MIM [35], Rainformer [31], and GA-SmaAt-GNet [36].

To ensure a rigorous evaluation, we ran ST-DualNet and the classical baselines under unified settings. For the CNN-based baseline models, the 12 input radar frames are stacked along the channel dimension, resulting in an input tensor of shape (B, 12, H, W). This configuration allows the 2D convolutional kernels to capture temporal correlations by treating time steps as feature channels. For MIM and Rainformer, we directly cite the results reported in their original papers to avoid reproduction errors, as they use the identical KNMI NL-50 dataset and evaluation protocol. For GA-SmaAt-GNet, we likewise cite the results reported in the original paper, with specific experimental differences noted in the corresponding tables. Although its evaluation spanned a broader period than our test set, both studies are based on the same KNMI radar infrastructure and data processing protocols. We chose to cite the original results so that the baseline is represented at its reported performance, thereby avoiding the risks of a sub-optimal re-implementation. While this temporal misalignment is a limitation, the consistent data source supports the observed relative strengths of the models, specifically ST-DualNet’s higher HSS.

Table 1 presents the quantitative comparison results between ST-DualNet and current mainstream spatiotemporal sequence prediction models under the 0.5 mm/h threshold. Overall, ST-DualNet demonstrates outstanding performance across all key metrics. Compared to the classic ConvLSTM and PredRNN series models, ST-DualNet achieves significant improvements in both CSI and HSS, two critical meteorological metrics. For instance, relative to PredRNN++, the proposed model increases CSI from 0.690 to 0.748 (a relative improvement of about 8.4%) and raises HSS from 0.351 to 0.384 (a relative improvement of about 9.4%). This indicates that ST-DualNet captures the non-rigid deformation and complex motion trajectories of precipitation radar echoes more effectively than traditional RNN-based units.
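The relative gains over PredRNN++ can be checked from the Table 1 values quoted above. The helper name `relative_gain` is hypothetical; it measures the gain against the baseline score, which is the usual convention for an improvement over a baseline.

```python
def relative_gain(baseline: float, new: float) -> float:
    """Percentage improvement of `new` over `baseline`, relative to the baseline."""
    return (new - baseline) / baseline * 100.0


csi_gain = relative_gain(0.690, 0.748)  # about 8.41 %
hss_gain = relative_gain(0.351, 0.384)  # about 9.40 %
```

Dividing the same differences by the new scores instead of the baseline scores gives about 7.75% and 8.59%, so the choice of denominator should be stated whenever such percentages are reported.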

ST-DualNet also exhibits strong performance when compared to recently proposed high-performance models. Against Rainformer, ST-DualNet shows advantages of 0.081 in CSI and 0.045 in HSS. This verifies that the carefully designed convolutional and recurrent architecture retains distinctive strengths in processing local texture details and short-term dynamic variations. It is worth noting that although GA-SmaAt-GNet shows competitive performance in CSI, ST-DualNet achieves the highest HSS value of 0.384 among all compared models, ranking first in this metric that better reflects the genuine predictive skill of a model. Given that the HSS metric effectively filters out random prediction noise while comprehensively balancing hits and false alarms, this result suggests that ST-DualNet attains a currently superior level in terms of prediction reliability and robustness.

In short-term precipitation forecasting, a model is required not only to accurately distinguish between rainy and non-rainy areas but also to capture precipitation events with high intensity. We set the precipitation intensity thresholds r at 0.5 mm/h, 2 mm/h, 5 mm/h and 10 mm/h, corresponding to light rain, moderate rain, heavy rain and torrential rain levels respectively. For each threshold, we calculated the CSI, HSS and MSE metrics, comparing ST-DualNet against UNet, SmaAt-UNet, ConvLSTM, PredRNN, and PredRNN++. The results are shown in Table 2, Table 3, Table 4 and Table 5.

From the perspective of the MSE metric, ST-DualNet consistently achieves the lowest error values across all tests, showing a significant reduction compared to both the UNet series based on convolutional architectures and the LSTM series based on recurrent architectures. This indicates that the dual-branch architecture effectively mitigates the blurring effect in predicted images through independent spatiotemporal modeling and efficient feature fusion, demonstrating a clear advantage in the accuracy of reconstruction at the pixel level.

In Table 2, Table 3, Table 4 and Table 5, the Mean Squared Error (MSE) is reported as a global continuous metric representing the overall pixel-wise accuracy of the model. Consequently, for a specific model, the MSE remains consistent across different evaluation tables. Conversely, the categorical metrics (CSI and HSS) are calculated based on specific intensity thresholds (r), thus varying to reflect the model’s skill in capturing different rainfall levels.

Figure 7 illustrates the variations in CSI and HSS scores. Although the performance of all models degrades as the threshold increases, the proposed ST-DualNet consistently outperforms the compared methods across all thresholds, demonstrating superior robustness, especially for heavier rainfall events.

In the fundamental precipitation forecasting task (r = 0.5 mm/h), ST-DualNet also demonstrates outstanding performance, achieving a CSI of 0.748 and an HSS of 0.384. This confirms the model’s exceptional accuracy in the basic classification task of distinguishing between rainy and non-rainy conditions.

As the precipitation intensity threshold increases, all models exhibit a natural decline in predictive performance. However, ST-DualNet demonstrates exceptional robustness during moderate to heavy rainfall events. At the moderate rainfall level, while SmaAt-UNet shows strong competitiveness, ST-DualNet maintains its lead with a CSI value of 0.578. More notably, traditional RNN models exhibited particularly severe performance degradation under the most challenging heavy and torrential rain conditions. For instance, under extreme conditions of 10.0 mm/h, PredRNN achieved a CSI of merely 0.110, while ConvLSTM dropped to 0.129, struggling to effectively capture the core of high intensity precipitation. By contrast, ST-DualNet maintained a CSI of 0.221 and HSS of 0.180 at this threshold, significantly outperforming all benchmarks. This outcome robustly demonstrates the pivotal role of the ST-DConvLSTM unit within the model’s temporal branch. Its internal deformable convolution mechanism adaptively captures the non-rigid deformation and rapid movement trajectories of intense precipitation radar echoes, thereby effectively reducing false negatives in extreme weather scenarios. This validates the method’s advanced capability and reliability in handling complex spatiotemporal dynamic variations.

In addition to the quantitative evaluation, we conducted a visualization analysis to intuitively demonstrate the models’ capability in capturing spatiotemporal evolution patterns. Figure 8 displays the continuous six-frame forecast results for a typical precipitation case under the threshold condition of r = 0.5 mm/h.

As can be clearly observed in the figure, with increasing forecast time steps, the images generated by the comparison models gradually exhibit a pronounced smoothing effect, leading to the loss of textural detail in high-intensity echo regions. In contrast, ST-DualNet effectively mitigates this blurring issue, not only accurately predicting the movement trajectories of precipitation radar echoes but also preserving the shape characteristics and intensity distribution of the radar echoes exceptionally well, yielding results closest to the true images.

3.5. Ablation Experiments and Analysis

To thoroughly investigate the effectiveness of each core component within ST-DualNet and their contribution to the model’s overall performance, this subsection conducts systematic ablation experiments on the KNMI NL-50 dataset. Using the full model (Ours) as the baseline, we constructed five distinct variant models for comparative validation: w/o DeformConv replaces deformable convolutions in the temporal branch with standard convolutions to verify the necessity of dynamic deformation modeling; w/o ST-Memory removes the spatiotemporal memory unit M from ST-DConvLSTM, retaining only the traditional H and C states to validate the role of long-term sequential memory; w/o Spatial Branch entirely removes the spatial branch, retaining only the temporal branch for prediction to assess the benefit of the dual-branch architecture; w/o Transformer removes the Transformer module from the spatial branch, using only convolutions to extract local features to validate the importance of global dependency modeling; w/o CBAM removes the attention mechanism in the feature fusion stage, employing direct channel concatenation to validate the efficacy of adaptive feature fusion. Detailed quantitative results of the ablation experiments are presented in Table 6.

The data in Table 6 show that removing the spatial branch causes the largest drop in model performance. Compared with the complete model, the w/o Spatial Branch variant exhibits a drop in CSI and HSS of approximately 0.069 and 0.041, respectively, while the MSE increases to 0.01072. This outcome indicates that the global static visual features supplied by the spatial branch are crucial for compensating for the loss of detail in the temporal branch during long sequence predictions. Furthermore, the w/o CBAM variant also exhibits a notable performance decline, with CSI dropping to 0.702. This indicates that directly concatenating spatiotemporal features is insufficient to fully leverage their complementary nature. Introducing an attention mechanism for adaptive filtering and recalibration of heterogeneous features is a critical step in enhancing prediction accuracy.

Regarding the internal design of the temporal branch, the comparison between the w/o DeformConv variant and the full model demonstrates the contribution of deformable convolutions. While the absolute improvement in CSI at the 0.5 mm/h threshold is approximately 0.004, we argue that this gain is meaningful for two reasons. First, in high-resolution precipitation forecasting where the CSI has already reached a high-performance plateau, any incremental improvement is challenging to achieve, and even a small margin can lead to more accurate precipitation forecasts in practical applications. Second, the necessity of DCN should be evaluated not only by the absolute gain in a single evaluation metric, but also by its contribution to the overall robustness and generalization of the model. Moreover, the additional parameters introduced by DCN are relatively limited. For a standard 3 × 3 kernel, the offset layer adds only 18 channels of convolution. Our internal profiling shows that the parameter count of ST-DConvLSTM increases by less than 6% compared to standard ConvLSTM, which we believe is a reasonable trade-off for the added structural flexibility. Meanwhile, the experimental results of w/o ST-Memory indicate that the absence of a dedicated spatiotemporal memory unit leads to information forgetting when handling long sequence dependencies, consequently impairing prediction accuracy.
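The figure of 18 offset channels follows from predicting one (Δy, Δx) pair per kernel sampling location. The sketch below makes this arithmetic explicit and estimates the size of the offset-predicting convolution, assuming it uses the same kernel size as the main convolution and a bias term. The function names and the input channel count of 16 are hypothetical and do not describe the authors' configuration.

```python
def offset_channels(kernel_size: int) -> int:
    """Two offsets (dy, dx) for each of the k * k sampling locations."""
    return 2 * kernel_size * kernel_size


def offset_conv_params(in_channels: int, kernel_size: int, bias: bool = True) -> int:
    """Parameters of a k x k convolution that predicts the offset field."""
    out_channels = offset_channels(kernel_size)
    weights = in_channels * out_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)


channels = offset_channels(3)          # 2 * 3 * 3 = 18
params = offset_conv_params(16, 3)     # 16 * 18 * 9 + 18 = 2610 (hypothetical input width)
```

The overhead scales linearly with the number of input channels, while the offset layer's output width stays fixed at 18 for a 3 × 3 kernel.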

Regarding the internal design of the spatial branch, a comparison between the w/o Transformer variant and the complete model shows that removing the Transformer module lowers all evaluation metrics to varying degrees, with a CSI reduction of approximately 0.008. This indicates that convolutional operations alone struggle to capture long-range spatial dependencies within images. In contrast, the self-attention mechanism of the Transformer enhances the spatial branch’s ability to perceive the macroscopic distribution patterns of precipitation systems, thereby improving the final prediction performance. Because the Transformer component has only 2 layers, its contribution to the overall 5.12 M parameters is minimal, making the performance gain a cost-effective improvement for modeling macroscopic precipitation patterns.

Thus, it can be concluded that the superior performance of ST-DualNet stems from the organic integration of its components. The dual-branch architecture establishes the foundation for feature complementarity, while CBAM achieves efficient feature fusion. Concurrently, ST-DConvLSTM and Transformer play irreplaceable roles in capturing local dynamics and modeling global static patterns respectively.

4. Discussion

Through comparative experiments and ablation analysis, we found that ST-DualNet outperforms mainstream models such as ConvLSTM and PredRNN on the KNMI NL-50 dataset.

The experimental results first confirm the effectiveness of the improved ST-DConvLSTM unit in the temporal branch. By incorporating deformable convolutions, the ST-DConvLSTM module within the temporal branch endows the network with the capability to adaptively adjust its receptive field. This enables the model to dynamically track the motion characteristics of radar echoes, aligning closely with the physical properties of atmospheric fluids. Consequently, metrics such as MSE, CSI, and HSS are significantly improved. Meanwhile, to address the common issues of gradient vanishing and information forgetting in long-sequence prediction, this unit introduces an independent spatiotemporal memory state M, distinct from the traditional ConvLSTM cell state C. State M is vertically propagated between layers and horizontally extended along the temporal axis, establishing a gradient memory flow. This enables deep networks to effectively preserve high-dimensional spatiotemporal features from the initial time step. Ablation experiments demonstrating performance degradation upon removing M conclusively validate the critical role of this independent memory mechanism in sustaining long-term prediction stability.

The experimental results also confirm the superiority of the spatiotemporal dual-branch strategy in meteorological forecasting tasks. Traditional methods based on ConvLSTM attempt to encode both spatial and temporal information within a single unit, which often leads to premature blurring of spatial details in deep networks. ST-DualNet effectively separates the modeling tasks of dynamic evolution and static structure by designing independent temporal and spatial branches. The Transformer introduced in the spatial branch explicitly models global spatial dependencies, compensating for the local nature of convolutional operations and ensuring the overall morphological plausibility of predicted images. Ablation experiments show that removing the spatial branch results in a significant performance drop, which further confirms the crucial role of global static features in maintaining the stability of long-sequence predictions.

The effective fusion of the heterogeneous features produced by the two branches constitutes another critical factor in enhancing prediction accuracy. Simple channel concatenation often fails to address the heterogeneity of features across different branches in terms of semantic hierarchy and numerical distribution. The CBAM attention mechanism introduced in our work acts as an adaptive gating mechanism. Along the channel dimension, it filters key feature maps based on precipitation intensity, suppressing background noise. Along the spatial dimension, it guides the model to focus on high-intensity echo core regions. Experimental data demonstrate that removing the CBAM module causes model performance to decline across all metrics. This indicates that establishing a feature recalibration mechanism effectively reinforces the complementary advantages of the dual-branch architecture, preserving fine-grained predictive details consistent with overall evolutionary trends.

Regarding the training stability of ST-DConvLSTM, the learned offsets exhibited smooth convergence throughout the training process. This stability is largely supported by the normalization layers within the ST-DualNet, which stabilize the feature distribution and prevent erratic offset predictions. Furthermore, while no explicit offset regularization was employed, the physical consistency of radar echo movement serves as an implicit constraint, guiding the DCN to learn stable and meteorologically meaningful deformation fields.

While the current experimental setting follows the standard protocol established in SmaAt-UNet [19] with a 30-min prediction horizon, we acknowledge that a longer prediction horizon would provide stronger evidence of the model’s ability to mitigate long-term information decay. Given that our ST-DualNet is specifically designed to address information decay through its deformable convolutions and independent spatiotemporal memory, evaluating it on longer prediction tasks is a critical direction for our future work. We anticipate that the advantages of our model will become even more pronounced as the prediction horizon increases.

Although ST-DualNet has achieved satisfactory progress in short-term precipitation forecasting tasks, it still exhibits certain limitations in capturing rapidly intensifying convective events. Regarding the spatial branch architecture, we currently employ temporal average pooling to generate global spatial context. Whilst this helps to smooth out noise and capture stable structural features, we believe that max pooling or attention-based pooling may prove more effective in forecasting extreme precipitation. Future work will explore dual-path pooling strategies to balance noise suppression with the preservation of extreme features. Furthermore, the current inputs are restricted to radar reflectivity, whereas precipitation processes are influenced by multiple physical quantities, including wind fields, air temperature, and humidity. We intend to introduce wind field data or satellite cloud imagery to construct a multi-modal input forecasting network, thereby further improving prediction accuracy for complex weather events.

5. Conclusions

To address the challenges of capturing non-rigid deformation in radar echoes and the loss of long-sequence information during short-term precipitation forecasting, we propose a spatiotemporal dual-branch neural network named ST-DualNet. Through experimentation, we have drawn the following principal conclusions.

Firstly, explicit spatiotemporal decoupling and dynamic perception strategies are pivotal for enhancing prediction accuracy. The ST-DConvLSTM unit integrates deformable convolutions to adaptively track non-rigid deformations in precipitation radar echoes, reducing feature misalignment during dynamic evolution. By introducing an independent spatiotemporal memory state M, it constructs cross-level information transmission channels that mitigate feature forgetting in long-sequence predictions. Concurrently, the Transformer module introduced in the spatial branch establishes long-range spatial dependencies, compensating for the limitations of traditional recurrent networks in extracting global static features.

Secondly, the adaptive fusion of heterogeneous features significantly enhances the model’s robustness. Experiments demonstrate that the simple superposition of temporal and spatial features is insufficient to fully leverage the dual-branch architecture’s advantages. By incorporating the CBAM attention mechanism, the model can automatically filter key information across both channel and spatial dimensions. This effectively resolves the semantic heterogeneity between dynamic temporal features and static spatial features, thereby improving the model’s predictive reliability under complex meteorological backgrounds.
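
The channel-then-spatial reweighting described above can be sketched in NumPy as follows. This is a simplified, CBAM-style illustration with random weights, not the trained module from the paper: merging the two branches by channel concatenation, the reduction ratio `r = 2`, the 7 × 7 spatial kernel, and all function names are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(f, w1, w2):
    """f: (C, H, W). A shared two-layer MLP scores avg- and max-pooled descriptors."""
    avg = f.mean(axis=(1, 2))
    mx = f.max(axis=(1, 2))
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)   # linear, ReLU, linear
    return sigmoid(mlp(avg) + mlp(mx))             # one weight per channel, shape (C,)

def spatial_attention(f, kernel):
    """kernel: (2, k, k), applied to the channel-wise avg and max maps with zero padding."""
    desc = np.stack([f.mean(axis=0), f.max(axis=0)])      # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(desc, ((0, 0), (pad, pad), (pad, pad)))
    h, w = f.shape[1:]
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = (padded[:, i:i + k, j:j + k] * kernel).sum()
    return sigmoid(out)                                   # one weight per pixel, shape (H, W)

def cbam(f, w1, w2, kernel):
    """Apply channel attention first, then spatial attention."""
    f = f * channel_attention(f, w1, w2)[:, None, None]
    return f * spatial_attention(f, kernel)[None, :, :]

rng = np.random.default_rng(0)
C, H, W, r = 8, 5, 5, 2
temporal = rng.standard_normal((C // 2, H, W))   # stand-in for temporal-branch features
spatial = rng.standard_normal((C // 2, H, W))    # stand-in for replicated spatial features
merged = np.concatenate([temporal, spatial], axis=0)

w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
kernel = rng.standard_normal((2, 7, 7)) * 0.1
fused = cbam(merged, w1, w2, kernel)
```

Because both attention maps lie in (0, 1), the module can only rescale features; it selects which channels and locations pass through rather than creating new responses.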

Thirdly, quantitative evaluation supports the effectiveness of this approach. At a threshold of r = 0.5 mm/h, ST-DualNet achieves a CSI of 0.748 and an HSS of 0.384, compared with 0.690 and 0.351 for PredRNN++, the strongest baseline in Table 2. In addition, the ablation experiments show that removing any of the core components degrades performance, which supports the necessity of each component of the dual-branch design.

Author Contributions

Conceptualization, Y.D. and B.Y.; methodology, Y.D.; software, Y.D.; validation, H.C. and Y.G.; formal analysis, B.Y.; investigation, T.B.; resources, B.Y.; data curation, Y.D.; writing—original draft preparation, Y.D.; writing—review and editing, B.Y. and H.C.; visualization, Y.D.; supervision, T.B. and Y.G.; project administration, B.Y.; funding acquisition, B.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Taishan Industrial Leading Talent Project Blue Talent Project and the Joint Funds of the National Natural Science Foundation of China (No. U22A2068).

Data Availability Statement

The data presented in this study are available at https://github.com/HansBambel/SmaAt-UNet (accessed on 6 July 2025).

Conflicts of Interest

Authors Bo Yin and Haipeng Cui are employed by Qingdao Jari Industry Control Technology Co., Ltd. Author Tao Bi is employed by Qingdao Port International Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Niu, D.; Diao, L.; Xu, L.; Zang, Z.; Chen, X.; Liang, S. Precipitation forecast based on multi-channel ConvLSTM and 3D-CNN. In Proceedings of the 2020 International Conference on Unmanned Aircraft Systems (ICUAS), Athens, Greece, 1–4 September 2020; pp. 367–371.
2. Zhao, C.; Ye, A.; Wu, L.; Zhan, S. A novel deep learning model for post-processing of short-and medium-term daily precipitation forecasts. Atmos. Res. 2025, 326, 108319.
3. Jing, J.; Li, Q.; Peng, X.; Ma, Q.; Tang, S. HPRNN: A hierarchical sequence prediction model for long-term weather radar echo extrapolation. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 4142–4146.
4. Fang, Z.; Li, M.; Xia, W. PF-UNet: A Short-term Precipitation Correction Model Integrating Multiple Meteorological Elements. In Proceedings of the 2024 2nd International Conference on Computer, Vision and Intelligent Technology (ICCVIT), Huaibei, China, 24–27 November 2024; pp. 1–9.
5. Yadav, N.; Ganguly, A.R. A deep learning approach to short-term quantitative precipitation forecasting. In Proceedings of the 10th International Conference on Climate Informatics, Oxford, UK, 23–25 September 2020; pp. 8–14.
6. Liao, Y.; Lu, S.; Yin, G. Short-Term and Imminent Rainfall Prediction Model Based on ConvLSTM and SmaAT-UNet. Sensors 2024, 24, 3576.
7. Wu, M.C.; Lin, G.F. The very short-term rainfall forecasting for a mountainous watershed by means of an ensemble numerical weather prediction system in Taiwan. J. Hydrol. 2017, 546, 60–70.
8. Novák, P.; Březková, L.; Frolík, P. Quantitative precipitation forecast using radar echo extrapolation. Atmos. Res. 2009, 93, 328–334.
9. Lindskog, M.; Landelius, T. Short-Range Numerical Weather Prediction of Extreme Precipitation Events Using Enhanced Surface Data Assimilation. Atmosphere 2019, 10, 587.
10. Wang, Y.; Yao, L.; Jiang, H.; Liu, T.; Lu, Y.; Zhou, C. Precipitation Nowcasting Based on Radar Echo Images via Muti-Scale Spatiotemporal LSTM. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5105614.
11. Guo, S.; Sun, N.; Pei, Y.; Li, Q. 3D-UNet-LSTM: A deep learning-based radar echo extrapolation model for convective nowcasting. Remote Sens. 2023, 15, 1529.
12. Bowler, N.E.H.; Pierce, C.E.; Seed, A. Development of a precipitation nowcasting algorithm based upon optical flow techniques. J. Hydrol. 2004, 288, 74–91.
13. Jing, J.; Li, Q.; Peng, X. MLC-LSTM: Exploiting the spatiotemporal correlation between multi-level weather radar echoes for echo sequence extrapolation. Sensors 2019, 19, 3988.
14. Liu, Q.; Xiao, Y.; Gui, Y.; Dai, G.; Li, H.; Zhou, X.; Ren, A.; Zhou, G.; Shen, J. MMF-RNN: A Multimodal Fusion Model for Precipitation Nowcasting Using Radar and Ground Station Data. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4101416.
15. An, S.; Oh, T.-J.; Sohn, E.; Kim, D. Deep learning for precipitation nowcasting: A survey from the perspective of time series forecasting. Expert Syst. Appl. 2025, 268, 126301.
16. Han, L.; Sun, J.; Zhang, W. Convolutional neural network for convective storm nowcasting using 3-D Doppler weather radar data. IEEE Trans. Geosci. Remote Sens. 2019, 58, 1487–1495.
17. Ma, Z.; Zhang, H.; Liu, J. MM-RNN: A multimodal RNN for precipitation nowcasting. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4101914.
18. Zhao, X.; Wang, H.; Bai, M.; Xu, Y.; Dong, S.; Rao, H.; Ming, W. A comprehensive review of methods for hydrological forecasting based on deep learning. Water 2024, 16, 1407.
19. Trebing, K.; Stańczyk, T.; Mehrkanoon, S. SmaAt-UNet: Precipitation nowcasting using a small attention-UNet architecture. Pattern Recognit. Lett. 2021, 145, 178–186.
20. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258.
21. Fernández, J.G.; Mehrkanoon, S. Broad-UNet: Multi-scale feature learning for nowcasting tasks. Neural Netw. 2021, 144, 419–427.
22. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848.
23. Cao, H.; Wu, Y.; Bao, Y.; Feng, X.; Wan, S.; Qian, C. UTrans-Net: A model for short-term precipitation prediction. Artif. Intell. Appl. 2023, 1, 90–97.
24. Yang, Y.; Mehrkanoon, S. AA-TransUNet: Attention augmented TransUNet for nowcasting tasks. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; pp. 1–8.
25. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems (NIPS 2015), Montreal, QC, Canada, 7–12 December 2015; Volume 28.
26. Wang, Y.; Long, M.; Wang, J.; Gao, Z.; Yu, P.S. PredRNN: Recurrent neural networks for predictive learning using spatiotemporal LSTMs. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
27. Shi, X.; Gao, Z.; Lausen, L.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Deep learning for precipitation nowcasting: A benchmark and a new model. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
28. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122.
29. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
30. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
31. Bai, C.; Sun, F.; Zhang, J.; Song, Y.; Chen, S. Rainformer: Features extraction balanced network for radar-based precipitation nowcasting. IEEE Geosci. Remote Sens. Lett. 2022, 19, 4023305.
32. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773.
33. Li, J.; Xiao, J.; Liu, H.; Du, X.; Liu, S. Spatiotemporal Ionospheric TEC Prediction with Deformable Convolution for Long-Term Spatial Dependencies. Atmosphere 2025, 16, 950.
34. Wang, Y.; Gao, Z.; Long, M.; Wang, J.; Yu, P.S. PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. In Proceedings of the International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 5123–5132.
35. Wang, Y.; Zhang, J.; Zhu, H.; Long, M.; Wang, J.; Yu, P.S. Memory in memory: A predictive neural network for learning higher-order non-stationarity from spatiotemporal dynamics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9154–9162.
36. Reulen, E.; Shi, J.; Mehrkanoon, S. GA-SmaAt-GNet: Generative adversarial small attention GNet for extreme precipitation nowcasting. Knowl.-Based Syst. 2024, 305, 112612.

Figure 1. Structure diagram of ConvLSTM.

Figure 2. The overall architecture of the ST-DualNet. T: temporal feature tensor; S′: replicated spatial feature tensor; F_{merge}: fused feature tensor; B: batch size; C: channel; T: time steps; H, W: height and width.

Figure 3. Structure diagram of ST-DConvLSTM.

Figure 4. Schematic of 3 × 3 deformable convolutions.

Figure 5. Differences in sampling locations for feature extraction in precipitation imagery using 3 × 3 ordinary convolution (a) and 3 × 3 deformable convolution (b).

Figure 6. Schematic diagram of the ST-DConvLSTM independent memory flow architecture. Note: A 3-layer stack is shown for general conceptual illustration; the actual implementation uses 2 layers for the encoder and decoder respectively.

Figure 7. Performance comparison of different models under varying rainfall intensity thresholds: (a) The trend of Critical Success Index (CSI). (b) The trend of Heidke Skill Score (HSS).

Figure 8. Visual comparison of frame-by-frame predictions of each model in typical precipitation cases. The first row shows the observed radar echo images, and the remaining rows show the predictions of UNet, SmaAt-UNet, ConvLSTM, PredRNN, PredRNN++ and ST-DualNet in sequence. From left to right, the columns show the first to the sixth predicted future frames. Additional examples are available from the corresponding author upon request.

Table 1. Comparison results with other models (r = 0.5 mm/h).

Model	CSI ↑	HSS ↑
UNet [19]	0.658	0.329
SmaAt-UNet [19]	0.647	0.322
ConvLSTM	0.677	0.343
PredRNN	0.683	0.345
PredRNN++	0.690	0.351
MIM [31]	0.666	0.337
Rainformer [31]	0.667	0.339
GA-SmaAt-GNet * [36]	0.793	0.374
ST-DualNet	0.748	0.384

* The results are cited directly from the original paper. Note that their test set (years 2017–2022) differs from the test set (year 2019) used in this work. ↑ indicates that higher values are better. Bold indicates the best performance in each column.
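
For readers who want to reproduce scores of this kind, the sketch below computes CSI and HSS from a thresholded forecast and observation using the standard contingency-table definitions. It is an illustration under stated assumptions: the strict `>` comparison, the textbook HSS form with a factor of 2, and the toy 2 × 3 rain-rate fields are choices made here, and the definitions given in the paper's evaluation section take precedence where they differ.

```python
import numpy as np

def contingency(pred, obs, threshold):
    """Count hits, false alarms, misses and correct negatives at a rain-rate threshold."""
    p = pred > threshold
    o = obs > threshold
    tp = int(np.sum(p & o))     # rain forecast and observed
    fp = int(np.sum(p & ~o))    # rain forecast but not observed
    fn = int(np.sum(~p & o))    # rain observed but not forecast
    tn = int(np.sum(~p & ~o))   # no rain in either
    return tp, fp, fn, tn

def csi(tp, fp, fn, tn):
    """Critical Success Index: hits over all forecast-or-observed events."""
    return tp / (tp + fn + fp)

def hss(tp, fp, fn, tn):
    """Heidke Skill Score (textbook form): skill relative to random chance."""
    num = 2.0 * (tp * tn - fn * fp)
    den = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    return num / den

# Toy rain-rate fields in mm/h (made-up values, not KNMI data).
obs = np.array([[0.0, 0.6, 1.2],
                [0.0, 0.0, 0.7]])
pred = np.array([[0.0, 0.8, 0.3],
                 [0.6, 0.0, 0.9]])

counts = contingency(pred, obs, threshold=0.5)   # (2, 1, 1, 2)
print("CSI:", csi(*counts))                      # 0.5
print("HSS:", round(hss(*counts), 4))            # 0.3333
```

Neither function guards against an empty denominator, which occurs when no pixel exceeds the threshold in either field; a production implementation would need to handle that case.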

Table 2. CSI, HSS and MSE for each model at threshold r = 0.5 mm/h.

Model	CSI ↑	HSS ↑	MSE ↓
UNet [19]	0.658	0.329	0.0122
SmaAt-UNet [19]	0.647	0.322	0.0122
ConvLSTM	0.677	0.343	0.0098
PredRNN	0.683	0.345	0.0102
PredRNN++	0.690	0.351	0.0095
ST-DualNet	0.748	0.384	0.0079
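
As a quick reading aid for Table 2, the short Python sketch below computes the relative improvement of ST-DualNet over the best baseline in each column. The numbers are copied from the table; the helper name `relative_gain` and the choice of "best baseline per metric" as the reference are assumptions of this example rather than a comparison made in the paper.

```python
# Values copied from Table 2 (threshold r = 0.5 mm/h): (CSI, HSS, MSE)
TABLE2 = {
    "UNet": (0.658, 0.329, 0.0122),
    "SmaAt-UNet": (0.647, 0.322, 0.0122),
    "ConvLSTM": (0.677, 0.343, 0.0098),
    "PredRNN": (0.683, 0.345, 0.0102),
    "PredRNN++": (0.690, 0.351, 0.0095),
    "ST-DualNet": (0.748, 0.384, 0.0079),
}
OURS = "ST-DualNet"

def relative_gain(ours, ref, higher_is_better):
    """Relative improvement of `ours` over `ref`, in percent."""
    diff = ours - ref if higher_is_better else ref - ours
    return 100.0 * diff / ref

baselines = {name: row for name, row in TABLE2.items() if name != OURS}
gains = {}
for idx, (metric, higher) in enumerate([("CSI", True), ("HSS", True), ("MSE", False)]):
    pick = max if higher else min                       # best baseline for this metric
    best = pick(row[idx] for row in baselines.values())
    gains[metric] = relative_gain(TABLE2[OURS][idx], best, higher)

for metric, gain in gains.items():
    print(f"{metric}: {gain:.1f}% better than the best baseline")
# CSI: 8.4% better than the best baseline
# HSS: 9.4% better than the best baseline
# MSE: 16.8% better than the best baseline
```

In all three columns the best baseline is PredRNN++, so the gains above are relative to that model.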