1. Introduction
Short-term precipitation forecasting, also known as precipitation nowcasting, is a significant research topic within meteorology [1]. Since ancient times, precipitation has played a substantial role in daily life, with our ancestors observing natural phenomena to predict the arrival of rain and prepare accordingly. In the modern era, precipitation forecasting retains considerable research value for both societal life and production activities. When categorized by temporal scale, precipitation forecasting can be divided into short-term, medium-term [2], and long-term forecasting [3]. Among these, short-term precipitation exerts the most severe impact on human society and production, while also presenting the greatest forecasting challenges [4]. Short-term precipitation forecasting aims to predict the intensity and distribution of rainfall over the next 0–2 h using radar image sequences [5]. Its outcomes hold immense practical value in areas such as heavy rainfall warnings, urban flood prevention, and meteorological support for shipping operations [6].
Existing short-term precipitation forecasting methods can be broadly categorized into numerical weather prediction (NWP) [7] methods and radar echo reflectance extrapolation [8] methods. NWP methods rely primarily on mathematical models, requiring complex atmospheric equations and massive amounts of observational data. Although they are generally accurate in weather forecasting, they are limited in short-term precipitation forecasting by their high computational cost, poor real-time performance, and difficulty in capturing complex atmospheric processes over short time spans and at small spatial scales [9]. In contrast, deep learning methods based on radar echo reflectance extrapolation can predict future echo distributions using only historical radar image sequences [10,11]. They then use the Z-R relationship [12] (where Z is radar reflectance and R is precipitation intensity) to convert reflectance into precipitation, enabling rapid forecasting. This approach excels in timeliness and scalability and has gradually become the more efficient and accurate mainstream choice for short-term precipitation forecasting [13].
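As a rough illustration of the conversion step, the sketch below inverts a power-law Z-R relation of the form Z = a·R^b. The coefficients a = 200 and b = 1.6 (the classical Marshall-Palmer values) and the function name are assumptions made here for illustration; the text does not state which coefficients are used.

```python
def dbz_to_rain_rate(dbz, a=200.0, b=1.6):
    """Convert a radar echo value in dBZ to a rain rate in mm/h via Z = a * R**b.

    a and b are assumed Marshall-Palmer coefficients, not values from the paper.
    """
    z_linear = 10.0 ** (dbz / 10.0)  # dBZ -> linear Z (mm^6 m^-3)
    return (z_linear / a) ** (1.0 / b)


if __name__ == "__main__":
    for dbz in (20.0, 35.0, 50.0):
        print(f"{dbz:.0f} dBZ -> {dbz_to_rain_rate(dbz):.2f} mm/h")
```

Because the relation is a simple power law, the conversion is applied pixel by pixel and adds negligible cost to the extrapolation step.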
In recent years, deep learning techniques have been extensively applied to various meteorological forecasting tasks and have demonstrated remarkable success [14,15]. In short-term precipitation forecasting in particular, existing deep learning methods can be roughly divided into two categories according to their core architecture: one is mainly based on convolutional neural networks (CNNs) [16] for spatial feature extraction, and the other is based on recurrent neural networks (RNNs) for temporal dynamic modeling [17,18].
CNN-based models have attracted considerable attention in short-term precipitation forecasting due to their robust spatial feature extraction capabilities. A series of variants centered on the classic U-Net encoder–decoder architecture have achieved notable progress in this domain. SmaAt-UNet [19], proposed by Kevin Trebing et al., stands as a foundational and highly efficient representative work in this direction. This model integrates the convolutional block attention module (CBAM) within the U-Net architecture and employs depthwise-separable convolutions (DSC) [20] in place of standard convolutions. This approach reduces the number of parameters to approximately one-quarter of the original U-Net while maintaining comparable or even superior forecasting performance. The model's advantages are its enhanced focus on key meteorological features through CBAM and the substantial gain in efficiency and deployment feasibility achieved through DSC.
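The parameter saving from depthwise-separable convolutions can be illustrated with a simple weight count. The sketch below compares one standard k × k convolution with a depthwise k × k convolution followed by a 1 × 1 pointwise convolution; the layer widths are hypothetical, and the per-layer ratio need not match the whole-model figure quoted above, which also depends on layers that are not replaced.

```python
def conv_params(c_in, c_out, k=3):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def dsc_params(c_in, c_out, k=3):
    """Weights of a depthwise k x k convolution plus a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out


if __name__ == "__main__":
    c_in, c_out = 64, 128  # hypothetical layer widths
    std, dsc = conv_params(c_in, c_out), dsc_params(c_in, c_out)
    print(f"standard: {std}, depthwise-separable: {dsc}, ratio: {dsc / std:.3f}")
```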
To extract multi-scale spatial features more effectively, Jesús García Fernández et al. proposed Broad-UNet [21]. This model introduces asymmetric parallel convolutions and Atrous Spatial Pyramid Pooling (ASPP) [22] modules within the U-Net encoder. Asymmetric parallel convolutions simultaneously extract multi-scale features using convolutional kernels of varying sizes, while the ASPP module expands the receptive field through convolutions with different dilation rates to fuse more global contextual information. Experiments demonstrate that Broad-UNet achieves superior accuracy to baseline models in both precipitation and cloud cover forecasting tasks.
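To make the effect of the dilation rate concrete, the following sketch computes the span covered by a dilated kernel, k + (k − 1)(d − 1) pixels per side. The rates used in the loop are hypothetical and are not taken from Broad-UNet.

```python
def effective_kernel(k, dilation):
    """Side length, in pixels, spanned by a k x k kernel with the given dilation rate."""
    return k + (k - 1) * (dilation - 1)


if __name__ == "__main__":
    for rate in (1, 6, 12, 18):  # hypothetical ASPP dilation rates
        size = effective_kernel(3, rate)
        print(f"rate {rate:2d}: {size} x {size}")
```

The number of weights stays at k × k for every rate, which is why parallel dilated branches enlarge the receptive field at little extra cost.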
In recent years, the Transformer architecture, with its robust global modeling capabilities, has been introduced into visual tasks to address the limitations of CNNs. Representative models include UTrans-Net [23], proposed by Hao Cao et al., and AA-TransUNet [24], proposed by Yimin Yang et al. UTrans-Net integrates Transformer modules into U-Net and uses a self-attention mechanism to assign weights to different meteorological elements in order to improve the effectiveness of feature extraction.
By contrast, AA-TransUNet [24] proposes a more systematic and higher-performing fusion architecture. Centred on TransUNet (a hybrid CNN-Transformer encoder and U-Net decoder), it further integrates the CBAM and depthwise-separable convolutions. This model incorporates CBAM into the CNN part of the encoder and each layer of the decoder, achieving attention enhancement in both channel and spatial dimensions, while using DSC in the decoder to control the number of parameters. Comprehensive experiments demonstrate that AA-TransUNet outperforms prior models such as SmaAt-UNet [19] and Broad-UNet [21] across multiple evaluation metrics, supporting the superiority of hybrid architectures that combine global attention with local convolutions for short-term forecasting tasks.
Recurrent neural networks (RNNs) and their variants possess inherent advantages in sequence prediction tasks due to their robust temporal modeling capabilities. In the field of short-term precipitation forecasting, researchers have focused on integrating RNNs with spatial feature extraction capabilities to construct predictive models capable of jointly modeling spatiotemporal dependencies.
A pioneering work in this area is the ConvLSTM [25] proposed by Shi et al. This model replaces the matrix multiplications in the fully connected LSTM (FC-LSTM) with convolution operations, yielding the convolutional LSTM unit. This innovation allows the state transitions within the model to retain spatial structure, thereby enabling direct processing of spatiotemporal sequence data. By stacking layers and constructing an encoding–forecasting architecture, ConvLSTM achieves end-to-end training. Because it models spatiotemporal correlations in a unified and intrinsic way, it significantly outperforms traditional optical flow methods (such as the ROVER algorithm) and FC-LSTM in precipitation forecasting tasks.
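For reference, the ConvLSTM cell is commonly written as below. The notation is introduced here only for illustration and is not defined elsewhere in this section: $*$ denotes convolution, $\circ$ the Hadamard product, $\sigma$ the sigmoid function, and $X_t$, $H_t$, $C_t$ the input, hidden state, and cell state at time step $t$.

\[
\begin{aligned}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f\right) \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o\right) \\
H_t &= o_t \circ \tanh\left(C_t\right)
\end{aligned}
\]

Replacing each $*$ with a matrix product recovers the FC-LSTM, which makes clear that the only structural change is the use of convolutions in the input-to-state and state-to-state transitions.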
To enhance the modeling of complex spatiotemporal dynamics, subsequent studies have substantially modified the structure of ConvLSTM. PredRNN [26] proposes a recurrent neural network architecture for spatiotemporal predictive learning, whose core is a novel Spatiotemporal Long Short-Term Memory unit (ST-LSTM). This method introduces a unified memory pool, allowing memory states to be transferred vertically between stacked RNN layers and horizontally across time steps, thus simultaneously modeling spatial appearance and temporal dynamics. PredRNN achieved state-of-the-art performance on multiple video prediction datasets at the time. Its advantages lie in its ability to capture long-term motion trajectories and detailed spatial deformations, generate clearer and more accurate prediction frames, and offer a flexible framework that is easily extended to other forecasting tasks.
Building on ConvLSTM, Shi et al. further proposed the Trajectory GRU (TrajGRU) [27] model to overcome the limitation that the location-invariant convolutional structure imposes on modeling complex motions such as rotation and scaling. The core improvement of this model lies in changing the recurrent connection structure from a fixed convolution to dynamically learned connections. TrajGRU retains the encoding–forecasting framework of ConvLSTM while significantly enhancing the model’s ability to represent location-variant motions like rotation and scaling. Furthermore, this work established the first large-scale precipitation forecasting benchmark, encompassing data, balanced loss functions, and online/offline evaluation protocols, and thereby laid a crucial foundation for subsequent research.
CNN models based on U-Net, together with variants that introduce attention mechanisms and Transformers, have steadily advanced short-term precipitation forecasting by improving the efficiency of feature extraction. Nevertheless, the key challenge remains how to model irregular spatial patterns and non-stationary long-term temporal dependencies jointly and more effectively. RNN-based methods have progressively enhanced the capacity to represent complex spatiotemporal dynamic evolution by designing more sophisticated memory mechanisms (e.g., PredRNN) and more flexible connection schemes (e.g., TrajGRU). Although these methods perform well in modeling temporal dependencies, they usually rely on sequential recursive computations, resulting in low training parallelism and long training time. In addition, the efficient integration of multi-scale spatial features with refined temporal modeling remains a challenge for this class of methods. The shortcomings of both CNN-based and RNN-based models in short-term precipitation forecasting form the impetus for this work.
To address the challenges that current short-term precipitation forecasting faces in handling irregular spatial morphology and complex spatiotemporal evolution, we propose a novel spatiotemporal dual-branch neural network, ST-DualNet. Through parallel temporal and spatial branches, this model achieves explicit decoupled learning and deep fusion of the dynamic evolution process and the spatial hierarchical features of the precipitation field.
The temporal branch of ST-DualNet is built around our newly designed Spatiotemporal Deformable Convolutional Long Short-Term Memory (ST-DConvLSTM) module. This module is one of the core innovations of ST-DualNet and makes two key improvements to the traditional ConvLSTM. First, it enhances the ability to remember long-term weather patterns by introducing an independent memory state M. Second, it replaces standard convolutions with deformable convolutions. This substitution gives the network an adaptive receptive field, enabling it to focus computation dynamically on the regions most valuable for prediction. As a result, the model can capture the key dynamics of the non-rigid, irregular morphology of precipitation systems and improve performance without an excessive increase in computational cost. Meanwhile, the spatial branch of the model integrates dilated convolution [28] and Transformer [29] modules in parallel, jointly capturing multi-scale local textures and global spatial dependencies in precipitation radar echo maps. Finally, we feed the heterogeneous features from the two branches into the CBAM [30] attention module to achieve adaptive feature weighting and fusion of multidimensional information, thereby generating prediction results with high spatiotemporal consistency.
The main contributions of this paper are summarised as follows:
We propose a novel dual-branch network architecture named ST-DualNet, which provides an efficient and logically clear modeling framework for complex spatiotemporal sequence forecasting tasks by explicitly decoupling the learning process of temporal dynamics and spatial features.
The ST-DConvLSTM module is designed as the core component of the temporal branch. By integrating deformable convolution and an independent memory state M, it enhances the capability of the model to represent both irregular spatial patterns and long-term temporal dependencies.
A hybrid spatial feature extraction branch is constructed, which integrates dilated convolutions and Transformers. This branch captures multi-scale and global spatial information of precipitation fields. Additionally, the CBAM module is adopted to achieve adaptive fusion of the dual-branch features.
Comprehensive comparative experiments were conducted on the publicly available KNMI radar precipitation dataset [19]. The results show that our proposed ST-DualNet model significantly outperforms existing baseline methods on several key evaluation metrics.
The structure of this paper is as follows. Section 2 presents the proposed spatiotemporal dual-branch neural network model for short-term precipitation forecasting. Section 3 presents comparative and ablation experiments on the ST-DualNet model. Section 4 and Section 5 provide the discussion and the conclusions of this paper.
3. Results
3.1. Dataset
ST-DualNet is trained and evaluated on the radar echo dataset released by the Royal Netherlands Meteorological Institute (Koninklijk Nederlands Meteorologisch Instituut, KNMI). This dataset is collected by two C-band Doppler weather radars located at De Bilt and Den Helder, covering the entire territory of the Netherlands and neighbouring regions [19]. The raw data span the period from 2016 to 2019, with a temporal resolution of 5 min and a spatial resolution of 1 km. Each pixel value in the original radar images represents the cumulative rainfall over the preceding 5 min. The data therefore contain not only precipitation intensity information but also the spatial distribution patterns of precipitation systems.
To enhance data quality and adapt the data to the model input, we implemented a rigorous preprocessing workflow. First, because the edges of the raw radar images contain invalid regions beyond the detection range, and to reduce computational redundancy, we cropped the centre of each image to 288 × 288 pixels. Second, precipitation data are inherently class-imbalanced: non-rainfall samples vastly outnumber rainfall samples, so direct training tends to bias models towards predicting zero values. This study therefore followed the strategy of Trebing et al. and constructed the NL-50 dataset [19], which retains only samples in which at least 50% of the pixels in the target image have non-zero precipitation values (intensity > 0 mm/min). This ensures the model focuses on capturing complex spatiotemporal evolution patterns under high precipitation probability. During sample construction, a sliding window generates continuous sequence samples. Each sample comprises 18 frames of radar echo images: the first 12 frames (corresponding to the past hour) serve as the input sequence, and the subsequent 6 frames (representing the next 30 min) constitute the prediction ground truth [19]. After processing, the training and validation sets together contain 5734 samples, whilst the test set contains 1557 samples.
To ensure the reliability and temporal independence of experimental results, the dataset was strictly partitioned by year to prevent leakage of future information. Data from 2016 to 2018 were selected for model training and validation, while the full year of 2019 was used for model testing [19]. The ratio of training, validation, and testing sets was approximately 7:1:2. Finally, to accelerate model convergence and remove the influence of numerical scale, all input data underwent maximum value normalisation. Pixel values were divided by the maximum rainfall intensity observed within the training set, mapping the data to the range [0, 1].
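The sample construction and normalisation steps described in this subsection can be sketched as follows. This is a minimal illustration on nested Python lists rather than the actual data pipeline; the function names are hypothetical, and applying the 50% filter to the last target frame is an assumption, since the text only refers to "the target image".

```python
def sliding_windows(frames, n_in=12, n_out=6, stride=1):
    """Yield (inputs, targets) pairs of consecutive frames from a time-ordered list."""
    total = n_in + n_out
    for start in range(0, len(frames) - total + 1, stride):
        window = frames[start:start + total]
        yield window[:n_in], window[n_in:]


def rainy_fraction(frame):
    """Fraction of pixels with non-zero precipitation in one 2D frame."""
    pixels = [value for row in frame for value in row]
    return sum(1 for value in pixels if value > 0) / len(pixels)


def keep_sample(targets, min_fraction=0.5):
    """NL-50-style filter, applied here to the last target frame (an assumption)."""
    return rainy_fraction(targets[-1]) >= min_fraction


def max_normalise(frame, train_max):
    """Divide pixel values by the maximum intensity observed in the training set."""
    return [[value / train_max for value in row] for row in frame]


if __name__ == "__main__":
    # 20 tiny 2 x 2 frames standing in for cropped 288 x 288 radar images.
    frames = [[[0.0, float(t)], [float(t), 0.0]] for t in range(20)]
    samples = [s for s in sliding_windows(frames) if keep_sample(s[1])]
    print(len(samples), "samples kept")
```

Computing the normalisation constant from the training years only keeps the 2019 test data out of every preprocessing statistic, in line with the year-based split.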
3.2. Model Evaluation
To comprehensively evaluate the performance of the ST-DualNet model in short-term precipitation forecasting tasks, we adopt a standard quantitative evaluation system commonly used in meteorology, which covers both continuous and categorical metrics. The mean squared error (MSE) serves as a continuous metric to measure the overall deviation between predicted precipitation maps and actual radar echo maps at the pixel level. A lower MSE value indicates that the precipitation intensity distribution generated by the model is closer to the real observation, reflecting better fitting performance. Its calculation formula is given as follows:
\[ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \]
where $y_i$ denotes the true precipitation intensity, $\hat{y}_i$ represents the precipitation intensity predicted by the model, and $n$ is the total number of pixels in the image.
Since precipitation forecasting focuses not only on numerical accuracy but also on the ability to locate precipitation events spatially, we also introduce a set of categorical evaluation metrics. Based on the intensity threshold r of meteorological radar echoes, continuous forecast maps and ground truth maps are converted into binary masks. The confusion matrix is then computed to determine True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN). Based on this confusion matrix, the following key metrics are calculated: Precision measures the proportion of correctly predicted precipitation areas among all areas predicted as precipitation; Recall reflects the completeness of the model in detecting actual precipitation regions; the F1 score represents the harmonic mean of Precision and Recall, providing a comprehensive assessment of detection performance. Most crucially, the Critical Success Index (CSI) indicates the model’s capability to accurately identify precipitation events. Additionally, Accuracy reflects the proportion of correctly predicted pixels out of the total pixels, while the Heidke Skill Score (HSS) [27] measures the overall improvement of the model relative to random forecasting. Generally, higher Precision, Recall, F1, CSI, HSS and Accuracy values indicate superior precipitation detection performance. The calculation formulas are as follows:
\[ \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]
\[ \mathrm{CSI} = \frac{TP}{TP + FP + FN}, \qquad \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
\[ \mathrm{HSS} = \frac{TP \times TN - FN \times FP}{(TP + FN)(FN + TN) + (TP + FP)(FP + TN)} \]
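The evaluation metrics described above can be sketched in a few lines of Python operating on flattened pixel sequences. Two points are assumptions rather than statements from the text: values greater than or equal to r are treated as precipitation, and HSS is written without a leading factor of 2 (half the textbook value), a convention that is consistent with the magnitude of the HSS scores reported in the tables.

```python
def mse(pred, truth):
    """Mean squared error over two equally long flat sequences of pixel values."""
    return sum((t - p) ** 2 for p, t in zip(pred, truth)) / len(truth)


def confusion_counts(pred, truth, r):
    """Threshold both fields at r (>= r counts as rain, an assumption) and count cells."""
    tp = fp = tn = fn = 0
    for p, t in zip(pred, truth):
        p_rain, t_rain = p >= r, t >= r
        if p_rain and t_rain:
            tp += 1
        elif p_rain:
            fp += 1
        elif t_rain:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def _ratio(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def categorical_scores(tp, fp, tn, fn):
    """Precision, Recall, F1, CSI, Accuracy and HSS from confusion-matrix counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": _ratio(2 * precision * recall, precision + recall),
        "csi": _ratio(tp, tp + fp + fn),
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        # Assumed HSS convention: no leading factor of 2.
        "hss": _ratio(tp * tn - fn * fp, (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)),
    }


if __name__ == "__main__":
    pred = [0.6, 0.7, 0.1, 0.0, 0.9, 0.2]
    truth = [0.8, 0.2, 0.6, 0.0, 1.0, 0.1]
    print(mse(pred, truth), categorical_scores(*confusion_counts(pred, truth, r=0.5)))
```

Unlike MSE, the categorical scores depend on r, so the same pair of maps yields a different confusion matrix for each rainfall level.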
3.3. Experimental Setup
All experiments in this study were conducted on a high-performance computing server running a Linux system. The server is equipped with three NVIDIA GeForce RTX 4090 GPUs (24 GB memory each) with driver version 535.113.01. The experiments were implemented in Python 3.8 using the PyTorch 2.6.0 deep learning framework. To further clarify the model’s structural scale and computational complexity for practical deployment, we analyzed its total parameters and FLOPs. The ST-DualNet consists of approximately 5.12 M parameters, and its total computational cost for a single forward pass (predicting 6 frames from 12 input frames) is 14.04 GFLOPs.
The initial learning rate during the model training phase is set to 0.0005, with a batch size of 4 and a patch size of 4. The number of hidden units in the recurrent cells is configured as ‘8, 8, 8, 8’. The learning rate automatically decays to 0.1 times its original value when the validation set loss fails to decrease for 15 consecutive epochs. The loss function employs Mean Squared Error (MSE) to measure the error between predicted precipitation intensity and actual values. The training process typically converges within approximately 150 epochs, ensuring the model reaches a stable and optimal state.
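The learning-rate rule described above can be sketched in plain Python as follows. The class name is hypothetical and the exact trigger condition (decay after 15 consecutive epochs without a new best validation loss) is our reading of the text; PyTorch offers a built-in scheduler of this kind, `torch.optim.lr_scheduler.ReduceLROnPlateau`, although this section does not state whether it was used.

```python
class PlateauDecay:
    """Multiply the learning rate by `factor` after `patience` epochs without improvement."""

    def __init__(self, lr=5e-4, factor=0.1, patience=15):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the learning rate to use next."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


if __name__ == "__main__":
    scheduler = PlateauDecay()
    losses = [1.0] + [1.0] * 15  # one best epoch followed by 15 stagnant epochs
    for epoch, loss in enumerate(losses):
        print(epoch, scheduler.step(loss))
```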
3.4. Experimental Results and Comparison with Mainstream Models
In this subsection, to validate the effectiveness of the proposed ST-DualNet model in short-term precipitation forecasting, we conduct extensive comparative evaluations against representative mainstream methods and advanced models in the field. The selected benchmark models fall into three categories. The first category comprises classical spatiotemporal sequence prediction models, such as ConvLSTM and its improved variants PredRNN [26] and PredRNN++ [34], which represent the mainstream direction based on recurrent neural networks. The second category comprises CNN-based segmentation models, including the standard UNet and its meteorologically optimized variant SmaAt-UNet. The third category encompasses recently proposed high-performance architectures such as MIM [35], Rainformer [31], and GA-SmaAt-GNet [36].
To ensure a rigorous evaluation, we ran the proposed ST-DualNet and the classical baselines under unified settings. For the CNN-based baseline models, the 12 input radar frames are stacked along the channel dimension, resulting in an input tensor of shape (B, 12, H, W). This configuration allows the 2D convolutional kernels to capture temporal correlations by treating time steps as feature channels. For MIM and Rainformer, we directly cite the results reported in their original papers to avoid reproduction errors, as they use the identical KNMI NL-50 dataset and evaluation protocol. For GA-SmaAt-GNet, we likewise cite the results from its original paper, with specific experimental differences noted in the corresponding tables. Although its evaluation spanned a broader period than our test set, both studies are based on the same KNMI radar infrastructure and data processing protocols. Citing the original results represents this baseline at its verified performance and avoids the risks of a sub-optimal re-implementation. While this temporal misalignment is a limitation, the consistent data source suggests that the relative strengths of the models, in particular ST-DualNet’s higher HSS, still hold.
Table 1 presents the quantitative comparison results between ST-DualNet and current mainstream spatiotemporal sequence prediction models under the 0.5 mm/h threshold. Overall, ST-DualNet demonstrates outstanding performance across all key metrics. Compared to the classic ConvLSTM and PredRNN series models, ST-DualNet achieves significant improvements in both CSI and HSS, two critical meteorological metrics. For instance, relative to PredRNN++, the proposed model increases CSI from 0.690 to 0.748 (a relative improvement of about 8.4%) and raises HSS from 0.351 to 0.384 (a relative improvement of about 9.4%). This indicates that ST-DualNet can more effectively capture the non-rigid deformation and complex motion trajectories of precipitation radar echoes, thereby outperforming traditional RNN-based units.
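The relative gains over PredRNN++ can be checked by dividing the absolute gain by the PredRNN++ score, which serves as the base. The helper name below is hypothetical; the scores are those quoted in the paragraph above.

```python
def relative_gain(baseline, new):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


if __name__ == "__main__":
    print(f"CSI: {relative_gain(0.690, 0.748):.2f}%")
    print(f"HSS: {relative_gain(0.351, 0.384):.2f}%")
```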
ST-DualNet also exhibits strong performance when compared to recently proposed high-performance models. Against Rainformer, ST-DualNet shows advantages of 0.081 in CSI and 0.045 in HSS. This verifies that the carefully designed convolutional and recurrent architecture retains distinctive strengths in processing local texture details and short-term dynamic variations. It is worth noting that although GA-SmaAt-GNet shows competitive performance in CSI, ST-DualNet achieves the highest HSS value of 0.384 among all compared models, ranking first in this metric that better reflects the genuine predictive skill of a model. Given that the HSS metric effectively filters out random prediction noise while comprehensively balancing hits and false alarms, this result suggests that ST-DualNet attains a currently superior level in terms of prediction reliability and robustness.
In short-term precipitation forecasting, a model is required not only to accurately distinguish between rainy and non-rainy areas but also to capture precipitation events with high intensity. We set the precipitation intensity thresholds r at 0.5 mm/h, 2 mm/h, 5 mm/h and 10 mm/h, corresponding to light rain, moderate rain, heavy rain and torrential rain levels respectively. For each threshold, we calculated the CSI, HSS and MSE metrics, comparing ST-DualNet against UNet, SmaAt-UNet, ConvLSTM, PredRNN, and PredRNN++. The results are shown in Table 2, Table 3, Table 4 and Table 5.
From the perspective of the MSE metric, ST-DualNet consistently achieves the lowest error values across all tests, showing a significant reduction compared to both the UNet series based on convolutional architectures and the LSTM series based on recurrent architectures. This indicates that the dual-branch architecture effectively mitigates the blurring effect in predicted images through independent spatiotemporal modeling and efficient feature fusion, demonstrating a clear advantage in the accuracy of reconstruction at the pixel level.
In Table 2, Table 3, Table 4 and Table 5, the Mean Squared Error (MSE) is reported as a global continuous metric representing the overall pixel-wise accuracy of the model. Consequently, for a specific model, the MSE is identical across the four tables. Conversely, the categorical metrics (CSI and HSS) are calculated based on specific intensity thresholds (r) and thus vary to reflect the model’s skill in capturing different rainfall levels.
Figure 7 illustrates the variations in CSI and HSS scores. Although the performance of all models degrades as the threshold increases, the proposed ST-DualNet consistently outperforms the other methods across all thresholds, demonstrating superior robustness, especially for heavier rainfall events.
In the fundamental precipitation forecasting task (r = 0.5 mm/h), ST-DualNet also demonstrates outstanding performance, achieving a CSI of 0.748 and an HSS of 0.384. This confirms the model’s exceptional accuracy in the basic classification task of distinguishing between rainy and non-rainy conditions.
As the precipitation intensity threshold increases, all models exhibit a natural decline in predictive performance. However, ST-DualNet demonstrates exceptional robustness during moderate to heavy rainfall events. At the moderate rainfall level, while SmaAt-UNet shows strong competitiveness, ST-DualNet maintains its lead with a CSI value of 0.578. More notably, traditional RNN models exhibit particularly severe performance degradation under the most challenging heavy and torrential rain conditions. For instance, at the extreme threshold of 10.0 mm/h, PredRNN achieves a CSI of merely 0.110 and ConvLSTM 0.129, struggling to capture the core of high-intensity precipitation. By contrast, ST-DualNet maintains a CSI of 0.221 and an HSS of 0.180 at this threshold, significantly outperforming all benchmarks. This outcome is consistent with the pivotal role of the ST-DConvLSTM unit within the model’s temporal branch. Its internal deformable convolution mechanism adaptively captures the non-rigid deformation and rapid movement trajectories of intense precipitation radar echoes, thereby reducing false negatives in extreme weather scenarios. This supports the method’s capability and reliability in handling complex spatiotemporal dynamic variations.
In addition to the quantitative evaluation, we conducted a visualization analysis to intuitively demonstrate the models’ capability in capturing spatiotemporal evolution patterns. Figure 8 displays the continuous six-frame forecast results for a typical precipitation case under the threshold condition of r = 0.5 mm/h.
As can be clearly observed in the figure, with increasing forecast time steps, the images generated by the comparison models gradually exhibit a pronounced smoothing effect, leading to the loss of textural detail in high-intensity echo regions. In contrast, ST-DualNet effectively mitigates this blurring issue, not only accurately predicting the movement trajectories of precipitation radar echoes but also preserving the shape characteristics and intensity distribution of the radar echoes exceptionally well, yielding results closest to the true images.
3.5. Ablation Experiments and Analysis
To thoroughly investigate the effectiveness of each core component within ST-DualNet and their contribution to the model’s overall performance, this subsection conducts systematic ablation experiments on the KNMI NL-50 dataset. Using the full model (Ours) as the baseline, we constructed five distinct variant models for comparative validation: w/o DeformConv replaces deformable convolutions in the temporal branch with standard convolutions to verify the necessity of dynamic deformation modeling; w/o ST-Memory removes the spatiotemporal memory unit M from ST-DConvLSTM, retaining only the traditional H and C states to validate the role of long-term sequential memory; w/o Spatial Branch entirely removes the spatial branch, retaining only the temporal branch for prediction to assess the benefit of the dual-branch architecture; w/o Transformer removes the Transformer module from the spatial branch, using only convolutions to extract local features to validate the importance of global dependency modeling; w/o CBAM removes the attention mechanism in the feature fusion stage, employing direct channel concatenation to validate the efficacy of adaptive feature fusion. Detailed quantitative results of the ablation experiments are presented in Table 6.
The data in Table 6 show that removing the spatial branch causes the largest loss in model performance. Compared with the complete model, the w/o Spatial Branch variant drops in CSI and HSS by approximately 0.069 and 0.041, respectively, while the MSE increases notably to 0.01072. This outcome indicates that the global static visual features supplied by the spatial branch are crucial for compensating for the loss of detail in the temporal branch during long sequence predictions. Furthermore, the w/o CBAM variant also exhibits a notable performance decline, with CSI dropping to 0.702. This indicates that simple channel concatenation of spatiotemporal features is insufficient to fully exploit their complementary nature. Introducing an attention mechanism for adaptive filtering and recalibration of heterogeneous features is a critical step in enhancing prediction accuracy.
Regarding the internal design of the temporal branch, the comparison between the w/o DeformConv variant and the full model demonstrates the contribution of deformable convolution (DCN). While the absolute improvement in CSI at the 0.5 mm/h threshold is approximately 0.004, we argue that this gain is meaningful for two reasons. First, in high-resolution precipitation forecasting where the CSI has already reached a high-performance plateau, any incremental improvement is challenging to achieve, and even a small margin can lead to more accurate precipitation forecasts in practical applications. Second, the necessity of DCN should be evaluated not only by the absolute gain in a single evaluation metric, but also by its contribution to the overall robustness and generalization of the model. Moreover, the additional parameters introduced by DCN are relatively limited. For a standard 3 × 3 kernel, the offset-prediction layer outputs only 18 channels (an x and a y offset for each of the 9 sampling points). Our internal profiling shows that the parameter count of ST-DConvLSTM increases by less than 6% compared to standard ConvLSTM, which we believe is a reasonable trade-off for the added structural flexibility. Meanwhile, the experimental results of w/o ST-Memory indicate that the absence of a dedicated spatiotemporal memory unit leads to information forgetting when handling long sequence dependencies, consequently impairing prediction accuracy.
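The figure of 18 offset channels follows from predicting one (Δy, Δx) pair for each of the 3 × 3 sampling points. The sketch below makes this count explicit and estimates the weights of the offset-prediction convolution. The input channel count is hypothetical, and the assumption that the offset predictor uses the same kernel size as the main convolution is a common design choice, not a detail taken from the text.

```python
def offset_channels(k=3, offset_groups=1):
    """Offset channels predicted per location: one (dy, dx) pair per kernel tap."""
    return 2 * k * k * offset_groups


def offset_conv_params(c_in, k=3, offset_groups=1, bias=True):
    """Weights of a k x k convolution that maps c_in channels to the offset field."""
    c_off = offset_channels(k, offset_groups)
    return k * k * c_in * c_off + (c_off if bias else 0)


if __name__ == "__main__":
    c_in = 8  # hypothetical number of input channels to the offset predictor
    print(offset_channels(), offset_conv_params(c_in))
```

Because the number of offset channels depends only on the kernel size, the overhead grows linearly with the input width of the offset predictor and stays small relative to the gate convolutions of the recurrent cell.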
Regarding the internal design of the spatial branch, a comparison between the w/o Transformer variant and the complete model shows that removing the Transformer module leads to a decline in all evaluation metrics to varying degrees. This indicates that relying solely on convolutional operations struggles to effectively capture long-range spatial dependencies within images. In contrast, the self-attention mechanism of the Transformer successfully enhances the spatial branch’s ability to perceive the macroscopic distribution patterns of precipitation systems, thereby further improving the final prediction performance. While removing the Transformer module results in a CSI reduction of approximately 0.008, this lightweight spatial branch is essential for capturing long-range spatial dependencies that standard convolutions struggle to perceive. Given that the Transformer component is highly optimized with only a few layers, its contribution to the overall 5.12 M parameters is minimal, making the performance gain a cost-effective improvement for modeling macroscopic precipitation patterns.
Thus, the superior performance of ST-DualNet stems from the integration of its components. The dual-branch architecture establishes the foundation for feature complementarity, while CBAM achieves efficient feature fusion. Meanwhile, ST-DConvLSTM and the Transformer play indispensable roles in capturing local dynamics and modeling global static patterns, respectively.
4. Discussion
Through comparative experiments and ablation analysis, we found that ST-DualNet outperforms mainstream models such as ConvLSTM and PredRNN on the KNMI NL-50 dataset.
The experimental results first confirm the effectiveness of the improved ST-DConvLSTM unit in the temporal branch. By incorporating deformable convolutions, the unit gives the network the ability to adaptively adjust its receptive field. This enables the model to dynamically track the motion of radar echoes, in line with the physical properties of atmospheric fluids. Consequently, metrics such as MSE, CSI, and HSS are significantly improved. Meanwhile, to address the common issues of gradient vanishing and information forgetting in long-sequence prediction, the unit introduces an independent spatiotemporal memory state M, distinct from the traditional ConvLSTM cell state C. State M is propagated vertically between layers and horizontally along the temporal axis, establishing a gradient memory flow. This enables deep networks to preserve high-dimensional spatiotemporal features from the initial time step. The performance degradation observed in the ablation experiments when M is removed supports the critical role of this independent memory mechanism in sustaining long-term prediction stability.
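The routing of M can be illustrated with a short sketch. It assumes the zigzag schedule used by PredRNN-style stacks, in which M passes upward through the layers within a time step and then from the top layer to the bottom layer of the next step. The exact schedule in ST-DualNet is not spelled out here, so this is an assumption, and the function name `memory_source` is hypothetical.

```python
def memory_source(t: int, layer: int, num_layers: int):
    """Return the (time, layer) cell that hands M to cell (t, layer).

    Assumed zigzag routing (PredRNN-style), not confirmed for ST-DualNet:
    - inside a time step, M comes from the layer directly below;
    - at the bottom layer, M comes from the top layer of the previous step;
    - the very first cell has no predecessor and starts from zeros.
    """
    if layer > 0:
        return (t, layer - 1)
    if t > 0:
        return (t - 1, num_layers - 1)
    return None


# Trace where M comes from for a 3-layer stack over 2 time steps.
for t in range(2):
    for layer in range(3):
        print((t, layer), "<-", memory_source(t, layer, num_layers=3))
```

Under this routing every cell is reachable from the first time step along a single chain of M updates, which is the path the text refers to when it describes features from the initial time step being preserved.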
The experimental results also confirm the superiority of the spatiotemporal dual-branch strategy in meteorological forecasting tasks. Traditional methods based on ConvLSTM attempt to encode both spatial and temporal information within a single unit, which often leads to premature blurring of spatial details in deep networks. ST-DualNet effectively separates the modeling tasks of dynamic evolution and static structure by designing independent temporal and spatial branches. The Transformer introduced in the spatial branch explicitly models global spatial dependencies, compensating for the local nature of convolutional operations and ensuring the overall morphological plausibility of predicted images. Ablation experiments show that removing the spatial branch results in a significant performance drop, which further confirms the crucial role of global static features in maintaining the stability of long-sequence predictions.
The effective fusion of features from the two branches constitutes another critical factor in enhancing prediction accuracy. Simple channel concatenation often fails to address the heterogeneity of features across different branches in terms of semantic hierarchy and numerical distribution. The CBAM attention mechanism introduced in our work acts as an adaptive gating mechanism. Along the channel dimension, it filters key feature maps based on precipitation intensity, suppressing background noise. Along the spatial dimension, it guides the model to focus on high-intensity echo core regions. Experimental data demonstrate that removing the CBAM module causes model performance to decline across all metrics. This indicates that establishing a feature recalibration mechanism effectively reinforces the complementary advantages of the dual-branch architecture, preserving fine-grained predictive details consistent with overall evolutionary trends.
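The two-stage gating described above can be sketched in plain Python. This is a minimal illustration of the general CBAM recipe (channel gate from pooled descriptors passed through a shared MLP, then a spatial gate from a convolution over channel-wise mean and max maps), not the implementation used in ST-DualNet. All weights are hand-set placeholders, and a 3 × 3 spatial kernel is assumed for brevity where CBAM commonly uses 7 × 7.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def channel_attention(x, w1, w2):
    """x: [C][H][W]; w1: [C_r][C]; w2: [C][C_r]. Returns one gate in (0, 1) per channel."""
    def mlp(vec):  # shared two-layer MLP with ReLU, no bias
        hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]
    mx = [max(map(max, ch)) for ch in x]
    return [sigmoid(a + m) for a, m in zip(mlp(avg), mlp(mx))]


def spatial_attention(x, kernel):
    """kernel: [2][k][k], applied with zero padding to the stacked mean/max maps."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    k = len(kernel[0])
    pad = k // 2
    mean_map = [[sum(x[c][i][j] for c in range(C)) / C for j in range(W)] for i in range(H)]
    max_map = [[max(x[c][i][j] for c in range(C)) for j in range(W)] for i in range(H)]
    maps = [mean_map, max_map]
    gate = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            acc = 0.0
            for m in range(2):
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - pad, j + dj - pad
                        if 0 <= ii < H and 0 <= jj < W:
                            acc += kernel[m][di][dj] * maps[m][ii][jj]
            gate[i][j] = sigmoid(acc)
    return gate


def cbam(x, w1, w2, kernel):
    """Apply the channel gate first, then the spatial gate, as in CBAM."""
    ch_gate = channel_attention(x, w1, w2)
    x = [[[v * g for v in row] for row in ch] for ch, g in zip(x, ch_gate)]
    sp_gate = spatial_attention(x, kernel)
    return [[[v * sp_gate[i][j] for j, v in enumerate(row)]
             for i, row in enumerate(ch)] for ch in x]


# Toy fused feature map: 2 channels, 2 x 2 pixels; all weights are placeholders.
features = [[[4.0, 0.0], [0.0, 8.0]], [[2.0, 2.0], [2.0, 2.0]]]
w1 = [[0.5, -0.5]]      # C -> C_r with C = 2, C_r = 1
w2 = [[1.0], [-1.0]]    # C_r -> C
kernel = [[[0.1] * 3 for _ in range(3)] for _ in range(2)]
recalibrated = cbam(features, w1, w2, kernel)
```

Because both gates lie strictly between 0 and 1, the module can only rescale features, which is why it behaves as a recalibration step rather than a new feature extractor.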
Regarding the training stability of ST-DConvLSTM, the learned offsets converged smoothly throughout the training process. This stability is largely supported by the normalization layers within ST-DualNet, which stabilize the feature distribution and prevent erratic offset predictions. Furthermore, while no explicit offset regularization was employed, the physical consistency of radar echo movement serves as an implicit constraint, guiding the DCN to learn stable and meteorologically meaningful deformation fields.
While the current experimental setting follows the standard protocol established in SmaAt-UNet [19] with a 30-min prediction horizon, we acknowledge that a longer prediction horizon would provide stronger evidence of the model’s ability to mitigate long-term information decay. Given that our ST-DualNet is specifically designed to address information decay through its deformable convolutions and independent spatiotemporal memory, evaluating it on longer prediction tasks is a critical direction for our future work. We anticipate that the advantages of our model will become even more pronounced as the prediction horizon increases.
Although ST-DualNet has achieved satisfactory progress in short-term precipitation forecasting tasks, it still exhibits certain limitations in capturing rapidly intensifying convective events. Regarding the spatial branch architecture, we currently employ temporal average pooling to generate global spatial context. Whilst this helps to smooth out noise and capture stable structural features, we believe that max pooling or attention-based pooling may prove more effective in forecasting extreme precipitation. Future work will explore dual-path pooling strategies to balance noise suppression with the preservation of extreme features. Furthermore, the current inputs are restricted to radar reflectivity, whereas precipitation processes are influenced by multiple physical quantities, including wind fields, air temperature, and humidity. We intend to introduce wind field data or satellite cloud imagery to construct a multi-modal input forecasting network, thereby further improving prediction accuracy for complex weather events.
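The trade-off between the two pooling choices can be seen on a toy sequence. The sketch below pools a stack of frames over the time axis with either the mean or the maximum. The intensity values are hypothetical and the helper name `temporal_pool` is not taken from the model code.

```python
def temporal_pool(frames, mode="mean"):
    """Pool a list of T frames (each [H][W]) over time, pixel by pixel."""
    if mode not in ("mean", "max"):
        raise ValueError("mode must be 'mean' or 'max'")
    T = len(frames)
    H, W = len(frames[0]), len(frames[0][0])
    pooled = []
    for i in range(H):
        row = []
        for j in range(W):
            values = [frames[t][i][j] for t in range(T)]
            row.append(sum(values) / T if mode == "mean" else max(values))
        pooled.append(row)
    return pooled


# One pixel that stays weak for three frames and intensifies sharply in the last one.
frames = [[[10.0]], [[10.0]], [[10.0]], [[50.0]]]
print(temporal_pool(frames, "mean"))  # [[20.0]]
print(temporal_pool(frames, "max"))   # [[50.0]]
```

Averaging dilutes the late peak to less than half its value, whereas max pooling keeps it but would equally keep a single noisy spike, which is the tension a dual-path strategy would aim to balance.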
5. Conclusions
To address the challenges of capturing non-rigid deformation in radar echoes and the loss of long-sequence information during short-term precipitation forecasting, we propose a spatiotemporal dual-branch neural network named ST-DualNet. Through experimentation, we have drawn the following principal conclusions.
Firstly, explicit spatiotemporal decoupling and dynamic perception strategies are pivotal for enhancing prediction accuracy. The proposed ST-DConvLSTM unit achieves adaptive tracking of non-rigid deformations in precipitation radar echoes by integrating deformable convolutions, significantly reducing feature misalignment during dynamic evolution. By introducing an independent spatiotemporal memory state M, it constructs cross-level information transmission channels, mitigating feature forgetting in long-sequence predictions. Concurrently, the Transformer module introduced in the spatial branch establishes long-range spatial dependencies, compensating for the limitations of traditional recurrent networks in extracting global static features.
Secondly, the adaptive fusion of heterogeneous features significantly enhances the model’s robustness. Experiments demonstrate that the simple superposition of temporal and spatial features is insufficient to fully leverage the dual-branch architecture’s advantages. By incorporating the CBAM attention mechanism, the model can automatically filter key information across both channel and spatial dimensions. This effectively resolves the semantic heterogeneity between dynamic temporal features and static spatial features, thereby improving the model’s predictive reliability under complex meteorological backgrounds.
Thirdly, quantitative evaluation has validated the superiority of this approach. Experimental results demonstrate that ST-DualNet outperforms mainstream models in both CSI and HSS metrics. In particular, the ablation experiments show that removing any of the core components leads to performance degradation, fully validating the rationality and necessity of the dual-branch architecture design.
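For reference, the two categorical scores named above can be computed from a 2 × 2 contingency table using their standard definitions. The sketch below is a minimal illustration, not the evaluation code behind the reported results. It assumes flattened prediction and observation arrays in mm/h, counts a value as an event when it is greater than or equal to the threshold, and uses made-up sample values.

```python
def contingency(pred, obs, threshold):
    """Count hits, false alarms, misses and correct negatives at a rain-rate threshold."""
    hits = false_alarms = misses = correct_neg = 0
    for p, o in zip(pred, obs):
        p_event, o_event = p >= threshold, o >= threshold  # ">=" is an assumed convention
        if p_event and o_event:
            hits += 1
        elif p_event:
            false_alarms += 1
        elif o_event:
            misses += 1
        else:
            correct_neg += 1
    return hits, false_alarms, misses, correct_neg


def csi(pred, obs, threshold):
    """Critical Success Index: hits / (hits + misses + false alarms)."""
    a, b, c, _ = contingency(pred, obs, threshold)
    denom = a + b + c
    return a / denom if denom else float("nan")


def hss(pred, obs, threshold):
    """Heidke Skill Score: accuracy relative to random chance."""
    a, b, c, d = contingency(pred, obs, threshold)
    denom = (a + c) * (c + d) + (a + b) * (b + d)
    return 2.0 * (a * d - b * c) / denom if denom else float("nan")


pred = [0.6, 0.7, 0.1, 0.0, 0.9, 0.2]  # hypothetical forecast values, mm/h
obs = [0.8, 0.2, 0.6, 0.0, 1.0, 0.1]   # hypothetical observations, mm/h
print(csi(pred, obs, 0.5))             # 0.5
print(round(hss(pred, obs, 0.5), 3))   # 0.333
```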