3D-UNet-LSTM: A Deep Learning-Based Radar Echo Extrapolation Model for Convective Nowcasting

Abstract: Radar echo extrapolation is a commonly used approach for convective nowcasting. The evolution of convective systems over a very short term can be foreseen according to the extrapolated reflectivity images. Recently, deep neural networks have been widely applied to radar echo extrapolation and have achieved better forecasting performance than traditional approaches. However, it is difficult for existing methods to combine predictive flexibility with the ability to capture temporal dependencies at the same time. To leverage the advantages of the previous networks while avoiding the mentioned limitations, a 3D-UNet-LSTM model, which has an extractor-forecaster architecture, is proposed in this paper. The extractor adopts 3D-UNet to extract comprehensive spatiotemporal features from the input radar images. In the forecaster, a newly designed Seq2Seq network exploits the extracted features and uses different convolutional long short-term memory (ConvLSTM) layers to iteratively generate hidden states for different future timestamps. Finally, the hidden states are transformed into predicted radar images through a convolutional layer. We conduct 0–1 h convective nowcasting experiments on the public MeteoNet dataset. Quantitative evaluations demonstrate the effectiveness of the 3D-UNet extractor, the newly designed forecaster, and their combination. In addition, case studies qualitatively demonstrate that the proposed model has a better spatiotemporal modeling ability for the complex nonlinear processes of convective echoes.


Introduction
Convective nowcasting usually refers to forecasting the evolution trends of convective systems for lead times of up to a few hours, which is significant for protecting lives and property and supporting outdoor activities [1][2][3]. However, it remains challenging due to the suddenness, rapid changes, and inherent uncertainty of convective systems.
In most cases, extrapolation-based forecasts have higher skill for lead times of up to 1-2 h. Spatiotemporal extrapolation techniques use statistical models or data-driven models to extrapolate radar or satellite images into the imminent future. After obtaining extrapolation results, convective nowcasting can be conducted with radar echo reflectivity values ≥ 35 dBZ [4] or cloud-top brightness temperatures below a certain threshold [5], and convective precipitation fields can also be estimated with the Z-R relation [6] and nonlinear mapping algorithms [7,8].
Traditional extrapolation techniques are usually based on statistical models, and most of them follow the framework of Lagrangian persistence, which utilizes the motion field calculated from recent images to extrapolate the latest available image under the assumption that the intensity and motion are constant [9]. These methods can be roughly divided into object-based extrapolation [10][11][12] and region-based extrapolation approaches [9][13][14][15]. Object-based extrapolation first identifies a convective storm cell and then extrapolates its trajectory based on the calculated motion vectors; this technique is mainly suitable for [...]
However, these two types of DNNs still have some limitations. First, it is not easy for standard ConvRNN models to tailor their predictions at different timestamps. One reason is that their sequence-to-sequence (Seq2Seq) structures use the same weights to generate the hidden states of all timestamps. Second, CNN models mainly emphasize spatial features while weakening the temporal variations between the input images, leading to difficulty learning relatively long-range temporal dependencies. Even though a few studies have noticed that 3D convolutions can extract spatiotemporal representations [36,37], they still follow the image-to-image translation paradigm and rarely explicitly model the temporal correlations among the extracted features in the prediction stage.
To leverage the advantages of the UNet and ConvRNN models while avoiding the above limitations, we develop a radar echo extrapolation model called 3D-UNet-LSTM for convective nowcasting, which combines 3D-UNet and a newly designed Seq2Seq network in an extractor-forecaster architecture. We first adopt 3D-UNet as the extractor to extract the spatiotemporal features of the input radar reflectivity images while retaining more detailed information, such as textures. In the forecaster, the Seq2Seq network uses different unstacked ConvLSTM layers to iteratively generate hidden states for different future timestamps. Finally, these hidden states are mapped to predicted images via a convolution layer.
The remainder of this paper is organized as follows. Section 2 describes the data used in this paper, and Section 3 illustrates the proposed model, the loss function, and the evaluation metrics in detail. The experimental results are presented in Section 4. Finally, a summary and discussions are given in Section 5. Appendix A briefly introduces some prior knowledge related to our work.

Data
The radar reflectivity data used in this paper are provided by an open meteorological database named MeteoNet [38], which covers two geographical areas, the northwest zone (NW) and southeast zone (SE) of France in Figure 1, and spans 3 years, 2016 to 2018, with 5-min intervals. The data were collected using the Doppler radar network of METEO FRANCE, and 3D reflectivity maps were obtained by each radar scanning the sky. The radar's spatial resolution is 0.01 degrees, and the projection system used is EPSG:4326.
To build our dataset, we first generate 1.5-h radar image sequences (each sequence has 19 radar images) every 25 min. Next, sequence samples are selected if the total number of pixels with reflectivity values ≥ 35 dBZ in one of their last 12 images exceeds 2000, and a total of 12,503 sequence samples are collected. To reduce the computational and memory cost and maintain adequate spatial resolution, the images in each sequence sample are resized from 565 × 784 to 104 × 160 through bilinear interpolation, with a spatial resolution [...]
In addition, the reflectivity (in dBZ) can be approximated to a rainfall intensity R (mm/h) by using the Marshall-Palmer relation:
$$\mathrm{dBZ} = 10 \log_{10} a + 10\, b \log_{10} R,$$
where a = 200 and b = 1.6.
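As an illustration of the Marshall-Palmer relation with a = 200 and b = 1.6, the following minimal Python sketch converts between reflectivity and rainfall intensity. The function names are hypothetical and chosen here for illustration only; the sketch assumes the usual form Z = a R^b with dBZ = 10 log10(Z).

```python
import math

A = 200.0  # Marshall-Palmer coefficient a (value given in the text)
B = 1.6    # Marshall-Palmer exponent b (value given in the text)

def dbz_to_rain_rate(dbz, a=A, b=B):
    """Invert Z = a * R**b, with dBZ = 10 * log10(Z); returns R in mm/h."""
    z = 10.0 ** (dbz / 10.0)
    return (z / a) ** (1.0 / b)

def rain_rate_to_dbz(rate, a=A, b=B):
    """Forward relation dBZ = 10 * log10(a * R**b), valid for R > 0 mm/h."""
    return 10.0 * math.log10(a * rate ** b)

if __name__ == "__main__":
    for dbz in (18, 35):
        print(f"{dbz} dBZ -> {dbz_to_rain_rate(dbz):.2f} mm/h")
```

Under these assumptions, 18 dBZ maps to roughly 0.5 mm/h, which agrees with the rain/no-rain threshold used in the evaluation.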

Methodology
Consecutive radar images can directly show the evolution of convective systems. In this section, we propose a DNN model called 3D-UNet-LSTM to extrapolate future radar reflectivity images. The locations and intensities of convective systems over a very short term can be foreseen according to the extrapolated results. M consecutive radar images are given to predict the subsequent N radar images. In the implementation, we use the radar images in the past 0.5 h to forecast those in the next 1 h (i.e., M = 7, N = 12). We describe the architecture of 3D-UNet-LSTM in Section 3.1 and introduce the loss function and evaluation metrics in Section 3.2 and Section 3.3, respectively.
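The sample construction described in the Data section and the input/target split with M = 7 and N = 12 can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function names are hypothetical, images are represented as plain nested lists of dBZ values, and it is assumed that the pixel count is taken on whatever image grid is passed in.

```python
M, N = 7, 12          # input and target lengths given in the text
SEQ_LEN = M + N       # 19 images = 1.5 h at 5-min intervals
STRIDE = 5            # a new sequence every 25 min (5 frames)
CONV_DBZ = 35.0       # convective threshold given in the text
MIN_PIXELS = 2000     # required number of convective pixels

def count_convective(image, threshold=CONV_DBZ):
    """Count pixels with reflectivity >= threshold in a 2D list of dBZ values."""
    return sum(1 for row in image for value in row if value >= threshold)

def is_convective_sample(sequence, n_last=N, min_pixels=MIN_PIXELS):
    """Keep a sequence if any of its last n_last images exceeds min_pixels."""
    return any(count_convective(img) > min_pixels for img in sequence[-n_last:])

def build_samples(frames, min_pixels=MIN_PIXELS):
    """Slide a 19-frame window over the frames and split kept windows
    into (inputs, targets) with 7 and 12 images, respectively."""
    samples = []
    for start in range(0, len(frames) - SEQ_LEN + 1, STRIDE):
        window = frames[start:start + SEQ_LEN]
        if is_convective_sample(window, min_pixels=min_pixels):
            samples.append((window[:M], window[M:]))
    return samples
```

Each returned pair holds the past 0.5 h of images as model input and the next 1 h as the prediction target.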

3D-UNet-LSTM
The proposed 3D-UNet-LSTM is an end-to-end trainable model with an extractor-forecaster architecture, as illustrated in Figure 2. In the extractor part, we use 3D-UNet [39] to extract the comprehensive spatiotemporal features of consecutive radar images. It is composed of multiple 3D convolutional layers with kernel sizes of 2 × 3 × 3, each of which is followed by a rectified linear unit (ReLU) activation function. Like UNet, the extractor contains a downsampling path, a symmetrical upsampling path and skip connections. Since skip connections require the temporal and spatial sizes of the features before each downsampling operation to be consistent with those observed after the symmetrical upsampling operation, we add a zero image before the 7 consecutive radar images and stack them along the temporal dimension as the model input. This gives the input a temporal length of 8, which can be halved three times. In the downsampling path, the temporal and spatial sizes of the input sequence are progressively halved by using three 3D convolutional layers with strides of 2, each followed by two 3D convolutional layers, and spatiotemporal features with different representation levels are extracted. In the upsampling path, the high-level features gradually return to the original size via three transposed 3D convolutional layers, each followed by two 3D convolutional layers. Furthermore, low-level features are received from the downsampling path through skip connections, bringing detailed information to the more comprehensive representations. Batch normalization (BN) [40] is used after the last convolutional layer to mitigate the vanishing gradient effect during backward propagation. After that, the comprehensive spatiotemporal features of the radar image sequence are output.
The forecaster part is designed to further exploit the spatiotemporal features extracted by the extractor and output the predicted radar images. In this part, a Seq2Seq network is presented to explicitly model time and extrapolate the hidden states step by step.
ConvLSTM is selected as the basic unit due to its simplicity and effectiveness. The two common Seq2Seq structures in Figure A1 use shared parameters to generate the hidden states for the predictions over all future timestamps, which may limit their ability to adjust to the specific situations encountered at different future timestamps. To alleviate this problem, we utilize N ConvLSTM layers that have different parameters to individually generate the hidden states for future timestamps in an iterative way, as shown in Figure 2. Each ConvLSTM layer has a step length of 8 with a convolutional kernel size of 3 × 3 and 64 hidden state channels, thereby exploiting the long-term spatiotemporal information of the inputs and obtaining a hidden state correlated with a specific future timestamp. The hidden state output by a ConvLSTM layer is concatenated behind the inputs of the last 7 timestamps of that layer, and the result is fed into the next layer to output the hidden state of the next future timestamp. In addition to utilizing different layers to tailor the predictions for different timestamps, the iterative design can ensure that the previous features, whether extracted by the extractor or generated by specific ConvLSTM layers, can be reused multiple times; thus, it is also helpful in improving the quality of long-term forecasts. Finally, the hidden state at each future timestamp is converted to a corresponding radar reflectivity image through a 2D convolutional layer with a kernel size of 1 × 1.
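The iterative data flow of the forecaster can be sketched in a framework-independent way. In the minimal Python sketch below, each element of `layers` is a stand-in callable for one ConvLSTM layer with its own parameters; the names `run_forecaster` and `make_toy_layer` are hypothetical, and the toy layers operate on scalars rather than feature maps purely to make the sliding-window logic visible.

```python
def run_forecaster(features, layers, window_len=8):
    """Iteratively produce one hidden state per future timestamp.

    features: list of per-timestamp features from the extractor (length 8)
    layers:   list of N callables; layers[k] maps a window to a hidden state
    """
    if len(features) != window_len:
        raise ValueError("expected one feature per input timestamp")
    window = list(features)
    hidden_states = []
    for layer in layers:                  # a distinct layer per future timestamp
        hidden = layer(window)            # stand-in for a ConvLSTM run over 8 steps
        hidden_states.append(hidden)
        window = window[1:] + [hidden]    # keep the last 7 inputs, append the new state
    return hidden_states

def make_toy_layer(offset):
    """Toy replacement for a ConvLSTM layer: window mean plus a layer-specific offset."""
    return lambda window: sum(window) / len(window) + offset

if __name__ == "__main__":
    states = run_forecaster([0.0] * 8, [make_toy_layer(k) for k in (1, 2, 3)])
    print(len(states))
```

Because the window always holds 8 entries, every layer sees a sequence of the same step length, and hidden states generated earlier remain available to later layers for up to 7 further steps.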

Loss Function
In many spatiotemporal sequence forecasting tasks, such as video prediction and traffic flow prediction, where the pixel values of images are relatively evenly distributed, the mean absolute error (MAE) and mean squared error (MSE) are used as the loss functions to train DNN models. However, for radar reflectivity images, the proportion of low-intensity pixels is much larger than that of high-intensity pixels [21]. Training the extrapolation model with the original MAE and MSE losses will make it focus on predicting low-intensity pixels (indicating no weather echoes and weak echoes), limiting the forecasting effect in areas with relatively strong echoes associated with hazardous convection. To achieve better forecasting performance for strong echoes, we introduce a balanced reconstruction loss function $L_{B\text{-}rec}$ that assigns greater weights to the errors of higher reflectivity values in the calculation process:
$$L_{B\text{-}rec} = \sum_{n=1}^{N} \sum_{i=1}^{H} \sum_{j=1}^{W} \mathrm{weight}_{t+n,i,j} \left[ \left(I_{t+n,i,j} - \hat{I}_{t+n,i,j}\right)^{2} + \left|I_{t+n,i,j} - \hat{I}_{t+n,i,j}\right| \right],$$
where $I_{t+n,i,j}$ denotes the observed reflectivity value of the (i, j)th pixel of the future image at timestamp t + n, and $\hat{I}_{t+n,i,j}$ denotes the corresponding predicted value. $\mathrm{weight}_{t+n,i,j}$ is the weight assigned to each pixel according to the range of its observed reflectivity. H and W are the height and width of the radar images, respectively. As in previous work [20,21,41], the values of $\mathrm{weight}_{t+n,i,j}$ are determined based on experience. The prediction errors of high reflectivity values are given larger weights than those of low reflectivity values, but the weights differ by a factor of only 2-3. Finally, the weights are determined by experiment. We verify the effectiveness of the balanced reconstruction loss function in Section 4.
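A balanced reconstruction loss of this kind, taken here as a per-pixel weighted sum of the squared and absolute errors, could be sketched as follows. The weight thresholds and values in `pixel_weight` are hypothetical placeholders, since the actual weights are not listed in this section; the sketch also works directly in dBZ, whereas the training pipeline normalizes images to [0, 1].

```python
def pixel_weight(dbz, thresholds=(18.0, 35.0), weights=(1.0, 2.0, 3.0)):
    """Hypothetical step weights: a larger weight for a higher observed reflectivity."""
    level = sum(1 for t in thresholds if dbz >= t)
    return weights[level]

def balanced_reconstruction_loss(observed, predicted, weight_fn=pixel_weight):
    """Weighted sum of squared and absolute errors over N frames of H x W pixels.

    observed, predicted: nested lists indexed as [frame][row][column].
    The weight of each pixel depends on the observed value only.
    """
    total = 0.0
    for obs_img, pred_img in zip(observed, predicted):
        for obs_row, pred_row in zip(obs_img, pred_img):
            for obs, pred in zip(obs_row, pred_row):
                err = obs - pred
                total += weight_fn(obs) * (err * err + abs(err))
    return total
```

With these placeholder weights, an error of 2 dBZ on a 40 dBZ pixel contributes three times as much as the same error on a 10 dBZ pixel, which is the intended emphasis on strong echoes.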

Evaluation Metrics
To quantitatively evaluate the nowcasting performance of extrapolation models, we apply the probability of detection (POD), false-alarm ratio (FAR), bias score (BIAS), critical success index (CSI), root mean square error (RMSE) and correlation coefficient (CC) and design a temporally weighted average CSI (twaCSI) measure. These metrics are computed based on a given threshold τ, representing a corresponding echo intensity level. CSI provides a ratio of correct predictions. For its calculation, the observed image and predicted image are first binarized by a threshold τ. A pixel value greater than τ is set to 1; otherwise, it is set to 0. Then, TP, FN, and FP, which denote the numbers of true positives (prediction = 1, observation = 1), false negatives (prediction = 0, observation = 1) and false positives (prediction = 1, observation = 0), respectively, are obtained. The CSI is computed as
$$\mathrm{CSI} = \frac{TP}{TP + FN + FP}.$$
Furthermore, considering that it becomes more challenging to forecast radar images with increasing lead time, we design twaCSI$_{\tau}$ to evaluate the temporal sequence of predicted radar images. It emphasizes the CSI scores of the images predicted at later timestamps by assigning them heavier weights; this measure is defined as
$$\mathrm{twaCSI}_{\tau} = \frac{\sum_{n=1}^{N} n \cdot \mathrm{CSI}^{\tau}_{t+n}}{\sum_{n=1}^{N} n},$$
where $\mathrm{CSI}^{\tau}_{t+n}$ is the CSI score of the predicted image at timestamp t + n.
POD and FAR emphasize the numbers of missed events and false alarms, respectively. In addition, BIAS indicates the deviation of the predictions: when BIAS > 1, the forecast is stronger than the observation; when BIAS < 1, the forecast is weaker; when BIAS = 1, the forecast deviation is 0, which corresponds to the highest prediction skill. In addition, for each predicted image, we utilize RMSE$_{\tau}$ and CC$_{\tau}$ to present the prediction error and consistency in the area where the observed reflectivities are greater than τ.
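Denoting the sets of observed values larger than τ and the corresponding predicted values as $s$ and $\hat{s}$, respectively, RMSE$_{\tau}$ and CC$_{\tau}$ are calculated as follows:
$$\mathrm{RMSE}_{\tau} = \sqrt{\frac{1}{|s|} \sum_{k=1}^{|s|} \left(s_k - \hat{s}_k\right)^{2}},$$
$$\mathrm{CC}_{\tau} = \frac{\sum_{k=1}^{|s|} \left(s_k - \bar{s}\right)\left(\hat{s}_k - \bar{\hat{s}}\right)}{\sqrt{\sum_{k=1}^{|s|} \left(s_k - \bar{s}\right)^{2}} \sqrt{\sum_{k=1}^{|s|} \left(\hat{s}_k - \bar{\hat{s}}\right)^{2}}},$$
where $|s|$ represents the number of values in set $s$, and $\bar{s}$ and $\bar{\hat{s}}$ denote the mean values of $s$ and $\hat{s}$. Specifically, we select 18 dBZ (0.5 mm/h, indicating rain or not [21]) and 35 dBZ (used to identify strong convections [10]) as the thresholds.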
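The metrics of this section can be illustrated with a short Python sketch operating on flattened images. The function names are hypothetical; POD, FAR and BIAS use their standard contingency-table definitions, and the linear weights n in `twa_csi` are an assumption consistent with assigning heavier weights to later timestamps.

```python
import math

def contingency(obs, pred, tau):
    """Binarize two flat lists with threshold tau and count TP, FN, FP."""
    tp = fn = fp = 0
    for o, p in zip(obs, pred):
        o_hit, p_hit = o > tau, p > tau
        if o_hit and p_hit:
            tp += 1
        elif o_hit:
            fn += 1
        elif p_hit:
            fp += 1
    return tp, fn, fp

def _ratio(num, den):
    """Return num / den, or NaN when the denominator is zero."""
    return num / den if den else float("nan")

def categorical_scores(obs, pred, tau):
    """Standard contingency-table definitions of CSI, POD, FAR and BIAS."""
    tp, fn, fp = contingency(obs, pred, tau)
    return {
        "CSI": _ratio(tp, tp + fn + fp),
        "POD": _ratio(tp, tp + fn),
        "FAR": _ratio(fp, tp + fp),
        "BIAS": _ratio(tp + fp, tp + fn),
    }

def twa_csi(csi_per_step):
    """Weighted mean of per-lead-time CSI with weight n for step n (assumed linear)."""
    weights = range(1, len(csi_per_step) + 1)
    return sum(w * c for w, c in zip(weights, csi_per_step)) / sum(weights)

def rmse_cc(obs, pred, tau):
    """RMSE and Pearson correlation over pixels whose observation exceeds tau."""
    pairs = [(o, p) for o, p in zip(obs, pred) if o > tau]
    if not pairs:
        return float("nan"), float("nan")
    n = len(pairs)
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in pairs) / n)
    mean_o = sum(o for o, _ in pairs) / n
    mean_p = sum(p for _, p in pairs) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in pairs)
    var_o = sum((o - mean_o) ** 2 for o, _ in pairs)
    var_p = sum((p - mean_p) ** 2 for _, p in pairs)
    return rmse, _ratio(cov, math.sqrt(var_o * var_p))
```

In practice, `categorical_scores` would be evaluated per lead time at τ = 18 or 35 dBZ, and the resulting 12 CSI values passed to `twa_csi`.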

Experiments and Results
To evaluate the effectiveness and superiority of the proposed 3D-UNet-LSTM model, extrapolation-based 0-1 h nowcasting experiments are conducted. For comparison, six baseline models and a state-of-the-art model are reimplemented: the Eulerian persistence model (hereafter called Persistence), which assumes that future radar images do not differ from the most recent observed image; a conventional model based on optical flow (Rainymotion [14]); and five deep learning models, comprising three four-layer ConvRNN models (ConvLSTM [17], PredRNN [22], SA-ConvLSTM [26]), a UNet [32] model, and a state-of-the-art model (RainPredRNN [23]). Among those models, ConvLSTM adopts the "same-side" structure, and PredRNN and SA-ConvLSTM apply the "opposite-side" structure.
We first separately train the 3D-UNet-LSTM model and the other deep learning models on the training set and validation set following the settings in Section 4.1 and then compare the performance of Persistence, Rainymotion and the well-trained models on the whole test set in Section 4.2. Then, to verify the effectiveness of the model design, Section 4.3 compares the 3D-UNet-LSTM model with two variations, including 3D-UNet. Next, in Section 4.4, we further investigate the impact of the balanced loss and adversarial loss functions on the performance of DNNs in accurately predicting convective echoes. Finally, two representative cases are studied in Section 4.5.
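For reference, the Persistence baseline amounts to repeating the most recent observed image for every lead time. A minimal Python sketch, with a hypothetical function name and images represented as nested lists, is:

```python
def persistence_forecast(input_frames, n_future=12):
    """Eulerian persistence: repeat the most recent observed frame n_future times."""
    if not input_frames:
        raise ValueError("need at least one observed frame")
    last = input_frames[-1]
    # Copy each row so that the forecasts do not alias the input frame.
    return [[row[:] for row in last] for _ in range(n_future)]
```

Any skill a learned model shows over this baseline reflects its ability to represent motion and intensity change rather than mere continuity.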

Implementation Details for Training
The radar reflectivity images are first normalized to [0, 1] and then fed into the DNN models. For a fair comparison, all models are trained with the balanced reconstruction loss function on the training set via the adaptive moment estimation (ADAM) optimizer [42] with an initial learning rate of $10^{-4}$. The batch size of each training iteration is set to 4. To prevent overfitting, the training process is stopped if the twaCSI$_{35}$ obtained on the validation set does not improve for 20 epochs. All experiments are implemented in TensorFlow [43] and executed on a TITAN RTX GPU (24 GB).
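The early-stopping rule described above can be sketched independently of the training framework. In the following minimal Python sketch, `train_one_epoch` and `validate` are hypothetical callables supplied by the user (the latter returning the validation score to maximize, for example twaCSI at 35 dBZ), and `max_epochs` is an assumed safety limit not stated in the text.

```python
def train_with_early_stopping(train_one_epoch, validate, patience=20, max_epochs=1000):
    """Stop when the validation score has not improved for `patience` epochs.

    Returns the index of the best epoch and its validation score.
    """
    best_score = float("-inf")
    best_epoch = -1
    for epoch in range(max_epochs):
        train_one_epoch(epoch)
        score = validate(epoch)           # higher is better
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break                         # no improvement for `patience` epochs
    return best_epoch, best_score
```

A real training loop would additionally save the model weights whenever the best score is updated, so that the best checkpoint can be restored after stopping.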

Quantitative Evaluation of Eight Models on the Test Set
We quantitatively evaluate the overall 0-1 h nowcasting performance of the proposed 3D-UNet-LSTM model, RainPredRNN and six baseline models with the CSI, twaCSI, CC and RMSE scores (averaged over all 1137 samples) obtained on the test set. The twaCSI results and the mean CSI, CC and RMSE values obtained for all lead times at thresholds of 18 and 35 dBZ are tabulated in Table 2. Persistence has the poorest scores for all metrics. The optical flow-based Rainymotion approach clearly performs better than Persistence with the help of the calculated motion field. The six well-trained DNN models significantly outperform the above two traditional models, which demonstrates the powerful modeling capability of deep learning. Among the ConvRNN models, although PredRNN achieves the same performance as ConvLSTM in terms of the CSI and twaCSI, it obtains higher CC and lower RMSE scores at both thresholds, indicating that the nowcasting values of PredRNN are more precise and more closely aligned with the ground truth than those of ConvLSTM. RainPredRNN performs better than PredRNN with the help of the ST-LSTM unit and appropriate hyperparameter settings. Another model, SA-ConvLSTM, obtains CSI$_{18}$ and twaCSI$_{18}$ scores similar to those of ConvLSTM, PredRNN and RainPredRNN. Yet, it is superior to them when the threshold is set to 35 dBZ, particularly for twaCSI$_{35}$, implying that SA-ConvLSTM has a better nowcasting performance at longer lead times for echoes with high-intensity levels. The UNet model, which does not have a special design for time series modeling, obtains even better scores for all metrics than the above three advanced ConvRNN models at the thresholds of 18 dBZ and 35 dBZ, which is noteworthy, as it shows the high potential of the UNet architecture for extrapolation-based convective nowcasting. The proposed 3D-UNet-LSTM model yields the best nowcasting scores among the eight models, which verifies its superiority.
Greater improvements in the CSI and twaCSI are achieved at the 35 dBZ threshold than at the 18 dBZ threshold because we focus more on improving the prediction accuracy for convective echoes, especially at longer lead times. In addition, the best CC and RMSE scores obtained at both thresholds indicate that the predicted radar reflectivities of 3D-UNet-LSTM are more precise and thus better for estimating future rainfall intensities. In Table 2, the best and second-best scores are marked in bold and underlined, respectively; ↑ means that higher is better, while ↓ means that lower is better.
The POD, FAR and BIAS values obtained for all lead times at thresholds of 18 and 35 dBZ are tabulated in Table 3. For the forecasting of medium and strong echoes, the BIAS score of our proposed model is greater than 1, meaning that the model tends to over-forecast overall. The reason is that the model is designed to focus more on strong echoes. The model has the best POD and FAR scores at the 35 dBZ threshold (strong echoes). In Table 3, the best and second-best scores are marked in bold and underlined, respectively; ↑ means that higher is better, while ↓ means that lower is better.
Beyond that, to directly show the convective nowcasting performance over time, the CSI, CC and RMSE curves produced by the eight models at the 35 dBZ threshold against different nowcasting lead times up to 60 min are plotted in Figure 3. The results show that the performance of all extrapolation models deteriorates with increasing lead times, which can be expected and mainly results from unavoidable error accumulation and increasing uncertainty in the forecasting process. RainPredRNN and PredRNN obtain similar performance on all metrics over time. In addition, we notice that although UNet achieves a better overall performance in terms of mean CSI$_{35}$ and RMSE$_{35}$ in Table 2 than the three ConvRNN models and RainPredRNN, this is largely due to the contribution of its better scores for lead times between 5 and 30 min. Later, the performance of UNet gradually becomes comparable to that of SA-ConvLSTM and is finally exceeded by that approach for lead times beyond approximately 45 min. One reason for this phenomenon presumably is that UNet focuses on maintaining or changing spatial appearances for radar images but fails to capture the internal temporal dependencies; this appears to affect its long-term prediction effectiveness.
In contrast, the proposed 3D-UNet-LSTM produces the best CSI$_{35}$ value for any lead time within one hour and achieves a score of more than 0.25 for 60-min nowcasts, while those of other deep learning models are in the range of 0.21 to 0.23. The same is true for RMSE$_{35}$; the proposed model remains competitive over the whole period, and its superiority becomes increasingly obvious at lead times after 30 min. For 60-min nowcasts, it reduces the average error by almost 2 dBZ compared with UNet. In terms of CC$_{35}$, the prediction results of the proposed model exhibit consistency with the observation values, especially at shorter lead times. Although its performance drops sharply as the lead time increases, our model still achieves the highest CC$_{35}$ scores compared to other models. In general, 3D-UNet-LSTM has better early performance than UNet and consistently outperforms SA-ConvLSTM at long lead times, demonstrating its effective spatiotemporal modeling ability and better overall performance for convective nowcasting.

Evaluation of the Model Design
To evaluate the effectiveness of the 3D-UNet-LSTM model design, we first design two variations of the model, one that removes the forecaster and retains the 3D-UNet extractor only and another that replaces the forecaster with a two-layer ConvLSTM network (this variation model is referred to as '3D-UNet + ConvLSTM'). Then, the overall performance of the original ConvLSTM, UNet, 3D-UNet-LSTM and these two variations is compared, as shown in Table 4. When only the 3D-UNet extractor is retained, it still outperforms ConvLSTM and UNet in terms of the metrics at the 35 dBZ threshold, indicating that the 3D-UNet extractor has good potential for convective nowcasting. However, as we attempt to use a common ConvLSTM network to further leverage the features extracted by 3D-UNet and generate future hidden states according to the shared parameters, the nowcasting performance decreases considerably, becoming even worse than that of the original ConvLSTM. In contrast, when utilizing our designed forecaster to produce future hidden states with different parameters, the model obtains better scores than those of 3D-UNet, demonstrating the effectiveness of the forecaster. In Table 4, the best and second-best scores are marked in bold and underlined, respectively; ↑ means that higher is better, while ↓ means that lower is better.
We also draw the CSI$_{35}$, CC$_{35}$ and RMSE$_{35}$ curves of these methods for different lead times in Figure 4. It can be seen that by combining 3D-UNet and the forecaster, our model has better performance than the other approaches for nearly all lead times. The superiority of its design is more obvious for longer lead times.

Evaluation of Different Loss Functions
In the following, we train the 3D-UNet-LSTM model with different loss functions and test their effects on the prediction accuracy for convective echo regions. These loss functions are the reconstruction loss (the sum of the MAE and MSE) widely used in video prediction tasks [22,26], the sum of the reconstruction loss and adversarial loss, which has been applied to address the blurring problem for echo prediction [24], the balanced reconstruction loss [21] applied in this paper, and the sum of the balanced reconstruction loss and adversarial loss [37,44]. The scaling factor of the adversarial loss is set to 0.03 to ensure that it can exert a certain degree of influence on the model training process. When the scaling factor is set to 0.003, its influence is quite slight. The results are shown in Table 5. We can see that without using any weights for reflectivities, the reconstruction loss slightly improves the CSI$_{18}$ and twaCSI$_{18}$ scores but yields much poorer performance than that of the balanced loss functions in terms of other metrics, especially CSI$_{35}$ and twaCSI$_{35}$. As we add an adversarial term to the reconstruction loss, these gaps are slightly narrowed. Regarding the balanced loss functions, the balanced reconstruction loss applied in this paper obtains the best scores for all evaluation metrics at the 35 dBZ threshold. In Table 5, the best and second-best scores are marked in bold and underlined, respectively; ↑ means that higher is better, while ↓ means that lower is better.
Regarding its combination with an adversarial loss, the convective nowcasting performance deteriorates with increasing scaling factors for the adversarial term. It can be concluded that compared with the original reconstruction loss, the balanced loss can significantly improve the convective nowcasting performance of a deep learning model. It seems that adding an adversarial loss to the reconstruction loss can slightly improve the prediction accuracy for convective echoes. However, for the balanced reconstruction loss, adding an adversarial loss term is of no help for further increasing the prediction precision.

Representative Case Study
To qualitatively evaluate the performance of the proposed model, we select two representative cases from the test set and visually examine the nowcasts produced by different models. The images of two cases, including radar observations and nowcasts, are presented in Figure 5 and Figure 6, respectively, and are displayed every 15 min to show the evolutions of convective systems.
Figure 5 shows a representative case of local strong convective growth over northwest France at a forecasting time of T = 7 August 2018, 11:55 UTC. In the input radar images, it can be seen that an isolated convective cell is located in the west at time T − 30 min, moving northeast together with other dispersed echoes, and the formation of a new strong small-scale convective cell occurs in Region B at forecasting time T. For the ground-truth observations in the next hour, the echoes continue to move in the northeast direction, and during this period, the new convective cell gradually grows and appears to merge with the older cell. Comparing the nowcasting results of each model with the ground truth, one can observe that all models can capture the movements of most echoes. However, the optical flow-based Rainymotion method simply advects the radar echoes. It fails to forecast the subsequent growth and evolution of the newly formed convective cell because it cannot completely model nonlinear processes. In contrast, all deep learning models successfully forecast that the newly formed convective cell will grow at time T + 30 min but underestimate its intensity. This under-forecasting problem, also called blurry prediction, is common when utilizing deterministic deep learning models for radar echo extrapolation, especially with longer lead times; this is mainly because a DNN model tends to average all probable outcomes to a blurry prediction in a case in which it has difficulty dealing with future uncertainty [45].
Nonetheless, the 30-min nowcast obtained by the 3D-UNet-LSTM model is closer to the ground truth in terms of the horizontal extent of the convection than those derived from other models. For the 60-min nowcasts, the forecasted intensities of the old convective cell in the results of other deep learning models deviate considerably from the ground truth, while the 3D-UNet-LSTM model and 3D-UNet model can maintain their intensity values at relatively high levels (≥ 40 dBZ). It is noted that only the 3D-UNet-LSTM model forecasts a further growth trend in the size of the newly formed convective cell from time T + 30 min to T + 60 min, and its 60-min nowcasting result also successfully depicts the merging phenomenon of the two isolated convective echoes that occur in regions A and B one hour later.
Another representative case is shown in Figure 6, which describes the evolution of a severe squall line that occurs in southeast France at a forecasting time of T = 13 August 2018, 05:00 UTC. It is clear from the radar observations that a squall line is moving eastward while the convective area behind it gradually becomes larger, and it finally develops into a bow echo at time T + 60 min. As in the first case, all models provide relatively accurate moving directions for the quasi-linear convective system. The 30-min nowcasts obtained from all models, especially UNet, achieve good agreement with the radar observations, presumably because the system evolves relatively slowly during the first half hour after forecasting time T. However, for the 60-min nowcasts, it is difficult for the optical flow-based Rainymotion method to predict the subsequent convective evolution. Although the deep learning models successfully forecast that the convective area will expand in the future, significant differences remain between their 60-min nowcast performances.
For example, one can observe that the three ConvRNN models give the misleading impression that high-impact meteorological hazards (reflectivity ≥ 40 dBZ) tend to decrease. Although UNet and 3D-UNet effectively preserve their intensities, neither they nor the ConvRNN models can forecast the bow echo structure at time T + 60 min. It is noted that the proposed 3D-UNet-LSTM yields a more trustworthy 60-min nowcast in Region A than the other models, with a realistic bow echo structure (the region with reflectivity ≥ 40 dBZ in Figure 6) and a reasonable intensity distribution. A bow echo is bowed toward the direction of movement, and the reflectivity behind the bow is generally weaker. Only the nowcasting results of the proposed approach depict the squall line-to-bow echo transition clearly, indicating that 3D-UNet-LSTM has a better spatiotemporal modeling ability for the complex nonlinear processes of convective echoes.

Conclusions
In this paper, we propose a novel deep learning model called 3D-UNet-LSTM to precisely extrapolate radar reflectivity images for convective nowcasting. This model combines a well-known CNN named 3D-UNet and a newly designed Seq2Seq network in an extractor-forecaster architecture. We first apply 3D-UNet as the extractor to extract the comprehensive spatiotemporal representations of input radar images. Then, in the forecaster, the extracted features are further leveraged by the Seq2Seq network to individually generate hidden states for different future timestamps with different ConvLSTM layers. These hidden states are finally transformed into predicted radar images by a convolutional layer.
We conduct comparative experimental studies on a test set. The quantitative evaluation results show that 3D-UNet-LSTM outperforms conventional methods and state-of-the-art deep learning models regarding the prediction of convective echoes, particularly with long lead times. In addition, the evaluation of the model design demonstrates the effectiveness of the 3D-UNet extractor and the newly designed forecaster, as well as their combination. It is noteworthy that UNet-based models, especially 3D-UNet, achieve comparable or even superior performance to that of some ConvRNN-based models. We also verify the effectiveness of the utilized balanced loss function on the model performance for precisely forecasting strong echoes. Finally, representative case studies qualitatively illustrate that the 3D-UNet-LSTM model can better model the nonlinear processes of the evolutions of convective echoes and produce more reasonable and location-accurate nowcasts.
Although the quantitative and qualitative comparison and analysis verify the superiority and effectiveness of 3D-UNet-LSTM for extrapolation-based convective nowcasting, some limitations remain. We think these should be noted and discussed. First, like other deep learning models, the proposed model has difficulty forecasting convective initiation, which is still challenging for the meteorological community. One main reason is that the input reflectivity images cannot provide a DNN with sufficient early signals and characteristics of convective initiation. Therefore, adding relevant radar variables to supplement input reflectivities may be a promising direction. Second, the loss function has much room for improvement, and introducing an additional classification network and an effective classification loss seems to be a good solution. Third, we are currently working on only one benchmark dataset and will try to conduct studies using different benchmark data. In future work, we will carry out research on these three aspects.
Data Availability Statement: MeteoNet data [38] are available at https://meteonet.umr-cnrm.fr/ (accessed on 6 April 2022).
Figure A1. Two commonly used Seq2Seq structures for RNN-based radar echo extrapolation (choosing ConvLSTM as the basic unit). (a) The "same-side" structure; (b) The "opposite-side" structure.

Appendix A.3. Adversarial Loss Function
A GAN [46] is a kind of architecture that is mostly used for image synthesis. A regular GAN-based architecture consists of a generator and a discriminator. The generator outputs images, and the discriminator is trained to distinguish whether its input is produced by the generator or derived from the training dataset (binary classification). At the same time, when training the generator with an adversarial loss function to fool the discriminator, the quality of its output images is improved.
In recent years, some studies have treated the extrapolation model as the generator and trained it in a GAN-based architecture with suitably designed adversarial loss functions to improve the textures of predicted images [19,24,44,47,48]. In that context, a simple yet effective adversarial loss function [48] can be defined as: [...] where $L^{g}_{adv}$ and $L^{d}_{adv}$ denote the loss functions of the generator G and discriminator D, respectively. The generator G takes radar images x as input and generates predicted images G(x), intended to have the same echo distribution as y, the training (ground-truth) data. D(·) is the output of the discriminator D. {} represents the concatenation operation.
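To make the roles of the two loss terms concrete, the following minimal Python sketch uses the standard binary cross-entropy GAN formulation. This is an assumption made for illustration and not necessarily the exact form used in [48]; the function names are hypothetical, `d_real` stands for D({x, y}), and `d_fake` stands for D({x, G(x)}), both taken to be probabilities in (0, 1). The default scaling factor of 0.03 is the value used in the loss-function experiments.

```python
import math

EPS = 1e-7  # numerical guard so that log() never receives 0

def bce(prob, target):
    """Binary cross-entropy between a predicted probability and a 0/1 target."""
    p = min(max(prob, EPS), 1.0 - EPS)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def discriminator_loss(d_real, d_fake):
    """Discriminator is trained to score real pairs as 1 and generated pairs as 0."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_adv_loss(d_fake):
    """Generator is trained so that the discriminator scores generated pairs as real."""
    return bce(d_fake, 1.0)

def generator_total_loss(rec_loss, d_fake, scale=0.03):
    """Reconstruction term plus the scaled adversarial term."""
    return rec_loss + scale * generator_adv_loss(d_fake)
```

With a small scaling factor, the reconstruction term dominates the generator objective, and the adversarial term acts only as a mild texture regularizer.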