A Long Short-Term Memory Model with Multi-Scale Context Fusion and Attention for Radar Echo Extrapolation

Abstract: Precipitation nowcasting is critical for areas such as agriculture, water resource management, urban drainage systems, transport and disaster preparedness. In recent years, deep learning methods such as convolutional recurrent neural networks (ConvRNN) have been applied to this task. Although they improve forecasting quality, the predicted images are still blurred and distorted, and high echo regions remain difficult to forecast. To address these problems, this article presents a spatio-temporal long short-term memory network based on multi-scale context fusion and attention mechanisms. The multi-scale context fusion module extracts short-term context information from radar images at different scales. The attention module broadens the temporal receptive field of the prediction unit so that the model perceives more historical dynamics. Experiments on weather radar data from the Hong Kong region show that the proposed network achieves better prediction performance than the compared models, improving both image quality and meteorological evaluation metrics, with higher accuracy and more detail.


Introduction
Precipitation nowcasting has always been an important task in meteorological forecasting; it usually refers to the prediction of short-term (usually 0-2 h) rainfall over a given area [1]. Accurate nowcasting supports preventive operations (e.g., weather guidance for agriculture and navigation), especially in severe weather such as heavy rainfall and thunderstorms, thereby reducing human casualties and property damage. Therefore, how to use radar echo extrapolation to obtain accurate and fast short-term weather predictions has become a hot issue in meteorological research.
Precipitation forecasting can be seen as a spatio-temporal sequence prediction problem. The forecast radar map is converted into rainfall intensity through the Z-R relationship [2] to produce the nowcast. The main traditional radar echo extrapolation methods are cross-correlation [3,4], single-cell centroid tracking [5,6] and optical flow [7,8]. The cross-correlation method divides the entire data area into several small regions, calculates the correlation coefficient between small regions of radar echo images at adjacent times, matches regions across the adjacent images by the maximum correlation coefficient, and then determines the average motion of each echo region. However, tracking failures increase significantly in severe convective weather. The centroid method identifies, analyzes and tracks thunderstorms as three-dimensional cells and fits an extrapolation of their motion to produce nowcasts. Its accuracy drops sharply when the radar echoes are fragmented or when cells merge or split. The optical flow method obtains the motion vector field of the radar echo by calculating the optical flow field of the echo and extrapolates the radar echo along this motion vector field. However, it accumulates errors in the two separate steps of estimating the optical flow vectors and extrapolating. Therefore, traditional methods often cannot model the spatio-temporal relationship well or produce satisfactory forecasts.
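The Z-R conversion mentioned above can be illustrated with a minimal sketch. It assumes the usual power-law form Z = a·R^b, with reflectivity expressed in dBZ as 10·log10(Z). The default coefficients a = 200 and b = 1.6 are the classic Marshall-Palmer values and serve only as placeholders; the coefficients actually used for the Hong Kong radar data are not stated in this text. The function names are hypothetical.

```python
import math


def dbz_to_rain_rate(dbz, a=200.0, b=1.6):
    """Convert reflectivity in dBZ to rain rate in mm/h via Z = a * R**b.

    a and b are placeholder Marshall-Palmer coefficients (an assumption),
    not values taken from the paper.
    """
    z_linear = 10.0 ** (dbz / 10.0)      # dBZ -> linear reflectivity factor Z
    return (z_linear / a) ** (1.0 / b)   # invert the power law for R


def rain_rate_to_dbz(rate, a=200.0, b=1.6):
    """Convert rain rate in mm/h (must be > 0) back to reflectivity in dBZ."""
    return 10.0 * math.log10(a * rate ** b)
```

With these placeholder coefficients, a rain rate of 1 mm/h corresponds to roughly 23 dBZ, and the two functions are inverses of each other.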
In recent years, deep learning has become the most rapidly developing technique in machine learning, and in response to the problems of traditional methods, a growing number of researchers use deep learning to address video prediction [9][10][11], traffic flow prediction [12][13][14], precipitation nowcasting [15][16][17][18][19][20][21][22] and other spatio-temporal sequence prediction problems. Deep learning methods can handle complex spatio-temporal relationships and adaptively learn the patterns of rainfall variability from a large number of historical radar echo sequences. For precipitation nowcasting, the Google DeepMind team [23] designed a deep generative model, DGMR, based on conditional generative adversarial networks, to accurately predict echo motions and precipitation while generating clear forecasts. For medium-term weather forecasting on a global scale, the Nvidia team [24] designed a high-resolution weather model, FourCastNet, that uses adaptive Fourier neural operators to generate global data-driven forecasts of key atmospheric variables at a 0.25° resolution. More deep learning models for global weather have since been proposed [25][26][27][28]. For localized precipitation forecasting, deep-learning-based extrapolation of radar echoes is more advantageous. For example, Shi et al. [29] proposed the convolutional LSTM (ConvLSTM) model, which combines convolutional neural networks (CNN) and long short-term memory (LSTM) networks, for precipitation prediction in Hong Kong. The LSTM extracts temporal dynamics and stores them in temporal memory units, while the CNN extracts spatial information, so the network can learn and model spatio-temporal information better. Considering that ConvLSTM focuses only on temporal information and ignores the spatial information passed between layers, Wang et al. [30] proposed the ST-LSTM (Spatiotemporal LSTM) unit, which adds a parallel spatial memory unit to ConvLSTM to preserve the spatial features of each layer, and applied it in the end-to-end model PredRNN. Wang et al. [31] further constructed the Causal-LSTM unit by cascading the dual memory units and added the Gradient Highway Unit (GHU), which alleviates the vanishing gradient problem, forming a new end-to-end model, PredRNN++. Wang et al. [32] proposed Eidetic 3D LSTM (E3D-LSTM), which integrates 3D convolution into the RNN so that the memory unit stores better short-term features. For long-term relationships, the current memory state interacts with its historical records through a gated self-attention mechanism. However, the integrated 3D convolution makes the computational load of E3D-LSTM very high. Wang et al. [33] proposed the Memory In Memory (MIM) network, which can capture the non-stationary and approximately stationary features in radar echo images. Additionally, several variants based on ConvLSTM and PredRNN have emerged, such as PredRANN [34], SAST-LSTM [35] and PrecipLSTM [36].
Despite the significant improvements brought by the above methods, these networks still have two shortcomings. First, the prediction unit does not fully consider the contextual correlation between the previous output and the current input, so the input and hidden states cannot assist each other in identifying and preserving important information. Therefore, as the depth of the model increases, the correlation between the contexts gradually decreases and short-term correlation information is lost. Second, as the prediction time increases, the information stored in the memory unit gradually decays, i.e., it is difficult for the memory unit at the current moment to effectively recall the memory stored at previous moments. In the radar echo extrapolation task, these problems cause the predicted images to blur gradually as the prediction time increases, and radar echo areas with high reflectivity tend to disappear, which greatly reduces the prediction accuracy.
To enhance the level of detail in predicted images, improve the forecasting of high echo regions and achieve accurate precipitation forecasts over extended time periods, this paper proposes a spatio-temporal LSTM model with multi-scale context fusion and an attention mechanism (MCA-LSTM). First, a multi-scale context fusion module is proposed to effectively extract multi-scale spatio-temporal information from images and improve contextual relevance. Then, an attention module is proposed that widens the temporal receptive field of the network so that the model perceives more temporal information. Integrating these two modules into the network unit significantly improves performance, especially in areas with heavy rainfall.
The contributions of this paper can be summarized as follows:
1. We propose a multi-scale context fusion module that extracts multi-scale features efficiently and strengthens the correlation between contexts. It reduces the blurring of predicted images and enhances their details.
2. We propose an attention module that mitigates the forgetting problem of the prediction unit during information transmission. Better modeling of long-term temporal dependence improves the prediction of high echo regions.
3. Combining the above two modules, we construct MCA-LSTM. Experiments show that MCA-LSTM achieves state-of-the-art results on long-term prediction tasks.
The rest of the article is organized as follows: Section 2 describes the standard Moving MNIST dataset and the real radar echo dataset used in the experiments. Section 3 presents the details of the proposed method. Section 4 reports the experiments and compares the results on both datasets. Section 5 draws conclusions and provides an outlook for future research.

Moving MNIST Dataset
The Moving MNIST dataset is the most widely used dataset in spatio-temporal sequence prediction. In it, several digits move randomly within a limited range and exhibit several motion modes, including rotation, rescaling and illumination changes. Every 20 consecutive frames form a sequence, with 10 frames for input and 10 frames for prediction, and each frame is 64 × 64 pixels. The training set consists of 10,000 sequences, the validation set of 2000 sequences and the test set of 3000 sequences.

Radar Dataset
This paper uses the well-known public radar dataset HKO-7, provided by the Hong Kong Observatory (HKO), to evaluate model performance. The dataset contains Hong Kong weather radar data from 2009 to 2015, stored as gray-scale maps. Data are recorded every 6 min, resulting in 240 frames per day. Each frame (i.e., single image) is 480 × 480 pixels, taken at an altitude of 2 km and covering a 512 km × 512 km area centered on Hong Kong. To make the data more suitable for model training and testing, the radar images were rescaled to 256 × 256 by bilinear interpolation.
As precipitation does not occur every day and radar echo images without precipitation are not useful for training the network, the data without precipitation are filtered out before the training and test sets are divided. After filtering, over 20,000 radar images were retained as the training set and more than 5000 radar images as the test set. In this paper, 30 radar images at 6 min intervals form one sequence sample. In each sample, the first 10 radar echo images are used as input and the last 20 as the prediction target. Therefore, a two-hour extrapolation is predicted from the observations of the past hour.
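The sampling scheme described above (30 frames per sample, 10 for input and 20 as targets, at 6 min per frame) can be sketched as follows. This is only an illustration: the function name and the `stride` parameter are hypothetical, and the default of non-overlapping windows is an assumption, since the text does not say whether consecutive samples overlap.

```python
MINUTES_PER_FRAME = 6  # frame interval stated in the text


def split_sequences(frames, input_len=10, output_len=20, stride=None):
    """Cut a time-ordered list of frames into (input, target) samples.

    stride=None means non-overlapping windows (an assumption; the paper
    does not state the stride it used).
    """
    total = input_len + output_len
    step = total if stride is None else stride
    if step <= 0:
        raise ValueError("stride must be positive")
    samples = []
    for start in range(0, len(frames) - total + 1, step):
        window = frames[start:start + total]
        # First input_len frames are observations, the rest are targets.
        samples.append((window[:input_len], window[input_len:]))
    return samples


# Lead time covered by the targets: 20 frames * 6 min = 120 min.
LEAD_TIME_MINUTES = 20 * MINUTES_PER_FRAME
```

With the default lengths, each sample covers 60 min of observations followed by 120 min of targets, matching the one-hour input and two-hour extrapolation described in the text.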

Algorithm Description
This section describes the details of the MCA-LSTM model. First, the multi-scale context fusion module is introduced; then the attention module is elaborated, followed by a description of how both modules are embedded into the ST-LSTM unit. Finally, the overall extrapolation structure of the proposed MCA-LSTM model is presented.

Context Fusion Module
LSTM-based models (e.g., ConvLSTM and PredRNN) use a gating structure consisting of input gates, forget gates, input modulation gates and output gates, which learn new input features from the current input and previous features from the previous hidden state. The current input and the previous hidden state are related not only sequentially in time but also as low-level and high-level features in space. This close connection between contexts is therefore crucial to the accuracy of the prediction results. However, in existing networks the previous hidden state and the current input interact only through separate convolutional layers followed by addition. As the depth of the model increases, the contextual relationship between the current input and the previous hidden state gradually weakens, which leads to the loss of short-term correlation information and makes the prediction results inaccurate. Therefore, this paper proposes a multi-scale context fusion module for extracting multi-scale features and improving contextual relevance, as shown in Figure 1.
First, spatio-temporal information at different scales of the context is extracted by a multi-scale module, as shown in Equation (1), where "$*$" denotes two-dimensional convolution, "$W$" denotes a weight matrix and "Concat" denotes channel concatenation. The inputs of the module are $X_t$ and $H^l_{t-1}$. Each input is convolved with kernels of size 1 × 1, 3 × 3 and 5 × 5 so that detailed features are extracted at different scales. The three outputs for each input are concatenated along the channel dimension, and a further convolution restores the original number of channels, yielding the current input $X_t'$ and the previous hidden state $H^{l\prime}_{t-1}$, both enriched with multi-scale feature information.
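Equation (1) is not reproduced in this text. One reading that is consistent with the description of the multi-scale step is given below as a hedged reconstruction; the weight subscripts and superscripts are notation introduced here for illustration, and the published equation may differ in detail.

```latex
\begin{aligned}
X_t' &= W_{x} * \mathrm{Concat}\left(W^{x}_{1\times 1} * X_t,\; W^{x}_{3\times 3} * X_t,\; W^{x}_{5\times 5} * X_t\right) \\
H^{l\prime}_{t-1} &= W_{h} * \mathrm{Concat}\left(W^{h}_{1\times 1} * H^{l}_{t-1},\; W^{h}_{3\times 3} * H^{l}_{t-1},\; W^{h}_{5\times 5} * H^{l}_{t-1}\right)
\end{aligned}
```

Here the inner convolutions are the three parallel branches, and the outer convolutions $W_x$ and $W_h$ stand for the channel-restoring convolution applied after concatenation.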
Then, the current input $X_t'$ and the previous hidden state $H^{l\prime}_{t-1}$ are fused. To control the fusion rate of the information, two fusion gates are computed as shown in Equation (2), where $G_x$ denotes the fusion gate of the current moment, $G_h$ denotes the fusion gate of the previous moment and "$\sigma$" denotes the sigmoid function. The fusion is then performed by the two gates as shown in Equation (3), where "$\odot$" denotes the Hadamard product. In summary, finer multi-scale spatio-temporal features are extracted by convolving the contextual information with kernels of different sizes, and the fusion gates that control the context fusion process improve the contextual relevance of the current input and the previous hidden state. Therefore, this module can effectively mitigate the weakening of contextual relevance as the prediction time increases. Meanwhile, the multi-scale feature extraction improves the detail and clarity of the prediction results.
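Equations (2) and (3) are not reproduced in this text, so the following is only a sketch of how sigmoid gates can control a fusion rate. It assumes that each gate forms an element-wise convex combination of the two multi-scale feature maps; that specific form, the function names and the use of precomputed gate logits in place of convolutions are all assumptions made for illustration, not the paper's definition.

```python
import math


def sigmoid(z):
    """Logistic function, mapping a logit to a gate value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(x_feat, h_feat, gate_logits_x, gate_logits_h):
    """Fuse two flattened feature maps with two element-wise gates.

    Assumed form (hypothetical, for illustration only):
        x_hat = G_x * x + (1 - G_x) * h
        h_hat = G_h * h + (1 - G_h) * x
    The gate logits stand in for the outputs of the gate convolutions.
    """
    g_x = [sigmoid(z) for z in gate_logits_x]
    g_h = [sigmoid(z) for z in gate_logits_h]
    x_hat = [g * x + (1.0 - g) * h for g, x, h in zip(g_x, x_feat, h_feat)]
    h_hat = [g * h + (1.0 - g) * x for g, x, h in zip(g_h, x_feat, h_feat)]
    return x_hat, h_hat
```

Under this assumed form, a gate value near 1 keeps a feature map almost unchanged, while a value of 0.5 averages the two contexts, which is one way to read "controlling the fusion rate".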

Attention Module
In this paper, an attention module is proposed, as shown in Figure 2, to further improve the model's ability to capture long-term dependencies and to reduce information loss. This module computes attention scores from the correlation between the current spatial state $M^{l-1}_t$ and the historical spatial states $M^{l-1}_{t-\tau:t-1}$. The historical temporal states $C^l_{t-\tau:t-1}$ are weighted by these attention scores and aggregated into the long-term memory unit $C_{att}$. As a result, the prediction unit can perceive more temporal information from a wider receptive field. Then, the long-term memory unit $C_{att}$ and the short-term memory unit $C^l_{t-1}$ are further fused into the final enhanced memory unit $C_{ATT}$.
To achieve this, the attention score between the spatial state at the current time and the spatial states at historical times is first calculated, as shown in Equation (4), where "·" denotes the dot product operation and $\beta_i$ denotes the correlation coefficient. The dot product of the current spatial state $M^{l-1}_t$ with each historical spatial state in $M^{l-1}_{t-\tau:t-1}$ is computed separately for the historical time steps. The results are then normalized with the Softmax function to give the attention scores $\mathrm{Score}_i$.
To aggregate the multi-step historical temporal information in the time domain, the attention scores $\mathrm{Score}_i$ are applied to the corresponding temporal memory units, which are then fused by summation, as shown in Equation (5), where "$C$" denotes the temporal memory unit in the prediction unit. Because the attention scores are derived from the correlation between the current and historical spatial states, they selectively retain the information of the historical temporal memory units. $C_{att}$ can be regarded as temporal attention information that represents a long-term motion trend.
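The two steps just described (dot-product correlation followed by Softmax, then a score-weighted sum of the historical temporal memories) can be sketched in plain Python. The sketch treats each state as a flattened list of floats and assumes one scalar score per historical step; the function names are hypothetical, and any scaling or reshaping used in the actual Equations (4) and (5) is not recoverable from this text.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equally sized flattened states."""
    return sum(x * y for x, y in zip(a, b))


def attend(m_current, m_history, c_history):
    """Return (scores, c_att) for tau historical steps.

    m_current: current spatial state, flattened.
    m_history: list of tau historical spatial states.
    c_history: list of tau historical temporal memories.
    """
    if len(m_history) != len(c_history):
        raise ValueError("history lists must have the same length")
    # Correlation coefficients beta_i between current and past spatial states.
    betas = [dot(m_current, m_past) for m_past in m_history]
    scores = softmax(betas)
    # Long-term memory: score-weighted sum of the past temporal memories.
    size = len(c_history[0])
    c_att = [sum(s * c[k] for s, c in zip(scores, c_history)) for k in range(size)]
    return scores, c_att
```

Historical steps whose spatial state resembles the current one receive larger scores, so their temporal memories dominate the aggregated long-term memory.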
To effectively aggregate the long-term motion trend information $C_{att}$ and the short-term motion information $C^l_{t-1}$, the fusion rate between the two is controlled by a temporal fusion gate $G_f$, as shown in Equation (6). The final enhanced motion information $C_{ATT}$ is obtained by using $G_f$ to control the proportion of short-term motion state information retained and $(1 - G_f)$ to control the proportion of long-term motion trend information retained.
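Equation (6) is not reproduced in this text, but the fusion step follows directly from the description of $G_f$ and $(1 - G_f)$. The expression below is a reconstruction of that step only; how the gate $G_f$ itself is computed is not recoverable here and is left abstract.

```latex
C_{ATT} = G_f \odot C^{l}_{t-1} + \left(1 - G_f\right) \odot C_{att}
```

Because $G_f$ takes values between 0 and 1, $C_{ATT}$ is an element-wise convex combination of the short-term memory and the aggregated long-term memory.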
The above process widens the temporal receptive field of the prediction unit so that it can capture more historical information. It mitigates the irreversible loss of information during prediction and enhances the prediction capability for high echo regions.

MCA-LSTM Cell
In this subsection, the internal structure of the MCA-LSTM cell is introduced. As shown in Figure 3, the inputs of the MCA-LSTM cell are the current input $X_t$, the spatial memory $M^{l-1}_t$, the temporal memory $C^l_{t-1}$, the historical temporal memory set $C^l_{t-\tau:t-1}$, the historical spatial memory set $M^{l-1}_{t-\tau:t-1}$ and the hidden state $H^l_{t-1}$. The current input $X_t$ and the hidden state $H^l_{t-1}$ are first passed through the context fusion block, which extracts detailed spatio-temporal features at different scales and fuses them, giving the new input $\hat{X}_t$ and hidden state $\hat{H}^l_{t-1}$. The current spatial memory $M^{l-1}_t$, the historical spatial memory set $M^{l-1}_{t-\tau:t-1}$, the temporal memory $C^l_{t-1}$ and the historical temporal memory set $C^l_{t-\tau:t-1}$ are used as the input of the attention module to obtain the enhanced memory unit $C_{ATT}$. The MCA-LSTM unit is calculated as shown in Equation (7), where "MSCF" denotes the multi-scale context fusion module; "ATT" denotes the attention module; $i_t$ is the first input gate; $g_t$ is the first input modulation gate; $f_t$ is the first forget gate; $i_t'$ is the second input gate; $g_t'$ is the second input modulation gate; $f_t'$ is the second forget gate; $o_t$ is the output gate; $C^l_t$ denotes the updated temporal memory unit; $M^l_t$ denotes the updated spatial memory unit; $W$ denotes the corresponding convolution kernel; and $b$ denotes the corresponding bias. "$*$" denotes the 2D convolution operation, "$\odot$" denotes the Hadamard product and $\tau$ is the number of historical time steps. In particular, in the ATT equation, when $l = 1$, $M^{l-1}_t = X_t$ and $M^{l-1}_{t-j} = X_{t-j}$.
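Equation (7) is not reproduced in this text. The block below is a hedged reconstruction rather than the published equation: it takes the standard ST-LSTM update of PredRNN [30] and applies the substitutions described in this subsection, namely the fused $\hat{X}_t$ and $\hat{H}^l_{t-1}$ in place of the raw input and hidden state, and the enhanced memory $C_{ATT}$ in place of $C^l_{t-1}$ in the temporal memory update. The weight subscripts are notation introduced here.

```latex
\begin{aligned}
\hat{X}_t,\ \hat{H}^{l}_{t-1} &= \mathrm{MSCF}\left(X_t,\ H^{l}_{t-1}\right) \\
C_{ATT} &= \mathrm{ATT}\left(M^{l-1}_{t},\ M^{l-1}_{t-\tau:t-1},\ C^{l}_{t-1},\ C^{l}_{t-\tau:t-1}\right) \\
g_t &= \tanh\left(W_{xg} * \hat{X}_t + W_{hg} * \hat{H}^{l}_{t-1} + b_g\right) \\
i_t &= \sigma\left(W_{xi} * \hat{X}_t + W_{hi} * \hat{H}^{l}_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} * \hat{X}_t + W_{hf} * \hat{H}^{l}_{t-1} + b_f\right) \\
C^{l}_{t} &= f_t \odot C_{ATT} + i_t \odot g_t \\
g_t' &= \tanh\left(W_{xg}' * \hat{X}_t + W_{mg} * M^{l-1}_{t} + b_g'\right) \\
i_t' &= \sigma\left(W_{xi}' * \hat{X}_t + W_{mi} * M^{l-1}_{t} + b_i'\right) \\
f_t' &= \sigma\left(W_{xf}' * \hat{X}_t + W_{mf} * M^{l-1}_{t} + b_f'\right) \\
M^{l}_{t} &= f_t' \odot M^{l-1}_{t} + i_t' \odot g_t' \\
o_t &= \sigma\left(W_{xo} * \hat{X}_t + W_{ho} * \hat{H}^{l}_{t-1} + W_{co} * C^{l}_{t} + W_{mo} * M^{l}_{t} + b_o\right) \\
H^{l}_{t} &= o_t \odot \tanh\left(W_{1\times 1} * \left[C^{l}_{t},\ M^{l}_{t}\right]\right)
\end{aligned}
```

The first three gates update the temporal memory, the primed gates update the spatial memory, and the output gate combines both memories into the new hidden state, which matches the roles listed for each symbol in the text.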

MCA-LSTM Network Structure
The network structure of the MCA-LSTM model is presented in

Evaluation Metrics
For evaluation, the critical success index (CSI), Heidke skill score (HSS) and probability of detection (POD) are used in this paper. For this purpose, the pixel values of the ground truth and predicted echo maps are first converted to reflectivity in dBZ, as shown in Equation (8). The predicted echo maps and ground truth maps are then converted to binary matrices by thresholding: if the radar echo value is greater than the given threshold, the corresponding value is set to 1; otherwise, it is set to 0. Following meteorological practice, the true positives TP (prediction = 1, true value = 1), false positives FP (prediction = 1, true value = 0), true negatives TN (prediction = 0, true value = 0) and false negatives FN (prediction = 0, true value = 1) are counted, as shown in the confusion matrix in Table 1. Specifically, 20, 35 and 45 dBZ are chosen as thresholds. CSI and HSS are composite measures that take into account both the detection probability and the false alarm rate, so they directly reflect the quality of a model. Larger CSI and HSS values indicate better performance. Likewise, the larger the POD, the better the forecasting performance of the model.
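The thresholding and scoring procedure above can be sketched as follows. The formulas are the standard definitions of CSI, POD and HSS in terms of TP, FP, TN and FN, which the paper's metrics are assumed to match. The inputs are taken to be flattened sequences of dBZ values, so the pixel-to-dBZ conversion of Equation (8) is deliberately left out because its constants are not given in this text. Returning 0.0 for an empty denominator is a convention chosen here, and the function names are hypothetical.

```python
def confusion_counts(pred_dbz, true_dbz, threshold):
    """Count TP, FP, TN, FN after binarizing both maps at `threshold` (dBZ)."""
    tp = fp = tn = fn = 0
    for pred, truth in zip(pred_dbz, true_dbz):
        pred_hit = pred > threshold   # strictly greater, as in the text
        true_hit = truth > threshold
        if pred_hit and true_hit:
            tp += 1
        elif pred_hit:
            fp += 1
        elif true_hit:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def skill_scores(tp, fp, tn, fn):
    """Return CSI, POD and HSS using their standard definitions."""
    csi_den = tp + fn + fp
    pod_den = tp + fn
    hss_den = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    return {
        "CSI": tp / csi_den if csi_den else 0.0,
        "POD": tp / pod_den if pod_den else 0.0,
        "HSS": 2.0 * (tp * tn - fn * fp) / hss_den if hss_den else 0.0,
    }
```

In practice the counts would be accumulated over all test frames separately for each of the 20, 35 and 45 dBZ thresholds before the scores are computed.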

Experiments and Analysis
In this section, experiments are conducted on the Moving MNIST dataset and the Hong Kong radar dataset, and the results are compared with those of existing models. Four layers of MCA-LSTM cells are stacked according to Figure 4, with the number of channels per cell set to 64 and the convolutional kernel size set to 5 × 5. The comparison models all use the same structure as in Figure 4. All models are trained and tested in PyTorch, and the experiments are run on an NVIDIA A10 GPU (NVIDIA, Santa Clara, CA, USA). The Adam optimizer is used, with a learning rate of 0.0001 and a batch size of 4. To stabilize the training process, the LeakyReLU activation function is used after each convolutional layer in MCA-LSTM.

Results and Analysis
This article uses two common metrics to evaluate performance: mean square error (MSE) and structural similarity index (SSIM). The results are given in Table 2; lower MSE and higher SSIM indicate better predictive performance. As shown in Figure 5, the MCA-LSTM proposed in this paper clearly outperforms the other methods, especially in the prediction of the last two time steps. The MCA-LSTM network retains the details of digit variations well, especially when dealing with overlapping trajectories, and maintains clarity over time. This is because MCA-LSTM introduces a multi-scale context fusion module that extracts the detailed information of moving digits and, at the same time, increases the interaction between the contexts so that each helps the other recognize important information. In addition, the attention module mitigates the forgetting problem in information transfer by drawing on multi-step history from a wider temporal receptive field. In comparison, the predictions of the ConvGRU and ConvLSTM networks blur very quickly and gradually lose detail, because these networks focus only on the temporal information propagated laterally and ignore the spatial information between different cell layers. Other methods, such as PredRNN and PredRNN++, improve the handling of temporal and spatial information, but their results are still unsatisfactory. As the prediction time increases, only MCA-LSTM retains detailed information in the prediction of the last time step, which gives it an advantage in localization accuracy and spatial appearance.

Radar Dataset Experiments Results and Analysis
In the experiments, our model was compared with state-of-the-art models such as ConvGRU, ConvLSTM and PredRNN on the evaluation metrics CSI, HSS and POD.
To provide a comprehensive assessment of prediction accuracy, we report evaluation scores for multiple thresholds (20 dBZ, 35 dBZ and 45 dBZ) corresponding to different rainfall levels. Tables 3-5 show the comparison results of the different methods, with the best results marked in bold. The MCA-LSTM model proposed in this paper performs best at all thresholds, and its advantage becomes increasingly evident as the threshold increases. In particular, at the 45 dBZ threshold the CSI, HSS and POD reach 0.1852, 0.2725 and 0.2239, respectively, which are 11.6%, 11.3% and 14.8% better than the PredRNN algorithm and 38.8%, 35.0% and 47.7% higher than the IDA-LSTM algorithm. This suggests that the multi-scale context fusion module and the attention module help to improve the prediction of high rainfall areas. To better illustrate the results, curves of CSI, HSS and POD at different forecast times (6-120 min) are shown in Figure 6 to compare the models at each time step. Within the first hour of the prediction, the differences between the networks are not significant, and at the 20 dBZ threshold our model has no significant advantage. As the prediction time increases, our model becomes more and more advantageous. This is because the multi-scale context fusion module fully extracts spatio-temporal information at different scales to improve contextual relevance, while the attention module perceives more temporal dynamics from a wider receptive field, which reduces information forgetting and better models short-term and long-term dependencies. As a result, MCA-LSTM better retains the detail of the prediction results and performs better in strong echo regions. In addition, at higher thresholds, the results of PredRNN are not as good as those of the proposed model, because PredRNN does not fully extract contextual correlation information and loses information from its memory units.
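The percentage improvements quoted for the 45 dBZ threshold can be read with a small arithmetic sketch. It assumes the percentages are relative gains, i.e., (score − baseline) / baseline; the text does not state this explicitly, and the baseline scores from Tables 3-5 are not reproduced here, so `implied_baseline` is only a back-calculation under that assumption. Both function names are hypothetical.

```python
def relative_gain(score, baseline):
    """Relative improvement of `score` over `baseline`, in percent."""
    return (score - baseline) / baseline * 100.0


def implied_baseline(score, gain_percent):
    """Baseline that would yield `gain_percent` relative gain for `score`.

    Only meaningful if the reported percentages are relative gains,
    which is an assumption made here.
    """
    return score / (1.0 + gain_percent / 100.0)
```

For example, under this reading a CSI of 0.1852 that is 11.6% better than PredRNN would imply a PredRNN CSI of about 0.166; the tables in the paper remain the authoritative source for the actual baseline values.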
To better compare and understand the results, we visualized the extrapolation results for the radar echoes from 8:00 to 10:00 a.m. on 2 April 2014 in Hong Kong from the radar dataset, as shown in Figure 7. As can be seen from the figure, the prediction results within the first hour do not differ much between the methods, and all of them achieve good results. As the prediction time continues to increase, the extrapolation results of the ConvGRU, ConvLSTM and IDA-LSTM models gradually become blurred, the high echo region gradually shrinks or even disappears and the boundaries of the predicted echoes are gradually smoothed. This is because ConvGRU and ConvLSTM attend to the temporal information propagated laterally but not to the spatial information between cell layers, so they cannot model spatio-temporal information well in long time series prediction. IDA-LSTM, despite using a self-attention mechanism, still cannot aggregate multi-step history information effectively, so it fails to achieve good results in long-term prediction. PredRNN and PredRNN++ take into account the extraction and preservation of temporal and spatial information, and MIM additionally considers non-stationary and approximately stationary characteristics, but none of them effectively addresses the forgetting problem in information transfer. Our proposed MCA-LSTM adopts multi-scale context fusion to effectively extract detailed multi-scale spatio-temporal features, which improves the interaction between contexts. The attention module preserves more historical information by widening the temporal receptive field, which mitigates the information decay problem. As a result, the prediction of high echo regions improves and the predicted images contain richer detail.
In addition, as seen from the ground truth sequence, the intensity of the high echo region increases and its location changes over time. The MCA-LSTM model predicts this trend well, and its prediction results are more detailed. The other deep learning models fail to predict the high echo region, which gradually blurs or even disappears as the prediction time increases.

Conclusions
In this paper, we address the blurring and distortion of radar echo extrapolation results produced by ConvRNN-based methods, particularly the underestimation of high echo regions. We propose a novel deep-learning-based radar echo image extrapolation model called MCA-LSTM for short-term precipitation forecasting based on weather radar data within the 0-2 h range. Comparative experiments are conducted using the Moving MNIST dataset and Hong Kong meteorological radar data. Through a comparative analysis with existing algorithms, the following conclusions are drawn:
1. The proposed multi-scale context information fusion module effectively enhances the contextual relevance of network units and improves the detail of the predicted images by extracting multi-scale feature information.
2. The proposed attention module captures more historical temporal dynamics from a broader perception field, reducing information loss and enhancing the prediction capability for strong echo regions.
By incorporating these two modules into the ST-LSTM network units, a four-layer radar echo extrapolation network (MCA-LSTM) is constructed. Experimental results on the Moving MNIST dataset and the Hong Kong weather radar dataset demonstrate that, compared to recent alternative methods, this approach achieves higher prediction detail and stronger forecasting capability for high echo regions, meeting the requirements for fine-grained predictions in long-term forecasting tasks. Current deep learning algorithms have significantly improved the extrapolation of radar echoes, but they are still some way from matching real-life conditions. In subsequent studies, we will investigate how to take more meteorological factors into account in the radar echo extrapolation task and explore more effective algorithms to further improve the prediction capability of short-range precipitation forecasts.

$H_{t-1}$ are first fused by extracting detailed spatio-temporal features at different scales through the context fusion block, and then the new input $\hat{X}_t$ and hidden states

Figure 2. Attention module embedded in the model.

$M$ denotes the updated spatial memory unit; $W$ denotes the corresponding convolution kernel; and $b$ denotes the corresponding bias. "*" denotes the 2D convolution operation, "⊙" denotes the Hadamard product and $\tau$ is the historical time step. In particular, in the ATT equation, when $l = 1$,

Figure 3. Internal structure diagram of the context fusion attention long-short-term memory unit.

Figure 4. This network is constructed by stacking four layers of MCA-LSTM cells, in which the spatial memory cell $M$ (shown by the black dashed line) is updated in the zigzag direction and the temporal memory cell $C$ (shown by the black solid line) is updated in the horizontal direction, and the top layer outputs the prediction results $\hat{I}_t$.
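The two update directions described in this caption can be made concrete with a small bookkeeping sketch. It only records which cell each memory state comes from and performs no convolution or gating. The zigzag rule is an assumption that follows the usual ST-LSTM (PredRNN) convention, in which the bottom layer at time t receives M from the top layer at time t-1. The function name `trace_memory_flow` and the dictionary layout are hypothetical.

```python
def trace_memory_flow(num_layers=4, num_steps=3):
    """List, for every cell (t, l), the source cell of its C and M inputs.

    C (temporal memory) moves horizontally: same layer, previous time step.
    M (spatial memory) moves in a zigzag: up through the layers within a
    time step, then from the top layer to the bottom layer of the next
    step (assumed ST-LSTM convention).
    """
    trace = []
    for t in range(num_steps):
        for l in range(num_layers):
            # Horizontal path: C comes from the same layer one step earlier.
            c_from = (t - 1, l) if t > 0 else None
            if l > 0:
                # Within a time step, M climbs from the layer below.
                m_from = (t, l - 1)
            elif t > 0:
                # Zigzag turn: the bottom layer reads M from the previous
                # step's top layer.
                m_from = (t - 1, num_layers - 1)
            else:
                m_from = None  # first cell: memories start from zeros
            trace.append({"cell": (t, l), "C_from": c_from, "M_from": m_from})
    return trace


flow = trace_memory_flow()
for entry in flow[:6]:
    print(entry)
```

Reading the printed entries in order shows M visiting every layer of a time step before moving on to the next step, while each C stays within its own layer.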


Figure 5. Results of different methods on the Moving MNIST dataset.

Figure 6. CSI, HSS and POD scores of echo forecasts at 20, 35 and 45 dBZ thresholds for each algorithm.

Figure 7. Prediction results of all methods on an example from the radar dataset. The first row is the input, and the second row is the ground-truth output. The remaining rows are the predictions of the different models.