Raindrop Size Distribution Prediction by an Improved Long Short-Term Memory Network

The observation of and research on raindrop size distribution (DSD) is important for mastering and understanding the mutual restriction relationship between cloud dynamics and cloud microphysics in a precipitation process; it also plays an irreplaceable role in many fields, such as radar meteorology, weather modification, boundary layer land surface processes, aerosols, etc. Using more than 1.7 million minutes of raindrop data observed with 17 laser disdrometers at 17 stations in Anhui Province, China, from 7 August 2009 to 30 April 2020, a DSD training dataset was constructed. The data were fitted to a normalized Gamma function to obtain its three parameters, i.e., the normalized intercept Nw, the mass-weighted mean diameter Dm, and the shape factor μ. Based on the long short-term memory network (LSTM), a DSD Gamma distribution prediction network (DSDnet) was designed. In the modeling process based on DSDnet, a self-defined loss function (SLF) was proposed to improve the DSD prediction by increasing the weight values in the poorly fitted regions relative to the common mean square error loss function (MLF). By means of the training dataset, a DSDnet-based model was trained to realize minute-by-minute predictions of Nw, Dm, and μ over the course of 30 min, and was then evaluated with the test dataset according to three indicators, namely, the mean relative error (MRE), mean absolute error (MAE), and correlation coefficient (CC). The CC of lgNw, Dm, and μ can reach 0.93403, 0.90934, and 0.89741 for 12-min predictions, and 0.87559, 0.85261, and 0.84564 for 30-min predictions, respectively, which means that the DSD prediction accuracy within 30 min can basically reach the application level. Furthermore, the 12- and 30-min predictions of 3 precipitation processes were taken as examples to fully demonstrate the application effect of the model.
The prediction effects of Nw and Dm are better than that of μ, and stratiform precipitation is predicted better than convective and mixed convective-stratiform cloud precipitation.


Introduction
The raindrop size distribution (DSD) reflects the variation in the number concentration of raindrops with size in a unit volume. From the DSD, various microphysical parameters of a precipitation process, such as raindrop number concentration, average diameter, precipitation intensity, and water content, can be calculated. In addition, the development of rainfall can be understood more clearly through the variation of the DSD, which is significant for optimizing the parameterization schemes in weather and climate models [1,2], evaluating the effects of weather modification, improving radar quantitative precipitation estimation, etc.

The radar has a beam width of 1°, a radial resolution of reflectivity of 1 km, and a maximum detection distance of 460 km. It adopts the VCP 21 (volume coverage pattern, scan strategy #2, version 1) scan mode, which comprises 9 elevation angles (0.5°, 1.5°, 2.4°, 3.4°, 4.3°, 6.0°, 9.9°, 14.6°, and 19.5°), and a volume scan takes six minutes. The locations of the radar and disdrometers are shown in Figure 1; all disdrometers are within the radar detection range of 230 km.

Data Preprocessing
To ensure effective deep learning training, data cleaning was crucial. Therefore, before being added to the dataset, unqualified DSD data were discarded, including: (1) data in the first two diameter intervals, i.e., diameters less than 0.187 mm; (2) diameter intervals containing fewer than 2 particles; (3) minutes with fewer than 10 particles in total; (4) diameters greater than 8 mm; and (5) particles whose measured falling velocity differed from the classical value by more than 5 m s−1 [22,23].
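As a sketch, the five screening rules above can be expressed as a single quality-control function; the array names (`diameters`, `counts`, `v_meas`, `v_classical`) are hypothetical, and the real disdrometer record layout may differ:

```python
import numpy as np

def clean_minute(diameters, counts, v_meas, v_classical):
    """Apply the five screening rules to one minute of binned DSD data.
    Returns the cleaned per-bin counts, or None if the minute is rejected."""
    counts = counts.copy().astype(float)
    # (1) drop the first two diameter intervals (D < 0.187 mm)
    counts[:2] = 0
    # (4) drop bins with diameters greater than 8 mm
    counts[diameters > 8.0] = 0
    # (2) drop bins containing fewer than 2 particles
    counts[counts < 2] = 0
    # (5) drop bins whose measured fall velocity deviates from the
    #     classical value by more than 5 m/s
    counts[np.abs(v_meas - v_classical) > 5.0] = 0
    # (3) reject the whole minute if fewer than 10 particles remain
    if counts.sum() < 10:
        return None
    return counts
```

The per-bin rules zero out individual intervals, while rule (3) rejects the whole minute, matching the order in which the paper lists the criteria.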

Normalized Gamma Distribution
The Gamma DSD expression proposed by Ulbrich is:

N(D) = N0 D^μ exp(−λD)    (1)

where N(D) represents the corresponding number concentration of raindrops at the diameter D, N0 is the intercept of the Gamma function, μ is the shape parameter, and λ is the slope. The unit of the Gamma parameter N0 depends on the other parameter μ, so N0 has little physical significance on its own in Equation (1), and N0 can be compared only at the same μ.
To avoid this problem, Willis proposed a normalized Gamma form of the DSD (Equation (2)):

N(D) = Nw f(μ) (D/Dm)^μ exp[−(4 + μ)D/Dm]    (2)

f(μ) = (6/4^4) (4 + μ)^(4+μ) / Γ(4 + μ)    (3)

The three parameters of the normalized Gamma function, namely, the normalized intercept Nw, the mass-weighted mean diameter Dm, and the shape factor μ, have clear physical significance and can be calculated by the moment method, in which Nw, Dm, and μ are obtained from the third-, fourth-, and sixth-order moments of the DSD [24], respectively (Equations (5)-(8)). The expression of the ith-order moment is Equation (4):

M_i = ∫ N(D) D^i dD    (4)
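The moment method above can be sketched as follows. The relations Dm = M4/M3, Nw = (4^4/6)·M3/Dm^4, and the quadratic for μ obtained from the dimensionless ratio η = M4^3/(M3^2·M6) = (μ+4)^2/((μ+5)(μ+6)) are the standard (3,4,6)-moment estimators for the normalized Gamma DSD; they are assumed here to correspond to the paper's Equations (5)-(8):

```python
import math
import numpy as np

def gamma_parameters(D, N, dD):
    """Estimate (Nw, Dm, mu) from a binned DSD.
    D: bin centers (mm); N: concentrations N(D) (mm^-1 m^-3); dD: bin widths (mm)."""
    M3 = np.sum(N * D**3 * dD)
    M4 = np.sum(N * D**4 * dD)
    M6 = np.sum(N * D**6 * dD)
    Dm = M4 / M3                              # mass-weighted mean diameter
    Nw = (4.0**4 / 6.0) * M3 / Dm**4          # normalized intercept
    eta = M4**3 / (M3**2 * M6)                # dimensionless moment ratio
    # eta = (mu+4)^2 / ((mu+5)(mu+6))  =>  (1-eta)mu^2 + (8-11eta)mu + (16-30eta) = 0
    a, b, c = 1.0 - eta, 8.0 - 11.0 * eta, 16.0 - 30.0 * eta
    mu = (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)  # physical (larger) root
    return Nw, Dm, mu
```

Feeding a synthetic normalized Gamma spectrum through this routine recovers the generating parameters to within discretization error, which is a convenient self-check on the estimator.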

Introduction of the LSTM Algorithm
Traditional neural networks (NN) include an input layer, hidden layers, and an output layer. These layers are fully connected, but the nodes within each layer are independent, so it is difficult to handle time-sequence problems with a plain NN. As shown in Figure 2, a recurrent neural network (RNN) is composed of repeated NN modules in a chain, in which A represents a unit, Xt is the input at time t, and Yt is the corresponding output of the unit, which is also fed as an additional input to the next unit at time t+1 for information transmission. The nodes between the hidden layers are therefore connected, and information is passed along the sequence. However, RNN suffers from gradient vanishing and gradient explosion in long-sequence training. As a special RNN, LSTM not only inherits most of the characteristics of RNN, but also effectively avoids these defects. Compared with the RNN, LSTM has a more complex memory cell with three gates, namely, the forget gate, input gate, and output gate (Figure 3) [25,26].
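The gate computations described above can be sketched as a single NumPy LSTM cell. The weights here are random and untrained; the block illustrates the forward pass only, not the trained DSDnet:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold stacked parameters for the forget (f),
    input (i), candidate (g), and output (o) transforms."""
    z = W @ x_t + U @ h_prev + b          # shape (4 * hidden,)
    H = h_prev.size
    f = sigmoid(z[0:H])                   # forget gate: what to erase from c
    i = sigmoid(z[H:2*H])                 # input gate: what to write
    g = np.tanh(z[2*H:3*H])               # candidate cell state
    o = sigmoid(z[3*H:4*H])               # output gate: what to expose
    c_t = f * c_prev + i * g              # new cell state
    h_t = o * np.tanh(c_t)                # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 3, 8                        # 3 inputs per minute: Nw, Dm, mu
W = rng.normal(0, 0.1, (4 * n_hid, n_in))
U = rng.normal(0, 0.1, (4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)

h, c = np.zeros(n_hid), np.zeros(n_hid)
for t in range(18):                       # e.g. an 18-minute input window
    x = rng.normal(size=n_in)
    h, c = lstm_cell(x, h, c, W, U, b)
```

Because the output gate multiplies tanh(c), the hidden state stays bounded in (−1, 1), which is part of why LSTM training is more stable than plain RNN training.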

DSDnet Design
On the basis of LSTM, a DSD prediction network was designed and named DSDnet, which includes two LSTM layers, a linear layer, and an output layer ( Figure 4).

Training Dataset Construction
After preprocessing, the measured raindrop data were re-sorted into chronological order by station. When Nw was non-zero for more than 12 consecutive minutes and the duration of precipitation exceeded 60 min, the sequence was marked as one record of the dataset.
Finally, the dataset was built from a total of 6725 sequences comprising 1,788,915 one-minute samples. These data were normalized with the min-max standardization method to map them into the [0, 1] interval, that is:

y' = (y − min) / (max − min)    (9)

where the maxima of Nw, Dm, and μ were set to 10^5.5 m^−3 mm^−1, 4 mm, and 30, and the minima of Nw and Dm were set to 0, while that of μ was set to −5. If a fitted parameter of the normalized Gamma exceeded its maximum or fell below its minimum, it was set to the corresponding bound. We randomly selected 20% of the dataset, i.e., 1345 sequences, as the test set, and the remaining 80%, i.e., 5380 sequences, were used as the training set. The samples in the training set were rearranged into the input and label form required by DSDnet, namely, how many minutes (T_step) of past data were used to fit the parameter values at the prediction minute (M_pred). The input of DSDnet is therefore a matrix of T_step rows (minutes) and three columns (Nw, Dm, and μ), and the corresponding labels are Nw, Dm, and μ at minute T_step + M_pred. Without loss of representativeness, T_step was set to 18 min and M_pred was set to 12 and 30 min, respectively; that is, the input of the network was an 18 × 3 matrix, and the labels were the actual values of Nw, Dm, and μ 12 or 30 min after the 18-min window. The first input used rows 1 to 18 of a sample, the window then slid down one row to form the second input, and so on, until the labels reached the last row, completing the training dataset's construction. During the modeling process, the samples were shuffled every iteration, and 15% of the training set was randomly extracted as the validation set. Finally, the model was evaluated by the test set.
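The sliding-window rearrangement described above can be sketched as follows, where `seq` is one precipitation record: a (T, 3) array of per-minute (Nw, Dm, μ) values, already min-max normalized:

```python
import numpy as np

def make_windows(seq, t_step=18, m_pred=12):
    """Return inputs of shape (n, t_step, 3) and labels of shape (n, 3),
    where each label is the parameter triple at minute t_step + m_pred
    counted from the start of its window."""
    T = seq.shape[0]
    n = T - t_step - m_pred + 1           # number of windows in this record
    X = np.stack([seq[i:i + t_step] for i in range(n)])
    y = np.stack([seq[i + t_step + m_pred - 1] for i in range(n)])
    return X, y
```

For a 60-min record with T_step = 18 and M_pred = 12, this yields 31 input matrices of shape 18 × 3, with the first label taken at minute 30 of the record, matching the one-row slide described in the text.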
Three precipitation systems, i.e., stratiform, mixed convective-stratiform, and convective clouds, were selected to demonstrate the prediction effects of the DSDnet-based model.

Self-Defined Loss Function
In the process of model training, the fitted value is compared with the true value by a loss function to calculate the loss value E after each iteration. Then, through back-propagation of the loss value, the weight coefficient w_j,k at the kth node of the jth hidden layer is adjusted by an optimizer (Equation (10)):

w_j,k(new) = w_j,k(old) − α ∂E/∂w_j,k    (10)

where the subscripts "new" and "old" represent the weight after and before adjustment, respectively, and α is the learning rate. In essence, training a deep learning model means minimizing the loss function, whose value reflects how far the fitting result is from perfection on a given dataset. For regression problems, the mean squared error (MSE) is often used as the loss function (MLF):

MLF = (1/n) Σ_{i=1..n} (y_pred,i − y_true,i)^2    (11)

where y_pred and y_true are the normalized predicted and measured values, respectively.
Since the three parameters of the normalized Gamma function are usually approximately normally distributed, the frequencies are low in the large-value and small-value regions. If the MLF is used as the loss function, the fitting results will gradually drift toward the middle-value regions, which have higher frequencies. Therefore, an SLF (Equation (12)) is proposed on the basis of the MLF: in the model training process, according to the distribution of the three parameters, different weight coefficients are set for the large- and small-value regions, with higher weights given to these regions (Table 2, in which W is the vector of weight values and L is the vector of normalized parameter values).
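A minimal sketch of the SLF idea as a weighted MSE follows. The region breakpoints and weight values below are hypothetical placeholders; the actual weight vectors W and value-region boundaries L used by the paper are given in Table 2:

```python
import numpy as np

def slf(y_pred, y_true, edges=(0.2, 0.8), w_tail=3.0, w_mid=1.0):
    """Weighted MSE: up-weight samples whose normalized true value falls in
    the sparse small- or large-value regions (outside `edges`)."""
    w = np.where((y_true < edges[0]) | (y_true > edges[1]), w_tail, w_mid)
    return np.mean(w * (y_pred - y_true) ** 2)
```

With equal per-sample errors, this loss exceeds the plain MSE exactly when some samples lie in the up-weighted tails, which is what pushes the optimizer to fit those regions better.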

Hyperparameter Setting
The hyperparameters have a great impact on model training efficiency and prediction accuracy; they include the number of hidden-layer nodes, the number of stacking layers, the batch size, the number of epochs, the learning rate, the optimizer, etc. Since the number of DSD samples reaches the millions, automatic tuning methods suitable for small datasets, such as GridSearchCV, are difficult to apply here; therefore, the hyperparameters were adjusted continuously during modeling. It was found that the model output and convergence speed were influenced by the number of hidden-layer nodes, the number of stacking layers, and the batch size. When the number of stacking layers is greater than 4, or the batch size is less than 1000, the convergence speed of the model decreases significantly. Balancing fast convergence against small error, the number of hidden-layer nodes, stacking layers, and batch size were set to 128, 2, and 4000, respectively. The other hyperparameters were set as follows: the Adam gradient descent optimizer with a learning rate of 0.001 and 500 epochs. Meanwhile, to improve efficiency, an early-stopping mechanism was adopted: if the loss on the validation set did not decrease for five consecutive iterations, training was stopped and the model was saved.
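The early-stopping rule can be sketched as follows; `val_loss` here is a toy stand-in for evaluating the real model on the validation set:

```python
def val_loss(epoch):
    """Toy validation curve: improves for several epochs, then plateaus."""
    return max(1.0 - 0.1 * epoch, 0.5)

def train_with_early_stopping(epochs=500, patience=5):
    """Stop once the validation loss has not decreased for `patience`
    consecutive checks; return the best loss and the epochs actually run."""
    best, bad, history = float("inf"), 0, []
    for epoch in range(epochs):
        loss = val_loss(epoch)
        history.append(loss)
        if loss < best:
            best, bad = loss, 0       # improvement: reset the counter
        else:
            bad += 1                  # no improvement this check
            if bad >= patience:
                break                 # stop and keep the best model
    return best, len(history)
```

With the toy curve above, training halts five epochs after the loss plateaus at 0.5 instead of running all 500 epochs, which is the efficiency gain the paper's mechanism targets.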

The model was evaluated with three indicators, namely, the mean relative error (MRE), mean absolute error (MAE), and correlation coefficient (CC):

MRE(y_true, y_pred) = (1/n) Σ_{i=1..n} |y_true,i − y_pred,i| / y_true,i    (13)

MAE(y_true, y_pred) = (1/n) Σ_{i=1..n} |y_true,i − y_pred,i|    (14)

CC(y_true, y_pred) = Σ_{i=1..n} (y_true,i − ȳ_true)(y_pred,i − ȳ_pred) / √[ Σ_{i=1..n} (y_true,i − ȳ_true)^2 · Σ_{i=1..n} (y_pred,i − ȳ_pred)^2 ]    (15)
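The three indicators can be computed directly in NumPy:

```python
import numpy as np

def mre(y_true, y_pred):
    """Mean relative error."""
    return np.mean(np.abs(y_true - y_pred) / y_true)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def cc(y_true, y_pred):
    """Pearson correlation coefficient."""
    t = y_true - y_true.mean()
    p = y_pred - y_pred.mean()
    return np.sum(t * p) / np.sqrt(np.sum(t**2) * np.sum(p**2))
```

Note that MRE is undefined when a true value is zero, which is one reason the paper clips the normalized parameters away from exact zero before evaluation.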

Modeling Flow Chart
The modeling flow chart is shown in Figure 5.


Model Evaluation by Test Set
After training, the model was evaluated with the test dataset. Figure 6 shows the scatter plots of lgNw, Dm, and μ fitted from the observed data versus those predicted by the model (Figure 6a-c modeled with MLF, and Figure 6d-f with SLF). From the indicators, Nw is predicted better than Dm and μ. The model performs slightly worse in the large- and small-value regions, which is caused by the relative scarcity of data and the greater difficulty of rule mining there. In addition, after using SLF, the model accuracy was greatly improved: comparing the models using SLF and MLF, it can be seen from Table 3 that the MRE of lgNw, Dm, and μ decreased by 3.98%, 3.05%, and 6.48%; the MAE decreased by 2.56%, 3.06%, and 5.78%; and the CC increased by 0.25%, 1.44%, and 1.94%, respectively. Figure 7 is the same as Figure 6 but for the 30-min prediction. Compared with the 12-min case, the correlation decreases and the error of each parameter increases, and the improvement from SLF is more obvious: it can be seen from Table 4 that the MRE of lgNw, Dm, and μ decreased by 12.87%, 7.18%, and 11.3%; the MAE decreased by 13.7%, 7.29%, and 14.75%; and the CC increased by 2.32%, 1.51%, and 2.82%, respectively.

Model Application
To further demonstrate the application effect of the model, the 12- and 30-min predictions of three rainfall cases are used as examples. Representative 0.5° PPIs of the radar are shown in Figures 8, 12 and 15. To facilitate the comparison, the predicted (red line) and fitted values (blue line) are shown in Figures 9, 10, 13, and 16, in which the vertical dashed lines mark minute T_step + M_pred, i.e., the start time of the prediction, and Nw is given as its base-10 logarithm. If Nw is less than 1, lgNw is set to 0. Overall, these figures illustrate that the model predicts the parameters satisfactorily.

Stratiform Cloud Precipitation
This case is a long winter stratiform cloud precipitation process, totaling 1184 min, observed at the Lujiang station (117.17°E, 31.16°N) from 0333 to 2316 LST (local standard time, the same below) on 26 January 2020. Three representative 0.5° PPIs (plan position indicators) of the radar at 0726, 1226, and 1332 LST are shown in Figure 8, corresponding to the 233rd, 533rd, and 599th minutes in Figures 9 and 10, respectively. The red triangle is the location of the disdrometer. The curves of the 12- and 30-min predictions and the actual values of lgNw, Dm, and μ are shown in Figures 9 and 10, respectively, in which the left panels (a, b, c) are modeled with MLF and the right panels (d, e, f) with SLF. The curves show that, apart from slightly larger individual errors, the model gives satisfactory predictions with both loss functions, and the fitting accuracies of Nw and Dm are better than that of μ.

12-min Prediction Results
Most values of lgNw lie between 4 and 4.8, and μ is between 4 and 12. The curves show that when Dm is small, lgNw is large and, on the contrary, when Dm is large, lgNw is small. The maximum value of μ is about 24, and most values of μ are larger than those of the mixed convective-stratiform and convective precipitation (the cases demonstrated below). In addition, the parameter μ fluctuates strongly and is predicted more poorly than Nw and Dm.
Due to the inherent smoothness of LSTM, it is difficult to fit such data, which have strong short-term fluctuations.
The evaluation results are listed in Table 5. The overall prediction was improved after the SLF was adopted: the MRE of lgNw, Dm, and μ decreased by 2.24%, 2.39%, and 13.71%; the MAE decreased by 2.21%, 23.9%, and 13.69%; and the CC increased by 2.31%, 6.26%, and 1.42%, respectively.

30-min Prediction Results
The actual and predicted values of lgNw, Dm, and μ for the 30-min prediction are shown in Figure 10. The performance of the 30-min prediction is relatively poor compared with the 12-min prediction, especially in the regions of large lgNw and Dm, where the predicted values are generally lower than the measured ones. Comparing Tables 5 and 6, the prediction errors for 30 min are larger, and the CC lower, than for 12 min, because the time-series correlation decreases as the prediction horizon increases. Even so, the accuracy of the 30-min prediction was still satisfactory. After the model adopted the SLF, the prediction was greatly improved: the MRE of lgNw, Dm, and μ decreased by 17.68%, 15.05%, and 21.08%; the MAE decreased by 16.69%, 19.72%, and 20.03%; and the CC increased by 3.45%, 0.69%, and 5.81%, respectively (Table 6). As an example, the DSDs of the measurement, Gamma fitting, and model prediction at the 1063rd min are shown in Figure 11, with Figure 11a,b showing the 12- and 30-min predictions, respectively. Overall, the predicted DSD curve was very close to the curve fitted with the measured values, indicating a satisfactory prediction, with the 12-min prediction better than the 30-min one.

Mixed Convective-Stratiform Clouds
This case was a mixed convective-stratiform cloud precipitation process observed at the ChuZhou station (118.17°E, 32.21°N) from 0733 to 2337 LST on 24 November 2015, totaling 965 min in early winter. Figure 12 shows the 0.5° PPI of the radar reflectivity at three typical times, 0828, 0914, and 0949 LST, corresponding to the 55th, 101st, and 136th min in Figure 13. The red triangle is the position of the disdrometer at the ChuZhou station. To save space without losing representativeness, this and the next case only demonstrate the 30-min prediction results. In Figure 13, although the prediction error of the mixed cloud precipitation increases compared with the stratiform precipitation, the performance of the DSDnet-based model remains satisfactory, except for individual points.

Results of 30-min Prediction
Figure 13 shows the change curves of the actual and predicted values of lgNw, Dm, and μ for the 30-min prediction. Contrasting Figures 10 and 13, the parameter fluctuations of the mixed convective-stratiform clouds were wider and more extreme than those of the stratiform clouds, the values of lgNw and μ were smaller, and Dm was larger. lgNw was distributed between 2.4 and 4.8, μ dropped below 0 at times, and the maximum of Dm was around 2.1 mm. Near the 580th min, lgNw and μ decreased rapidly while Dm increased, which may be related to a change in precipitation type; the latter half of the case was dominated by convective precipitation.
In general, the fitting accuracy of lgNw and Dm was better than that of μ. After modeling with the SLF, the prediction was improved: the MRE of lgNw, Dm, and μ decreased by 11.96%, 23.49%, and 19.36%; the MAE decreased by 18.43%, 14.84%, and 7.71%; and the CC increased by 2.07%, 2.50%, and 4.25%, respectively (Table 7). Similarly, as an example, the DSDs of the measurement, Gamma fitting, and model prediction at the 594th min are shown in Figure 14. The Gamma function of the prediction basically coincides with the fitted one, and the 12-min prediction is better than the 30-min one.

The Case of Convective Clouds
This case is a convective cloud precipitation process observed at the Dingyuan station (117.40°E, 32.32°N) from 0355 to 0845 LST on 29 June 2015, totaling 291 min in summer. Figure 15 shows the PPIs at three typical times, 0436, 0522, and 0528 LST, which are the 41st, 87th, and 93rd min in Figure 16. The red triangle is the position of the disdrometer. The actual and 30-min prediction values are shown in Figure 16.

Figure 16 shows the evolution of the observed and 30-min predicted values in this convective rainfall event. Compared with the two types of precipitation above, the Dm values of the cumulus were larger, with a maximum exceeding 2.4 mm; lgNw was distributed between 3.5 and 4.2, and μ again took negative values at times. When Dm was large, μ was small and, on the contrary, when Dm was small, μ was large. The prediction accuracy of the three parameters was clearly lower than in the above two cases, which may be related to the smaller number of cumulus precipitation samples. According to the evaluation results in Table 8, the prediction with SLF was greatly improved: the MRE of lgNw, Dm, and μ decreased by 18.71%, 17.92%, and 20.52%; the MAE decreased by 12.07%, 11.35%, and 13.51%; and the CC increased by 6.99%, 1.94%, and 7.31%, respectively. Similarly, as an example, the DSDs of the measurement, Gamma fitting, and model prediction at the 86th min are shown in Figure 17. Although the error increased, the Gamma function of the prediction was close to the fitted one, and the 12-min prediction was better than the 30-min one.

Conclusions and Discussion
In recent years, artificial intelligence technology has been rapidly applied in all walks of life. It is one of the promising research directions for meteorological and hydrological prediction, in which deep learning algorithms are used to analyze the large amounts of data observed with various meteorological instruments over many years and to extract rules and information from this big data.
On the basis of LSTM, a DSD network (DSDnet) was designed to predict the raindrop size distribution during a precipitation process. However, due to their intrinsic structure, deep learning algorithms tend to fit the high-frequency portions of the data during modeling. Furthermore, as the network depth increases, the information in the earlier layers gradually attenuates and is difficult to transmit to the final output layer.
Aiming at this inherent problem of the LSTM method, a self-defined loss function was proposed to mitigate the over-smoothing by increasing the weights of the small- and large-value regions.
By means of a large amount of data observed with the laser disdrometers, the parameters of the normalized Gamma function, i.e., Nw, Dm, and μ, were first fitted, and then, using them as the input and output factors, a DSDnet-based model was trained to realize high-accuracy minute-by-minute DSD predictions. Compared to using only the common MSE as a loss function, the accuracy of the model trained with the SLF was significantly enhanced according to multiple quantitative evaluation indicators. The prediction results are beneficial to cloud physics and dynamics research, radar quantitative precipitation estimation, and weather modification operations, and can provide new ideas and methods for DSD research.
During the modeling process, the quality control and standardization of the DSD data were indispensable. From the perspective of time, the sequence autocorrelation decreases as the lead time increases, and the prediction accuracy worsens accordingly. From the perspective of cloud types, the prediction of stratiform clouds was better than that of mixed convective-stratiform clouds, which in turn was better than that of convective clouds.
For machine learning, the most important factor is the amount of data. Although the number of samples in this paper exceeds 1.7 million, more observational data are still needed to achieve better fitting.
Herein, the DSDnet-based model was built with all samples together. However, the DSDs in convective and stratiform clouds are significantly different. Although the model predictions are relatively satisfactory, the data from stratiform clouds account for the vast majority of the training dataset, which results in a large error for convective cloud precipitation. Therefore, it would be better to model different cloud types separately if there were sufficient samples. In addition, with the dual-polarization upgrading of weather radars in most countries and regions, the DSD can first be retrieved from dual-polarimetric radar data, and then the model can be used to obtain DSD predictions for the whole radar detection range.
At present, artificial intelligence technology is booming and new deep learning algorithms are constantly emerging; using newer algorithms to construct a DSD prediction network may achieve more accurate predictions.
Author Contributions: Y.Z. led manuscript writing and contributed to data analysis and research design. Z.H. supervised this study, contributed to the research design, manuscript writing and discussion of the results, and served as the corresponding author. S.Y., J.Z., D.L. and F.H. contributed to data analysis and model design. All authors have read and agreed to the published version of the manuscript.