Fusion of rain radar images and wind forecasts in a deep learning model applied to rain nowcasting

Short or mid-term rainfall forecasting is a major task for several environmental applications, such as agricultural management or monitoring flood risks. Existing data-driven approaches, especially deep learning models, have shown significant skill at this task, using only rain radar images as inputs. In order to determine whether using other meteorological parameters such as wind would improve forecasts, we trained a deep learning model on a fusion of rain radar images and wind velocity produced by a weather forecast model. The network was compared to a similar architecture trained only on rainfall data, to a basic persistence model and to an approach based on optical flow. Our network outperforms the F1-score calculated for the optical flow on moderate and higher rain events for forecasts at a horizon time of 30 minutes by 8%. Furthermore, it outperforms the same architecture trained using only rainfalls by 7%.


Introduction
Forecasting precipitations at short and mid-term horizon (also known as rain nowcasting) is important for real life problems, for instance the World Meteorological Organization recently stressed out concrete applications in agricultural management, aviation, or management of severe meteorological events [1]. Rain nowcasting requires a quick and reliable forecast of a process which is highly non-stationary at local scale. Due to the strong constraints of computing time, operational short-term precipitation forecasting systems are very simple in their design. To our knowledge there are two main types of operational approaches all based on radar imagery. Methods based on storm cell tracking [2][3][4][5][6] try to match image structures (storm cells, obtained by thresholding) seen between two successive acquisitions. Matching criteria are based on similarity and proximity of these structures. Once the correspondence and their displacement have been established, the position of these cells is extrapolated to the desired time horizon. The second category relies on the estimation of a dense field of apparent velocities at each pixel of the image and modeled by the optical flow [7,8]. The forecast is also obtained by an extrapolation in time and advection of the last observation with the velocity field.
Over the past few years, machine learning proved to be able to address rain nowcasting and was applied in several countries [9][10][11][12][13][14]. More recently, new neural network architectures were used as [15] adapting a PredNet [16] to rain nowcasting in the region of Kyoto [17] and using a U-Net architecture [18] in the region of Seattle to predict low to middle intensity rainfalls. The key idea in these works is to train a neural network on sequences of consecutive rain radar images in order to predict the rainfalls at a subsequent time. Although rain nowcasting based on deep learning is widely used, it is driven by observed radar or satellite images. In this work, we propose an algorithm merging meteorological forecasts with observed radar data to improve these predictions.
Météo-France (the French national weather service) recently released MeteoNet [19], a database that provides a large number of meteorological parameters on the French territory. The data available is as diverse as rainfalls (acquired by Doppler radars of the Météo-France network), outcomes of two meteorological models (ARPEGE and AROME), topographical masks and so on. The outcomes of the weather forecast model AROME include hourly forecast of both wind velocity and direction, considering that advection is a prominent factor in precipitations evolution we chose to include wind as a significant additional predictor.
The forecasts of the neural network are based on a set of parameters weighting the features of their inputs. A training procedure adjust the network's parameters to emphasise the weight on the features significant for the network's predictions. The deep learning model used in this work is a shallow U-Net architecture [18] known for its skill in image processing [20]. Moreover this architecture is flexible enough to easily add relevant inputs, an interesting property for data fusion. Two networks were trained, both on MeteoNet data in the region of Brest in France, their inputs were sequences of rainfall radar images and wind forecasts at consecutive time steps spanning over one hour and their targets were rainfall maps at the horizon of 30 minutes for the first one and 1 hour for the second.
An accurate regression of rainfall is an ill-posed problem, mainly due to issues of an imbalanced dataset, heavily skewed towards null and small values. We made the choice to transform the problem into a classification problem, similarly to [17]. This approach is also relevant given the potential uses of rain nowcasting, especially in predicting flash flooding, in aviation and agriculture, where the exact measurement of rain is not as important as the reaching of a threshold [1]. Using Météo-France orders of magnitude on precipitation [21] we split the rain data in several classes depending on its precipitation rate. A major issue faced during the training is rain scarcity, given that an overwhelming number of voxels in the database are null, the training dataset will be imbalanced in favor of null data which makes it quite difficult for a neural network to extract significant features during training. We present a method of data re-sampling and a normalization procedure to address this issue. We compared our model to the persistence model which consists in taking the last rain radar image of an input sequence as the prediction (though simplistic, this model is frequently used in rain nowcasting [12,13,17] and can prove difficult to outperform) and to a similar neural network trained using only radar rain images as inputs. We also compare our approach with an operational and optical flow-based rain nowcasting system [22]. Our models present satisfactory forecasting skill at horizon times of thirty minutes and one hour and outperform both comparison models indicating that data fusion has a significant positive impact. Merging rain and wind data stabilized the training process and enabled significant improvement especially on the difficult-to-predict high precipitation rainfalls.

Problem statement
Two types of images are used: rain radar images (also named rainfall maps, see Figure 1) providing for each pixel the accumulation of rainfall over 5 minutes and wind maps (see Figure 2) providing for each pixel the wind velocity components U and V. Both rain and wind data will be detailed further in section 3.
Each meteorological parameter (rainfall, wind velocity U and wind velocity V) is available across the metropolitan France at distinct time steps. The images are stacked along the temporal axis to form a cube whose indexes will now be defined.  Table 3. Grey corresponds to missing data. . Version December 10, 2020 submitted to Remote Sens. Each voxel of the cube is indexed by three indices (i, j, k). i and j index space and respectively map a voxel to its longitude lon j and latitude lat j . k indexes time and maps a voxel to its time step t k . In the following, time and spatial resolutions are assumed to be constant.

of 23
For a voxel (i, j, k) let CRF i,j,k be the cumulative rainfall between times t k−1 and t k at longitude lon i and latitude lat j . Let U i,j,k and V i,j,k be respectively the horizontal (East to West) and vertical (South to North) wind velocity vectors at t k , longitude lon i and latitude lat j . Finally let M i,j,k = (CRF i,j,k , U i,j,k , V i,j,k ) be the vector stacking all data. Given a sequence of MeteoNet data (M i,j,k−1 , · · · , M i,j,k−s ) where s ∈ N is the length of the sequence, the aim is to forecast the rainfall at a subsequent time CRF i,j,k+p where p ∈ N is a given timelag.
To define the classes let us consider an ordered sequence of N L ∈ N limits (L m ) ∈ R N L . Those boundaries split [L 0 ; +∞[ in N L classes defined as follows : ∀m, class C m is defined by C m = {CRF i,j,k ≥ L m | ∀(i, j, k)}. Splitting the cumulative rainfalls in N L classes enables to model the problem as a classification problem, it converts the problem from directly predicting CRF i,j,k+p to determining to which classes CRF i,j,k+p belongs. Considering that classes are embedded, a prediction can belong to several classes. This type of problem is formalised as multi-label classification problem and it is classic to transform a multi-label classification problem with N L limits into N L independent binary classification problems using the binary relevance method [23]. N L independent binary classifiers are trained; classifier m determining the probability that CRF i,j,k+p exceeds the threshold L m .
Knowing states (M i,j,k−1 , · · · , M i,j,k−s ) for all voxels, the classifier associated to class C m will model the probability P m i,j,k that the cumulative rainfall CRF i,j,k+p belongs to C m : Ultimately, the sequence of probabilities (P m i,j,k ) m is compared to 0.5 to determine the belonging of CRF i,j,k+p to the different classes. The final forecast will be the class with the highest m that verifies P m i,j,k ≥ 0.5. Let us consider the assumption of the independence of each classifier relatively to the others. Due to the class ordering, it is clear that for a given voxel (i, j, k), P m i,j,k is a decreasing sequence in m. However, because of the assumption on independence, each classifier makes a prediction without knowing the results of the others. Therefore a voxel for which ∃m, P m < P m+1 could in theory exist which contradicts the definition of the classes. An alternative modelling based on the classifier chains method [24] would take this drawback into account, however in practice this phenomenon does not occur so the assumption of independence can be made.

Data
MeteoNet [19] is a Météo-France project gathering meteorological data on the French territory. Every data type available spans from 2016 to 2018 on two areas of 500 km × 500 km each, framing the North West and the South East parts of French metropolis. This paper focuses on rain radar and wind data in the North West area. The rain data in the North West part of France provided by MeteoNet is the cumulative rainfall over time steps of 5 minutes. The acquisition of the data is made using Météo-France Doppler radar network: each radar scans the sky to build a 3D reflectivity map, the different maps are then crossed by Météo-France to remove meteorological artifacts and to obtain MeteoNet rainfall data. The spatial resolution of the data is 0.01 degrees (roughly 1 km × 1.2 km). More information about the Météo-France radar network can be found in [25] and about the measurement of rainfall in [26].
The data presented in MeteoNet are images of size 565 × 784 pixels, each pixel's value being the CRF over 5 minutes. These images are often referred to as rainfall maps in this paper (see Figure 1 for an example).
The aim is to predict the rainfall at the scale of a French department hence the study area has been restricted to 128 × 128 pixels (roughly 100 km × 150 km). However the quality of the acquisition is non uniform across the territory, MeteoNet provides a rain radar quality code data (spanning from 0 to 100 per cent) to quantify the quality of the acquisition on each pixel (see Figure 3). The department of Finistère is mainly inland and has an overall quality code score over 80% hence the study area has been centered on the city of Brest.

Definition and distribution of rainfall classes in the training base
Treating this problem as a classification problem requires proper classes definitions which relied on Météo-France precipitation scale [21] that quantitatively defines the qualitative appreciations of "no rain" (0 to 1 millimeters of rain per hour), "continuous light rain" (1 to 3 millimeters of rain per hour), "moderate rain" (4 to 7 millimeters of rain per hour) and "heavy rain" (over 8 millimeters of rain per hour). Hence the following bounds were selected: L 0 = 0 mm /h, L 1 = 0.1 mm /h, L 2 = 1 mm /h and L 3 = 2.5 mm /h. In the following, we define 3 classifiers (N L = 3) associated to L 1 , L 2 and L 3 . Pixels are classified C 0 if they are rejected from the classifiers for higher classes.
The definition of the classes is summarized in Table 1. Note that Météo-France scale is in millimeters of rain per hour whereas MeteoNet provides the cumulative rainfall over 5 minutes hence a factor 1/12 is applied.
Very light rain Continuous light rain Moderate rain and higher The distribution of the classes across the database will now be further detailed. For each class, let us consider the part of its interval that does not overlap with the interval defining the subsequent class. The proportion of voxel of the training base (see Subsection 4.1 for definition of training base) belonging to each of those intervals is assessed in Table 2. To calculate the percentage of voxels by interval, only data of the training base were considered. This table clearly shows that voxels corresponding to "no rain" are highly dominant, highlighting the scarcity of rain events. Table 2. Percentage of voxels of the training base belonging to each interval of precipitation. The intervals of precipitation considered are the intersections of C m and C C m+1 .
Secondly, in order to evaluate the distribution of classes across rainfall maps (see Figure 1 for an example of rainfall map), the histogram of the maximum CRF of each rainfall map restricted to the study area was calculated and is presented in Figure 4.
Similarly to Table 2 only the data of the training base were considered. This histogram shows that voxels above 2.5 mm /h are present and evenly distributed among the rainfall maps: even if they only account for 1.2% of the total proportion of voxels, more than 30% of rainfall maps contain at least one of those voxels. Hence it is likely that the voxels of this class will form small patches distributed across the images. This phenomenon and rain scarcity are major problems in rain nowcasting, because of them the adjustment of the weights of the neural network during the training phase will be unstable for the highest classes which will interfere with the forecasts. Hence, the higher the class the more difficult to predict. This problem will be tackled in subsection 4.2.
A summary of classes definition and distribution is presented in Table 3. No rain L 0 =0 92.6% 39% 1 Very light rain Continuous light rain L 2 =1 1.7% 9% 3 Moderate rain and higher L 3 =2.5 1.2% 34% Table 3. Summary of classes definition and distribution.
Version December 10, 2020 submitted to Remote Sens.

Presentation of the MeteoNet wind data
MeteoNet provides the weather forecasts made by two Meteo-France weather forecast models: AROME and ARPEGE. Because it provides a better precision in both time and space, only AROME data were used. From 2016 to 2018 AROME was run every day at midnight, forecasting wind velocity and direction for every hour of the day. The wind related data available are wind velocity (in m /s, wind direction (in degrees), U component (wind velocity vector component from west to east in m /s) and V component (wind velocity vector components from south to north in m /s). These forecasts are made with a spatial resolution of 0.025 degrees (1 km) at 10 meters above the ground. The data presented in MeteoNet is equivalent to images of size 227 × 315 pixels, each pixel's value being the wind velocity or wind direction at a given time. In the following, those images are referred to as wind maps (see Figure 2 for an example). Figure 5 represents the histograms of the mean wind velocity and the mean wind direction across wind maps. For the calculation, only data of the training base (see 4.1 for definition) were considered. The wind velocity distribution is similar to a Gamma distribution and as expected the wind direction is mainly directed from the ocean to the shore. Among the parameters provided by AROME, only U and V will be used in the rest of the study. In order to fit to the mesh of rainfalls, AROME data were linearly interpolated on both space and time.

Proposed approach
A major problem in rain prediction is rain scarcity which causes imbalanced classes, it has been addressed by a re-sampling procedure and normalization procedure that will be detailed in this section. The network architecture and training will also be discussed after defining the training, validation and test datasets.

Definition of the sets of data
The data is split in training, validation and test sets. Years 2016 and 2017 are used for training set. For year 2018, one week out of two is used in the validation set and the other one in test set. Before splitting, one hour of data is removed at every cut to prevent data leakage [27]. The splitting process is done on the whole year to assess seasonal effects of the prediction in both validation and test sets.
The inputs of the network will be sequences of MeteoNet images spanning over 1 hour (corresponding to 12 time steps) hence the input dimension is 36 × 128 × 128 (12 rainfall maps, 12 wind maps U, 12 wind maps V). In the formalism defined in Section 2: s = 36.
Each input sequence is associated to its prediction target which is the rainfall map coming 30 minutes (p = 6) or one hour (p = 12) after the last image of the input sequence thresholded based on the N L = 3 different thresholds. For a given sequence of voxels (M i,j,k−1 , ..., M i,j,k−s ), the voxel target is composed of 3 channels, for m ∈ 1.. N L it is CRF i,j,k+p thresholded by (L m ). This value is named T m i,j ∈ {0, 1}. An example of an input sequence and its target is given in Figure 6. If the input sequence or its target contains undefined data (due to a problem of acquisition and numbered -1 in MeteoNet), the sequence is set aside and will not be considered. Each input corresponds to a distinct hour: there is no overlapping between to different input but note that overlapping can be an option to increase the size of the training training even if it can result in overfitting. It has not been used here as the actual training set contains 17539 sequences which is considered to be large enough. The validation set contains 4293 sequences and the test set contains 4150 sequences.

Dealing with rain scarcity: over-sampling
Oversampling consists in selecting a subset of sequences of the training base and multiply them so that they appear several times in each epoch of the training phase (see section 4.5 for details on epochs and the training phase).
The main issue in rain nowcasting is to tackle rain scarcity that causes imbalanced classes. Indeed, in the training base 93% of voxels do not have rain at all (see Table 2). On the other hand, the last class is the most under-represented in the dataset and thus it will be the most difficult to predict. An over-sampling procedure is proposed to balanced this under-representation. Note that the validation and test sets are left untouched to consistently represent the reality during the evaluation of the performance.
Currently sequences (sequences are defined in 4.1) whose target contains a voxel of the last class represent roughly one third of the training base (see Figure 4). Those sequence will be duplicated until their proportion in the database reaches η ∈ [0, 1], η being a parameter to be tuned during the training phase.
Let N be the number of sequences in the training base before the oversampling procedure. Let N 3 be the number of sequences whose target contains a voxel of the last class in the training base before the oversampling procedure. Let N * be the number of sequences in the training base after the oversampling procedure. Let N * 3 be the number of sequences whose target contains a voxel of the last class in the training base after the oversampling procedure. By definition, N * 3 = η × N * . During the oversampling procedure only sequences whose target contains a voxel of the last class will be oversampled, hence N − N 3 stays constant i.e. N * − N * 3 = N − N 3 and finally N * = N−N 3 1−η . To reach the proportion η, each sequence of the original training base whose target contains a voxel of the last class is assigned the probability 1 N 3 to be drawn. Then the original training base is completed by drawing N * − N sequences with replacement. With this procedure it is insured that the proportion of sequences whose target contains a voxel of the last class is η.
The impact and tuning of the parameter η is assessed in section 5.2.

Data normalization
The data used as an input to train and validate the neural network is first normalized. The normalization procedure for the rain is the following. Let max(CRF) be the maximum cumulative rainfall over the training dataset, the following transformation is applied to every voxel: This invertible normalization function brings the dataset into the [0, 1] range while spreading out the values closest to 0. As for wind data, considering that U and V follow a Gaussian distribution, let µ and σ be respectively the mean and the standard deviation of wind over the overall training set, it is applied to every voxel of wind:

Network architecture
Convolutional neural networks (CNN) is a feed-forward neural network stacking several layers: each layer uses the output of the previous layer to calculate its own output. CNN proved their usefulness computer vision during the past decade [28] . We decided to use a specific type of CNN, the U-Net architecture [18] due to its successes in image segmentation. We chose to perform a temporal embedding through a convolutional model, rather than a LSTM or other recurrent architectures used in other studies (such as [9] and [10]), given the lack of memory of the phenomenon to be predicted. The inclusion of previous time-steps remains warranted however: the full state of the system is not observed, and the temporal coherence of the time-series constrains our prediction to better fit the real rainfall trajectory. The details of the selected architecture is presented in Figure 7. Like any U-Net network, the architecture is composed of a decreasing path also known as encoder and a increasing path also known as decoder. The encoding path starts with with two convolutional layers. Then it is composed of four consecutive cells, each being a succession of a max-pooling layer (detailed further) followed by two convolutional layers. Note that each convolutional layer used in this architecture is followed by a Batch-norm [29] layer and a rectifier linear unit (ReLU) [30], Batch-norm and ReLU will be detailed further. At the bottom of the network, two convolutional layers are applied. Then the decoding path is composed of four consecutive cells each being a succession of a bilinear up-sampling (detailed further) layer followed by two convolutional layers. Finally a fully connected layer combined to an activation function maps the output of the last cell to the segmentation map (mapping every voxel to one class). The operation and aim of each layer will now be detailed.

Convolutional layers
Convolutional layers perform a convolution by a kernel of 3 × 3, a padding of 1 is used to preserve the input size. The parameters of the convolutions are to be learnt during the training phase. Each convolutional layer in this architecture is followed by a Batch-norm and a ReLU layer.
A Batch-norm layer re-center and re-scale its inputs to ensure that the mean is close to 0 and the standard deviation is close to 1. This normalization helps the network to train faster and to be more stable [31]. For an input batch Batch, the output is y = E[Batch] √ V[Batch]+ γ + β where γ and β are trainable parameters and = 10 −5 in our architecture.
A ReLU layer, standing for rectifier linear unit, applies the following non linear function f : x ∈ R → max(0, x). Adding non linearities enables the network to model non linear relations between the input and output images.

Image scalers
In order to upscale or downscale the images we use two types of layers. The Max-pooling layer is used to downscale the image features in the encoding part. It browses the input with a 2 × 2 filter and maps each patch to its maximum. It reduces the size to the image by a factor 4 between each level of the encoding path. It also contributes to prevent overfitting by reducing the number of parameter to be optimizes during the training.
The bilinear up-sampling layer is used to to upscale the image features in the decoding part. It performs a bilinear interpolation of the input resulting the output image's size to be twice the one of the input.

Skip connections
Skip connections are the trademarks of the U-Net architecture. The output of an encoding cell is stacked to the output of a decoding cell of the same dimension and the stacking is used as input for the next decoding cell. Hence skip connections spread some information from the encoding path to the decoding path and thus help preventing the vanishing gradient problem [32].

Fully connected layer
The final layer is a fully connected convolutional layer (convolutional layer with a 1 × 1 kernel). Its output is a 3 × 128 × 128 image, each voxel has 3 channels: one for each class. For a given voxel let s m be the output for the channel m(0 ≤ m < 3) corresponding to the score for class m. The output s m is then transformed using the sigmoid function: where P m i,j,k is defined in 2. Note that, following the class definition in 1, one voxel can belong to several class.
Finally, the output voxel is said to belong to the class m if P m i,j,k ≥ 0.5.

Network training
Let θ be the vector of length N θ containing the trainable parameters (also name weights) that are to be determine through the training procedure. The training process consists in splitting the training dataset in several batches, inputting successively the batches into the network, calculating the distance between the predictions and the targets thanks to a loss function (see definition hereafter) and finally, based on the calculated loss, updating the network weights using an optimization algorithm. The training procedure is repeated during several epochs (one epoch being achieved when the entire database has gone throw the network) and aims at minimizing the loss function.
For a given input sequence (M i,j,k−1 , ..., M i,j,k−s ) let us define the loss function Loss comparing the output P m i,j,k to its target T m i,j,k : This loss is averaged across voxels of the batch, then a regularization term is added: The loss function minimizes the discrepancy between the targeted value and the predicted value and the second term is a square regularization aiming at preventing overfitting and distributing the weights more evenly. The importance of this so-called 2 -regularization in the training process is weighted by a factor δ.
The optimization algorithm used is Adam [33] (standing for Adaptive Moment Estimation), which is a stochastic gradient descent algorithm. The parameters recommended in the paper were used: β 1 = 0.9, β 2 = 0.999 and = 10 −8 .
Moreover, in order to prevent exploding gradient, the gradient clipping technique which consists in re-scaling the gradient if it becomes too large in order to keep it small was used.
The training procedure for the two neural networks was the following: • The network whose horizon time is 30 minutes was trained on 20 epochs. Initially, the learning rate was set to 0.0008 and after 4 epochs it was reduced to 0.0001. After epoch 13 the validation F1-score (the F1-score is defined in 5.1) was not increasing. We selected the weights of epoch 13 because their F1-score was the highest. • The network whose horizon time is 1 hour was trained on 20 epochs. Initially, the learning rate was set to 0.0008 and after 4 epochs it was reduced to 0.0001. After epoch 17 the validation F1-score was not increasing. We selected the weights of epoch 17 because their F1-score was the highest.
The network is particularly sensitive to hyper-parameters, specifically the learning rate, the batch size and the percentage of oversampling. The tuning of the oversampling percentage is detailed in Section 4.2) and the other hyper-parameters used to train our models are presented in Table 4.

Epochs
Learning Rate Batch Size Oversampling (%) Regularization Gradient Clipping 4 and under 0.0008 256 0.9 10 −5 0.1 Above 4 0.0001 256 0.9 5 × 10 −5 0.1 Table 4. Table of hyper-parameters used to train both networks. The first phase correspond to the first few epochs during which the loss is decreasing, the second phase corresponds to the rest of the training.
Our network was implemented and trained using PyTorch 1.5.1. The neural networks was trained on a computer with a CPU Intel(R) Xeon(R) CPU E5-2695 v4, 2.10GHz and a GPU PNY Tesla P100 (12GB).
For the implementation details, please refer to the code available online: some demonstration code to train the network, the weights and an example of usage are available on our GitLab repository 1 and archived in Zenodo [34].

Scores
Defining whether a prediction is wrong or right in a multi-label classification problem is not as straight forward as in other classification problems. Indeed a prediction can contain a subset of the correct 1 Supplementary material: https://github.com/VincentBouget/rain-nowcasting-with-fusion-of-rainfall-and-wind-data-article classes and thus be partially right. Proper metrics must be defined to correctly evaluated the full scale of correctness of a prediction. Many metrics can be found in the literature [35] but micro-precision and micro-recall were found to be the most suitable.
As our algorithm is a multi-label classification problem, each of the N L classifiers is assessed independently to the others.
For a given input sequence (M i,j,k−1 , ..., M i,j,k−s ) let us compare the output P m i,j,k to its target T m i,j,k , four possible outcomes can be obtained: True Positive TP m i,j,k , True Negative TN m i,j,k , False Positive FP m i,j,k and False Positive FN m i,j,k . Summing the outcomes of all voxel composing the database, micro-precision and recall for classifier m can be defined: Note that, in theory, if all voxels of a certain class are predicted to be null (i.e. no rain), the denominators of those scores is null. Nevertheless, as the simulation is done over all the samples of the validation or test dataset, this situation hardly occurs in practice.
Based on those definitions for precision and recall the F1-score F1 m for classifier m can be defined as: This score will be used to compare our models performance.

Tuning of the oversampling parameter
The percentage of oversampling η defined in Section 4.2 is an important parameter as it will highly modify the training database hence its impact will now be investigated. Several runs with different values of η are made and the F1-score calculated on the validation set is collected: see Figure 8 and Figure 9. It appears on those figures that the oversampling procedure has three main advantages: the F1-score converges faster, the results are higher for all classes and most of all it stabilizes the training process. On raw data the problem is not that the network can't learn but that it is not robust and thus the results will poorly generalize. The over-sampling procedure tackles this important issue. Based on this we decided to use a oversampling percentage of 90% (η = 0.9) because under this value the training phase is quite unstable and above this value the network will tend to overfit quite fast.

Impact of the oversampling on the training base
The proportion of pixels before and after oversampling is compared in Table 5. Due to the randomness of the oversampling procedure, the percentages can slightly differ from one training base to another.

Rain nowcasting with optical flow
We briefly present the optical flow method used in Subsection 5.4. Let I be a sequence of images, the optical flow is the advection equation of I by velocity W = (U, V) at pixel (x, y) and time t: where ∇ is the gradient operator and T the transpose operator i.e. ∇I T = ∂I ∂x ∂I ∂y . In the context of Rain Nowcasting, I = CRT. Equation (11) is ill-posed, the classic approach [36] is to restrict the space of solution to smooth functions using Tikhonov regularization, the following cost-function being minimizing: Ω stands for the image domain. Regularization is driven by the hyperparameter α. The gradient is easy to derive using calculus of variation. As the cost function E is convex, standard convex optimization tools can be used to obtain the solution. This approach is known to be limited to small displacements. A solution to set up this issue is to use a data assimilation approach as described in [22]. Once the estimation of velocity field W = ( U, V) computed, the last observation I last is transported, Eq. (12), at the wished temporal horizon. The dynamics of thunderstorm cells is non stationary, the velocity should also be transported by itself, Eq. (13). Finally the following system of equations is integrated in time to the wished temporal horizon t h .

Results
We trained two neural networks for predictions at horizon times of 30 minutes and 1 hour according to the training procedure defined in 4.5. To provide comparison models, for each horizon time we also trained another neural network using only rainfall maps as inputs. The results are compared with the naive baseline given by the persistence model, which consists in taking the last rainfall map of an input sequence of prediction and to the optical flow approach. Figures 10 and 11 present two examples of prediction at 30 minutes made by the neural networks trained using rainfalls and wind. The forecast is compared its target, to the persistence and to the optical flow. The comparison shows that the network was able to model advection in addition to be quite close to the target.
Using the method proposed in [37] the results were calculated on the test set and 100 bootstrapped samples were used to calculate mean and standard deviation. The test set remained unused until final evaluation to prevent data leakage. About the results presented in Table 6: Figure 10. Comparison of forecasts at a horizon of 30 minutes made by different models for an horizon time of 30 minutes. The three fist rows respectively correspond to the target, the persistence, the optical flow and the neural network. For each forecast, in addition to the raw prediction the difference to the target is given. The gap between the persistence and the target reveals the transport of the rain cells during the elapsed time. Figure 11. For caption details see Figure 10. The neural network tends to smooth the rain cells and its forecast contrary to the target which is quite sparse.
• Training a neural network using only rainfalls maps inputs is sufficient to outperform the persistence at both horizon times of 30 minutes and 1 hour. The difference is significant considering the standard deviation. It also outperforms optical flow except for class 3. This is not surprising because the optical flow is sensitive to the structures contrast and responds better when this contrast is large, which is the case of pixel belongings to class 3. • As expected, adding wind velocity map to the inputs improves significantly the results. It both outperforms persistence and optical flow. It enables the model to perform better and to stabilize the training procedure. This effect is all the more so as important on higher classes which are the most difficult to train. • Since the neural network trained on rainfall and wind velocity maps beats the neural network trained only on rainfall maps, that confirms the dynamics of thunderstorm cells can not fully be deduced from rainfall maps. And as this neural networks also beats optical flow, that shows the same limitations for the optical flow approach: the wind is not exactly the evolutionary dynamics of the thunderstorm cells visible in radar images and the physics described by the advection law of cells by the optical flow is obviously very approximating. • In Figure 11 the neural network is able to predict the global shape of the rainfall but proves unable to predict the details at a small scale resulting in very smooth borders that lack realism.

Conclusion and future work
The aim of this paper is to study the impact of merging rain radar images with wind forecasts to predict rainfalls in a near future. With few meteorological parameters used as inputs, our model was able to forecast rainfalls with satisfactory results.
The problem was transformed into a classification problem by defining classes corresponding to "no rain", "very light rain", "continuous light rain" and "moderate or higher rain". In order to overcome the imbalanced distribution of these classes we performed an oversampling of the higher CFR classes. The F1-score calculated on the highest class for forecasts at a horizon time of 30 minutes is 45%, our model was compared to a basic persistence model and to an approach based on optical flow and outperformed both. Furthermore, it outperforms the same architecture trained using only rainfalls by 7% hence this paper can be considered as a proof of concept that data fusion has a significant positive impact on rain-nowcasting.
An interesting future work would be to fusion the inputs to another determining parameter, such as the orography, which could lead to better results. Optical flow performances provided promising results and it would be interesting to investigate their inclusion through a scheme combining deep learning and data assimilation.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The