Artiﬁcial Neural Networks and Multiple Linear Regression for Filling in Missing Daily Rainfall Data

: As demand for more hydrological data has been increasing, there is a need for the development of more accurate and descriptive models. A pending issue regarding the input data of said models is the missing data from observation stations in the ﬁeld. In this paper, a methodology utilizing ensembles of artiﬁcial neural networks is developed with the goal of estimating missing precipitation data in the extended region of Chania, Greece on a daily timestep. In the investigated stations, there have been multiple missing data events, as well as missing data prior to their installation. The methodology presented aims to generate precipitation time series based on observed data from neighboring stations and its results have been compared with a Multiple Linear Regression model as the basis for improvements to standard practice. For each combination of stations missing daily data, an ensemble has been developed. According to the statistical indexes that were calculated, ANN ensembles resulted in increased accuracy compared to the Multiple Linear Regression model. Despite this, the training time of the ensembles was quite long compared to that of the Multiple Linear Regression model, which suggests that increased accuracy comes at the cost of calculation time and processing power. In conclusion, when dealing with missing data in precipitation time series, ANNs yield more accurate results compared to MLR methods but require more time for producing them. The urgency of the required data in essence dictates which method should be used.


Introduction
The successful development of reliable models for predicting the status of water resources of a particular region is inextricably linked to the quantity and quality of the climate and hydrological data used [1]. One of the most critical pieces of data for such a study is the available rainfall data in the area of interest [2]. The possibility of errors or gaps within an available rainfall data time series is real and may be due to errors in the measuring instruments, a possible instrument failure, or an extreme weather event. Therefore, the development of a model capable of accurately simulating, or even complementing, a time series of rainfall data is necessary.
The importance of rainfall data availability is inarguable in hydrological modelling as these data are an essential input parameter in almost any approach. Previous research has supported the notion that the traditional statistical methods for infilling (imputing) missing data may be inefficient for small temporal and spatial scales [3,4]. Thus, an indicator of the success of the model is its outperformance over standard interpolation methods. Such practices have become more nuanced over the years, specifically with the incorporation of weighting factors that compensate for the variation between stations due to the morphological features of each case study [5].
When looking at the recently published scientific literature, Artificial Neural Networks have shown encouraging results in modeling nonlinear problems, such as hydrological processes [6]. They are able to recognize strong seasonal patterns without the need for preprocessing raw data to remove outliers, and there is solid evidence that supports the accuracy of their prediction [7]. A work similar to the current article has been conducted using meteorological data from the internet, with the intent of forecasting future rainfall using multi-layer perceptron (MLP) with back propagation and optimization algorithms [8]. In another work, the MLPs are used for forecasting future precipitation using rainfall data from nearby weather stations as inputs [9]. As an alternative method for monthly rainfall prediction, it has been suggested that the use of ANNs with wavelet regression provides more accurate results compared to models using ANNs, which implies the need for optimization [10]. An alternative to MLPs is Long Short-Term Memory networks, which are a class of recurrent neural networks that have shown promising results in estimating runoff from rainfall. With respect to the problem at hand, the selected neural networks provide a high degree of regression ability. Using recurrent networks, like those used in rainfall runoff modelling [11], would not have a physical meaning, since the relationship between inputs and outputs (daily rainfall values) does not include a temporal delay. Other techniques for filling in missing data in the field of hydrology include K-nearest neighbors (KNN), adaptive neuro-fuzzy inference systems (ANFISs) and random forest regression (RFR) [12][13][14], but these go beyond the scope of this work and could be considered for future research. Regarding the number of inputs, large numbers of different inputs do not guarantee more accurate results. A genetic algorithm can improve the process of selection when aiming for forecasting, but in this work, in order to reduce computational demands and given the nature of the network, another optimization method was chosen [15]. Apart from genetic algorithms as optimization techniques, others exist, such as particle swarm, cuckoo search, and bat-or kidney-inspired algorithms, depending on the level of strictness demanded [16]. In this paper, optimization is achieved through the use of a competitive algorithm in the creation of each ensemble, corresponding to each combination of missing data from the observation stations. Artificial intelligence tools have been implemented in the past in different scientific fields, from filling in spatially and temporally missing data by using augmented interpolation [17] to using photonic neural networks analysis for the changing morphology of an area [18]. In regions with high unpredictability due to extreme weather conditions, ANNs have been successful in forecasting rainfall [19]; given this fact, ANNs might perform even better in regions with strong seasonal patterns and a temperate climate, such as Crete. In large areas with varied topography, proximity of stations does not always guarantee a correlation between observed rainfall values, especially if the stations belong in two different hydrological catchments [1]. In the current case study, the area is hydrologically homogenous with only a small increase in precipitation at higher elevations [20]. In addition, fluctuations between extreme values can be smoothed out by classifying data either spatially [21] or based on intensity [22], which implies training and using multiple ANNs. Multiple ANNs with targeted training working on their own niche outperform an all-purpose ANN trained with the whole data set, with differences being dependent on the physical problem [23]. As hinted previously, multiple ANNs creating an ensemble might outperform a singular one by minimizing the occurrence of local minimums and individual biases [24]. The most simplified approach to composing an ensemble of neural networks is averaging their results using simple or weighted averages. Previous research has also proposed that the structure of the ANN ensemble can itself become the input of a general regression neural network [25]. This technique can exploit the variability of results produced by biased individuals and increase overall accuracy. In addition, it utilizes a full set of ANNs in which there may be individuals that produce errorincreasing results. In order to address this, it is suggested to develop competitive algorithms where ANNs or ensembles are compared to each other and the best-performing method ends up being used for predictions [26]. In the same manner of thinking, elimination of the least significant input variables can be performed in an ensemble by considering the correlation coefficient, which has been mostly applied to climatic variables in forecasting rather than regression-based forecasting [27]. One approach to creating an ensemble with a limited data set is to alternate between training and testing data sets during the training period and eventually average out the ensemble outputs [28]. Another issue arising when working with ANNs, especially ensembles, is the network architecture, since it can greatly impact the performance; in most cases, an optimization algorithm is developed since there is no standard and optimal architecture is defined by trial and error [24]. Finally, one optimization technique which borders on architecture modification is the dropout method which randomly turns off units and their connections during training [29], which shows that random-based optimization might produce adequate results.
This paper aims to develop a methodology to estimate missing daily precipitation values from weather stations. Five weather stations monitoring rainfall in the prefecture of Chania, Greece, were used as a case study. This work focuses on the comparison of ANN ensembles based on multi-layer perceptrons and the more commonly used multiple linear regression (MLR) for completion of time series of daily rainfall data. This way, the results of the ANNs are compared to a technique that is standard practice in the field (MLR) [13]. In this approach, the best ANN from each ensemble imputes the missing data values to end up with a completed dataset for all stations. It is important to state that classification based on different combinations of missing data (henceforth called cases) adds to the accuracy of the model in general, since the ANNs are specialized in each case. This would not be feasible if modeling was done by creating a single ensemble for all stations, or an ensemble for each station. The respective MLR results are calculated as a baseline for comparison.

ANN and MLR Creation
The proposed methodology starts from a dataset with missing rainfall data for some stations and results in two completed datasets from the ANN ensembles and the MLR. The first step of the algorithm is to check every date containing recorded data. If a daily dataset has no missing values, then it is included in the dataset which will be used for training and validation of the ANN ensembles and validation of the MLR model. Otherwise, it is added to the dataset meant for imputing. It is important at this point to state that if a daily dataset has no recorded data at any of the stations, then imputation is unfeasible with the proposed methodology, primarily because completion of the time series occurs on a daily timestep by correlating the missing data with the observed data. In addition, a precipitation event is not dependent on a past precipitation event, and since rainfall is the sole input in this model, it was deemed both unnecessary and accuracy-decreasing to impute the time series by correlating data from datasets that correspond to different dates. This is the reason why the completed time series span from the first recorded dataset up to the current day and not further into the past or future.
The outcomes of the separation are two datasets: a complete and an incomplete one. The full daily datasets will be used for the training and validation of the ANN ensemble. Due to the different cases of missing data, it was deemed necessary to create multiple ANNs (multiple layer perceptron) that are specialized to each case, since inputs and outputs for each case differ, which implies a different topology for each case. The inputs and outputs are always daily rainfall values from weather stations, and for each different combination of missing data, the stations with observed values are used as input nodes and the stations with missing values are used as output nodes. In order to increase the accuracy of the model altogether, for each case an ensemble of 10,000 ANNs with one hidden layer was trained, in which the daily datasets for training and validation were randomly selected from the full set. With the use of a competitive algorithm, only one ANN-the best-performing one, according to its test error value-was selected to give outputs, using MATLAB's ANN tool version 2017b. According to the literature [30,31], one hidden layer is sufficient and might also outperform ANNs with multiple hidden layers when used for regression. The competitive algorithm selects the best-performing ANN based on training error and the results are produced solely based on that ANN. The use of ensembles instead of one single ANN addresses any concerns regarding the reliability, performance, and behavior of the proposed approach. The calibration (training and validation) dataset was 80% of the full available dataset with complete records, and the testing dataset consisted of the remaining 20% for all ANNs. After the training and validation are conducted, the ensembles are ready to complete the time series. Similarly, the MLR functions are created by the training and validation dataset for each case. After both processes have completed the time series, all negative values that are generated are turned into zeroes.
The whole process is graphically represented in Figure 1 below.

Model Evaluation
The validity of the results of both models is verified by the calculation of the correlation coefficients between the target and the simulated value. The value of the Nash-Sutcliffe coefficient is calculated, which can take values from minus infinity to one (−∞ to 1), based on which the validity of the model is determined, with a value of one (1) indicating complete agreement between the simulated values given by the model and those observed by the stations. According to the literature, an NSE index value above 0.7 corresponds to a very good estimation [32]. Finally, the Root Mean Square Error is extracted from the model results in each of the cases considered [32].

Case Study
In the prefecture of Chania, near the northern coast of Crete, there are five automatic weather stations at a relatively close distance (approximately 5 km) to each other, as shown in Figure 2. Regarding the locations of the stations as shown in Figure 2, the overall highest value of rainfall, historically, has occurred at Alikianos station, while the lowest has occurred at Platanias station. Platanias station has the lowest recorded altitude at 12 m, while Alikianos station is located at 95 m. Chania station (137 m) is located at a higher altitude than Alikianos station (95 m). Although it would be expected that a station at a higher altitude has a greater amount of rainfall, it was observed from the data that Alikianos station has a greater amount of rainfall. The reason for this might be that Alikianos station is furthest from the sea compared to all the other stations considered and is situated at the foot of the Lefka Ori. Platanias, on the other hand, is located a short distance from the sea and at a low altitude. Table 1 contains a summary of the recorded daily precipitation values available from the automatic weather station NOANN network [33] (in total 15,040 records). Based on these records, a timeline showing the availability and gaps in the datasets for the study period is shown in Figure 3. In total, 759 days had a complete dataset and were used for calibration and 4689 days had at least one missing value. The recording of the data used in this work starts with the creation of Chania station on 1 February 2006. This means that for the period from 1 February 2006 to 30 September 2010, the available rainfall data originates only from Chania station. As of the next day, on 1 October 2010, when Chania station (Center) was put into operation, the recorded rainfall data come from the two stations previously mentioned. On 1 September 2012 the Alikianos meteorological station was put into function, therefore the recorded rainfall data come from the above three meteorological stations. To continue, on 1 July 2015, the recording of rainfall data from Platanias meteorological station starts, which means that the model input data comes from four stations. Finally, on 1 November 2018, the last station, Stalos, was put into operation. Therefore, for the next period, we have logging data from all five stations until 31 December 2020. It is worth noting that the period of time that a station is in operation is not always the same as the period of time that it records data, as there may be losses due to errors in the measuring instruments, a possible instrument failure, or an extreme weather phenomenon. This is clearly shown in Figure 3 of the paper.

Different Combinations of Stations Missing Data (Cases)
There are five rainfall stations in our study and each one of them has a different installation date, from which point on data are available. In addition, there are periods when, for different reasons (maintenance, power cuts, malfunction), one or more daily values are missing from the time series. The values missing for each day, together with the values available, can be categorized into different cases, in order to organize and group the different dates based on different calculation needs. Figure 4 shows all the possible combinations of stations having or missing a daily record. By having all the possible cases identified, the algorithm is able to create ensembles for cases that have not occurred yet.
In the full, observed dataset, 9 cases occur out of a total of 32 that were theoretically possible. Specifically, the included cases are Case 2, Case 3, Case 6, Case 11, Case 14, Case 15, Case 22, Case 24, and Case 29. In three cases, namely Cases 2, 3 and 6, one station had a missing value; in three other cases, namely Cases 11, 14 and 15, two stations had missing values; in two cases, Cases 22 and 24, three stations had missing values; and in the last case, Case 29, four stations had missing values. The numbering of each case is not derived from the numerical order, but from the corresponding case, as shown in Figure 4. For example, in Case 2 the input precipitation data are the values from Chania, Chania (Center), Platanias and Stalos stations, and the output is the precipitation value for Alikianos station.

Results
After completing a full run of the algorithm built using the proposed methodology, the incomplete time series of each station receives model-generated data for the full period in which at least one of the five stations has an observed value. In the following charts (  To compare the two methods, three different metrics were used, the root mean square error, the Nash-Sutcliffe efficiency coefficient, and the correlation coefficient. The results are shown on a per-case basis, as the two methods might show different sensitivities to missing data. Comparative tables at the end of each section summarize the results of the testing dataset. Concerning the computational effort and time needed for the two methods, the ANN did take a considerably large amount of time to optimize its structure (almost 36 h on a PC (Personal Computer) with an Intel i7 8th generation processor). On the other hand, the MLR was significantly faster, and required only a few minutes to run.

Root Mean Square Error (RMSE)
The RMSE index indicates the deviation between the observed and simulated values and indicates whether the data are clustered around the line of best fit. The models calculate the Root Mean Square Error (RMSE) for each of the cases considered. Regarding the Artificial Neural Network model, the best value is presented in Case 15 and shows an error equal to 1.16 mm, while the worst value is presented in Case 29, with an error value equal to 2.42 mm. The corresponding results of the Multiple Linear Regression model are shown in Case 3 with a value of 2.37 mm and Case 2 with a value of 6.43 mm. Overall, the Artificial Neural Network model shows lower errors, ranging from 42% to 72.6%, compared to the Multiple Linear Regression model.
The following Table 2 contains all the above results aggregated as follows:

Nash-Sutcliffe Efficiency
The Nash-Sutcliffe coefficient can take values from minus infinity to one (-∞ to 1), where for these values the following applies: Similarly, the Nash-Sutcliffe Efficiency values for all cases are presented in the following Table 3 which contains all the results in an aggregated way: The results show a clear increase in the performance of the Nash-Sutcliffe efficiency when using the ANN instead of MLR. The ANN's performance was also higher when fewer stations were available compared to its MLR counterpart, which had a declining performance especially when one or two stations were available. It is also clear that there is a great correlation between the Chania (Center) and Chania stations, so when one is available, the results for the other are always very good. This is confirmed by the results of Case 3 where only Chania station is missing and from the results of Cases 15, 24 and 29, where station Chania is available, and Chania (Center) is missing.

Coefficient of Correlation (R)
The Coefficient of Correlation (R) indicates the proportion of variance of the dependent variable derived from the independent variable. A value of one (1) is the maximum value the coefficient can take, which indicates that there is a complete match between the two compared values.
Regarding the calculation of the Correlation Coefficient (R) for each case, the Artificial Neural Network model shows higher overall values ranging from 5.4% to 29.7%. More specifically, the best value of the above coefficient for the Artificial Neural Network model is presented in Case 15, with a value of 0.99274, while the worst value is presented in Case 6, with a value of 0.93957. The corresponding results for the Multiple Linear Regression model appear in Case 3, with a value of 0.93740 and in Case 29, with a value of 0.74782. Similarly, the Coefficients of Correlation for each case are presented in the following Table 4 which contains the aggregated results:

Discussions
This work developed and compared two models for the simulation of precipitation values, which simulated and accurately completed five time series of precipitation data from five meteorological stations in the region of Chania, Crete. The first model was developed using an Artificial Neural Network ensemble approach (similar to other previously published works [6,10,27]), while the second model was developed using the Multiple Linear Regression method, both in a MATLAB environment.
It is observed that the four meteorological stations that are relatively close to the sea, while at the same time are relatively close to each other (Chania, Chania (Centre), Platanias and Stalos), show similar results for their total rainfall values ( Figure 5). From a hydrological standpoint, both models present results that are in accordance with the theoretical expectations; the simulated values at the weather stations near the seafront are always lower when compared to those of stations at higher altitude. In addition, there is a small decline in the precipitation values along the west to east axis, which is expected since most of the water load in the clouds is released when they reach the coastal fronts coming from the Western Mediterranean.
Looking at the ANN results, a couple of simulated values might draw the attention of the reader as being exceptionally high and possibly outliers (e.g., October 2006 and January 2019). Nevertheless, the scientific literature and the observed values from already installed stations confirm that these were months with extreme rainfall events, confirming the plausibility of these simulated values. In October 2006, extreme rainfall events occurred throughout the study area, leading to flooding in the city of Chania, serious material damages and one casualty [20]. At that time, the only installed and operating station was the one in Chania, which had a very high observed value of 214.6 mm, one of the highest ever recorded. For the same month, the simulated precipitation value for Alikianos station is 345 mm based on ANNs, while the corresponding value using the MLR method is 194 mm. These values, although they seem quite high for the area concerned, are in accordance with the value observed in Chania. In January and February 2019, other extreme rainfall events occurred with similar results. In 2019, all weather stations were operational, but there was a 10-day gap in the beginning of January in Alikianos station, possibly because of device failure due to the extremity of the rainfall events. Regarding the month of January 2019, the simulated precipitation value for Alikianos station is 692 mm based on ANNs, while the corresponding value using the MLR method is 362 mm. The extremity of those values is confirmed by the literature, while the events continued in February with the Chionis and Oceanida storms [20]. The seemingly high simulated value for January is confirmed by the observed values in February at all weather stations. In Figure 5, the observed values in February are significantly high, with the highest value recorded at Alikianos station (568.8 mm in total) and the next highest value at Chania station (360 mm in total). Based on the above, we conclude that the simulated values for Alikianos are plausible and do not consider them as outliers. Comparing the two models, the results of the ANN model show that it is more capable of simulating extreme weather values compared to the model obtained with the MLR method.

Conclusions
Both methods have proven more than adequate for the task of imputation of gaps in the daily rainfall time series. The Nash-Sutcliffe coefficient for both methods is above 0.7 for all cases, a value generally considered as the threshold for very good performance. Nevertheless, throughout this work, the Artificial Neural Network ensembles consistently outperformed the Multiple Linear Regression model. The obvious caveat is the increased time needed for training the ANN model. When comparatively small datasets are available for training (like in this work), the computational effort for training the ANN ensembles is also relatively small (taking just over thirty-six hours). In such cases using ANNs might make more sense, always considering the urgency of the application. In cases where the available dataset is large the training time is expected to increase, but the results will probably be better than those obtained with Multiple Linear Regression. A decision should be made as to whether accuracy or speed is more important. For increased accuracy, the results of this study suggest using ANNs, for increased speed, the results point to using Multiple Linear Regression. Given the good performance of the ensembles in this work, future work can focus on testing different activation functions like the reLU and tanhLU [34].