Filling Gaps in Daily Precipitation Series Using Regression and Machine Learning in Inter-Andean Watersheds

Abstract: As precipitation is a fundamental component of the global hydrological cycle that governs water resource distribution, the understanding of its temporal and spatial behavior is of great interest, and exact estimates of it are crucial in multiple lines of research. Meteorological data provide input for hydroclimatic models and predictions, but the available series are generally incomplete. Many studies have addressed techniques to fill gaps in precipitation series at annual and monthly scales, but few have provided results at a daily scale, due to the complexity of orographic characteristics and, in some cases, the non-linearity of precipitation. The objective of this study was to assess different methods of filling gaps in daily precipitation data using regression model (RM) and machine learning (ML) techniques. RM included linear regression (LRM) and multiple regression (MRM) algorithms, while ML included multiple regression (ML-MRM), K-nearest neighbors (ML-KNN), gradient boosting trees (ML-GBT), and random forest (ML-RF) algorithms. This study covered the Mala, Omas, and Cañete River (MOC) watersheds, which are located on the Pacific Slope of central Peru, and a nineteen-year period of records (2001–2019). To assess model performance, different statistical metrics were applied. The results showed that the optimized machine learning (OML) models presented the least variability in estimation errors and the best approximation to the actual data from the study zone. In addition, this investigation shows that ML can interpret and analyze non-linear relationships between rain gauges at a daily scale and can be used as an efficient method of filling gaps in daily precipitation series.


Introduction
Precipitation is a fundamental component of the hydrological cycle that governs water resource distribution [1]. The hydrological cycle of a given region is directly related to its topography, geology, physical mechanisms, and climate, with precipitation being the most important phenomenon [2,3]. Due to its high spatiotemporal variability and the large number of interconnected variables involved, precipitation is one of the most difficult atmospheric variables to characterize, estimate, and forecast [4], especially on a daily scale [5]. The understanding of the temporal and spatial behavior of precipitation is of great interest, especially in studies on climatic risks [6]. In addition, exact estimates of precipitation are fundamental in multiple lines of research, as they serve as the basis for statistical models and analysis [7,8].
The structure of the paper is organized as follows. Section 2 presents the study locations, the dataset, the theoretical background of the different machine learning (ML) models evaluated, and the preprocessing algorithms and evaluation metrics, in addition to the quality control of the dataset by homogenization and regionalization. Then, in Sections 3 and 4, the results are reported and discussed, respectively. Finally, Section 5 describes the conclusions reached in this work.

Study Area
The study area comprised the Mala, Omas, and Cañete River (MOC) watersheds (Figure 1), located in the central part of the Peruvian Pacific Slope and coast; its total area is 9496 km² (2250, 1167, and 6079 km² for the Mala, Omas, and Cañete basins, respectively). The area is characterized by a significant altitudinal gradient that goes from 0 to 6500 masl; above 2500 masl lies the wet watershed area [32]. The rivers flow from east to west, from the Andes to the Pacific Ocean, with bare, steep slopes that favor significant swelling, floods, and erosion during heavy rainfall episodes [9]. In addition, under normal conditions, the region is influenced by the South Pacific High in combination with the Humboldt Current, which produces dry, stable conditions with moist air trapped below the inversion layer at about 1000 masl, and it presents major seasonal and interannual precipitation variability [9,11,13].

Methods
The methodology has four stages; Figure 2 shows the methodological diagram. The first stage is the collection of available daily precipitation information from within and near the study area. The second stage is the exploratory analysis and homogenization of rainfall data. The third stage is the regionalization process, which includes the use of the Ward, K-means, and regional vector analysis methods (RVM). Finally, the fourth stage consists of the filling of gaps in daily precipitation series using the RM and ML methods. In addition, the performance of each model was evaluated using metrics.



Collection of Available Information
A total of 17 rainfall stations were selected, some with records since 1965, others since 1980, all of which had periods with irregular records. The stations are part of the network managed by the National Meteorology and Hydrology Service of Peru (SENAMHI, https://www.senamhi.gob.pe/mapas/mapa-estaciones/mapadepesta1.php accessed on: 5 March 2020). Stations located outside the study region and those inactive during the selected period were discarded. Similarly, there are rainfall stations with more than 10% missing (empty) data relative to the total length of the analyzed series. Figure 1 shows the spatial locations of the rainfall stations, and Table 1 shows the geographic coordinates, quantity of observed data, and quantity of missing data. Applying quality-control procedures to the raw rainfall data, by means of graphs and the homogenization of the time series, is an essential requirement to guarantee reliable results, as it allows observation and measurement errors to be detected, following the recommendations of Estévez et al. [31]. This process was carried out in two phases. First, a time series graph and boxplot allowed the identification of missing values and outliers; this step was performed in Python. Second, in order to detect inconsistencies at the stations, which could stem from a change in instrument location, variations in the conditions at the measurement site, or an observer change, the data were analyzed using the standard normal homogeneity test (SNHT), as described by [33][34][35].
The SNHT was developed by [36] and modified by [37,38]; it uses Y to denote the candidate series and Y_i to denote a specific value (for example, cumulative annual precipitation or mean annual temperature) in year (or other unit of time) i. In addition, X_j denotes one of the surrounding reference sites (the jth of a total of k) and X_{ji} a specific value from that site. The following equations, traditionally used in precipitation studies, were applied to detect relative non-homogeneities:

Q_i = Y_i \Big/ \left[ \frac{\sum_{j=1}^{k} \rho_j^2 \, (X_{ji}/\overline{X}_j)\, \hat{Y}}{\sum_{j=1}^{k} \rho_j^2} \right]    (1)

Q_i = Y_i - \frac{\sum_{j=1}^{k} \rho_j^2 \,\left(X_{ji} - \overline{X}_j + \hat{Y}\right)}{\sum_{j=1}^{k} \rho_j^2}    (2)

where Q_i is the ratio (Equation (1)) or the difference (Equation (2)) in a specific year i; \hat{Y} represents the multi-annual mean of the candidate time series; \overline{X}_j is the multi-annual mean of reference series j; and \rho_j is the correlation coefficient between the test variable Y and the reference variable X_j [36,38,39]. This method is implemented in the Climatol package for the R language [34]. Climatol offers three normalization methods: division by mean values, subtraction of means, and complete standardization; here, we opted for subtraction of means, as the minimum precipitation values can be zero [34,40,41]. On a preliminary basis, Climatol was run at a monthly time step, identifying breaks; based on these breaks, Climatol was run again at a daily time step. The results include graphs of absolute maximum autocorrelation (ACmx), SNHT, root mean square error (RMSE), and percentage of original data (POD).
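The shift-detection core of the SNHT can be sketched in a few lines: the series under test (in practice, the ratio or difference series Q_i) is standardized, and for every candidate break position a statistic built from the means of the two sub-series is computed; the position maximizing it is the most likely break. The synthetic series below is an illustrative assumption; Climatol performs this test, together with the referencing against neighboring stations, internally.

```python
import numpy as np

def snht(series):
    """SNHT statistic T_k for every candidate break position k.

    Returns (T, k_max): the T_k values and the index of the most likely
    break; larger T_k indicates a more pronounced shift in the mean at k.
    """
    x = np.asarray(series, dtype=float)
    n = x.size
    z = (x - x.mean()) / x.std(ddof=1)   # standardized anomalies
    T = np.full(n, np.nan)
    for k in range(1, n):                # candidate break after position k
        z1 = z[:k].mean()
        z2 = z[k:].mean()
        T[k] = k * z1 ** 2 + (n - k) * z2 ** 2
    k_max = int(np.nanargmax(T))
    return T, k_max

# Synthetic annual series with an artificial mean shift at position 120,
# standing in for an inhomogeneity such as a station relocation.
rng = np.random.default_rng(42)
x = rng.normal(5.0, 1.0, 240)
x[120:] += 3.0
T, k_break = snht(x)
```

In practice the statistic is compared against tabulated critical values before a break is accepted; here the sketch only locates the candidate break.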

Regionalization Process
This section describes the regionalization process, which was performed using three methods. In the first method, Ward's hierarchical clustering analysis was applied. This method is also known as "minimum variance" grouping: the objective function of Ward's algorithm [42] minimizes the sum of squared deviations of the attribute vectors from the centroid of their respective groups; instead of merging samples or clusters as a function of distance, it starts by assigning "zero variance" to all clusters. This method was applied to ascertain the preliminary clustering of the stations [43]. This process was carried out by programming in the R language.
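As an illustration, Ward clustering of station attribute vectors can be reproduced with scipy (here in Python rather than R). The two-attribute station vectors below are hypothetical stand-ins for the attributes actually used in the study.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical feature vectors for six stations: (mean annual precipitation
# in mm, elevation in km). In the study these attributes would come from the
# observed series and station metadata.
stations = np.array([
    [150.0, 0.2], [160.0, 0.3], [155.0, 0.25],   # dry, low-elevation group
    [650.0, 3.1], [700.0, 3.4], [680.0, 3.2],    # wet, high-elevation group
])

# Ward linkage merges, at each step, the pair of clusters whose union yields
# the smallest increase in total within-cluster variance.
Z = linkage(stations, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 groups
```

Cutting the dendrogram at a chosen number of groups (here 2) yields the preliminary station clusters.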
In the second method, non-hierarchical K-means clustering (KM) was applied, which is a statistical technique designed to assign objects to a fixed number of clusters according to a set of specified variables [11,44]. It consists of obtaining a partition that minimizes intraclass inertia. This is achieved locally (it depends on the initial points) using the Euclidean distance between individuals and the moving centers used for aggregation. The KM algorithm is an iterative procedure in which the attribute vectors move from one group to another to minimize the value of the objective function, F, defined in Equation (3):

F = \sum_{k=1}^{K} \sum_{i=1}^{N_k} \sum_{j=1}^{M} \left( y_{ij}^{k} - y_{\bullet j}^{k} \right)^2    (3)

In Equation (3), K indicates the number of groups; N_k represents the number of attribute vectors in group k; M is the number of attributes; y_{ij}^{k} denotes the rescaled value of attribute j in attribute vector i assigned to group k; and y_{\bullet j}^{k} is the mean value of attribute j for group k (Equation (4)) [43,45]:

y_{\bullet j}^{k} = \frac{1}{N_k} \sum_{i=1}^{N_k} y_{ij}^{k}    (4)
However, one of the problems encountered when applying the KM method lies in choosing the number of clusters. Although there is no single criterion for choosing the number of clusters, here we used the elbow method, implementing it by programming in R language.
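A minimal sketch of the elbow procedure (here with scikit-learn rather than R, and with synthetic attribute vectors as an assumption): the K-means objective F, reported by scikit-learn as the inertia, is computed for increasing k, and the k at which the curve flattens is taken as the number of regions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical attribute vectors drawn around four distinct centers,
# standing in for the rescaled station attributes used in the study.
centers = np.array([[0, 0], [8, 0], [0, 8], [8, 8]])
X = np.vstack([c + rng.normal(0, 0.5, (30, 2)) for c in centers])

# Elbow method: total within-cluster inertia (the K-means objective F) for a
# range of k; the "elbow" where the curve flattens suggests the cluster count.
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)
```

With four well-separated groups, the inertia drops sharply up to k = 4 and only marginally afterwards, which is the elbow.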
Finally, the regional vector method (RVM), described by [10,11,44], was the third to be applied, in order to corroborate the previously obtained results. It is aimed at the homogenization and completion-extension of precipitation data [46,47] and is based on the creation of a fictitious "average value" station (vector) from all the stations in the zone. This concept refers to the calculation of a weighted average of rainfall anomalies for each station, overcoming the effects of stations with extreme and low rainfall values and the problems associated with the weight of the rainiest stations relative to the least rainy ones.
This method applies least squares to find the annual regional rainfall indices Z_i and the extended mean precipitation P_j, which is achieved by minimizing the expression [10,11,45]:

\sum_{i=1}^{N} \sum_{j=1}^{M} \left( \frac{P_{ij}}{P_j} - Z_i \right)^2    (5)

where i is the index of the year; j is the index of the station; N is the number of years; M is the number of stations; P_{ij} is the annual precipitation at station j in year i; P_j is the mean precipitation extended to a period of N years; and Z_i is the regional rainfall index of year i. This process was carried out using the Hydraccess program [48].
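The minimization in Equation (5) can be sketched with a simple alternating least squares loop: holding the station means P_j fixed gives a closed form for each Z_i, and holding Z fixed gives one for each P_j. The data below are synthetic, built as an exact product z_i · p_j, so the sketch recovers the indices up to the usual scale convention (mean index equal to 1); Hydraccess implements the full method, including its treatment of missing years.

```python
import numpy as np

def regional_vector(P, iters=50):
    """Alternating least squares sketch of the regional vector method.

    P is an (N years x M stations) matrix of annual precipitation. Returns
    the regional indices Z_i (scaled to mean 1) and the station means P_j,
    found by minimizing sum_ij (P[i, j] / Pbar[j] - Z[i])**2.
    """
    Pbar = P.mean(axis=0)                     # initial station means
    for _ in range(iters):
        Z = (P / Pbar).mean(axis=1)           # optimal Z_i for fixed Pbar
        Pbar = (P ** 2).sum(axis=0) / (Z @ P) # optimal Pbar_j for fixed Z
        Z /= Z.mean()                         # scale convention: mean index 1
    return Z, Pbar

# Synthetic check: data built as an exact product z_i * p_j is recovered.
rng = np.random.default_rng(1)
z_true = rng.uniform(0.6, 1.4, 15)
z_true /= z_true.mean()
p_true = rng.uniform(200.0, 900.0, 6)
P = np.outer(z_true, p_true)
Z, Pbar = regional_vector(P)
```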

Gap-Filling Model
In this stage of the study, the results from the regionalization process were used, and the RM and ML techniques were applied for each homogenous region. The daily precipitation series were graphed for each homogenous region, allowing the dates with missing data to be identified. In addition, the intensity of the relationships between stations was analyzed using Pearson correlation coefficients [29,30].
To apply the LRM and MRM techniques, in both cases, target stations (Y) were identified as the variables to predict. Predictor stations (X) were identified for LRM and multiple predictor stations (X_m) for MRM. LRM is a computing procedure based on the alternating least squares (ALS) algorithm [49]. It has two steps: first, estimating the relationship between the predictors and the missing values; then, using the trend equation to fill the gaps [50], in accordance with Equation (6):

y = a + b x    (6)

The values of a and b can be estimated using Equations (7) and (8), respectively:

a = \bar{y} - b \bar{x}    (7)

b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}    (8)
where \bar{y} and \bar{x} are the mean values of the data series of the reference and similarity stations, respectively [50,51]. Meanwhile, MRM is a statistical technique that consists of finding a linear relationship between a dependent variable and more than one independent variable. It can be represented using the following equation:

Y_i = a + b_1 X_1 + b_2 X_2 + \dots + b_m X_m + C    (9)

where Y_i is the dependent variable; X_1, X_2, ..., X_m are the independent variables; a is the intercept; b_1, b_2, ..., b_m are the multiple regression coefficients, estimated using the method of least squares; and C is the error term [50,51].

ML is a scientific discipline in the artificial intelligence field that creates systems that learn automatically [8,14]. For gap filling using this technique, the data available at each station were divided randomly to generate a training dataset (train) and a test dataset (test) in proportions of 75% and 25%, respectively [8]. The algorithms implemented were MRM, K-nearest neighbors (KNN), gradient boosting trees (GBT), and random forest (RF). In addition, an optimization process was carried out, generating the OML-MRM, OML-KNN, OML-GBT, and OML-RF models. These algorithms were implemented using the Python programming language.

KNN is a non-parametric method that can be used for both classification and regression. The result is calculated by weighting a number of nearest neighbors in the attribute space according to a distance function; the most common is the Euclidean distance for continuous data [8]. GBT is a method in which multiple decision trees are iteratively fit to the data, each tree building on the previous one to reduce losses and improve performance. It is based on the boosting principle, that is, on the creation of a set of weak learners to improve prediction precision [8,52].
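A minimal sketch of the ML gap-filling setup described above, using scikit-learn. The station matrix, the rainfall-like generator, and the default model settings are illustrative assumptions, not the study's configuration; in a real run, X would hold the daily records of the predictor stations (X_m) on days when all stations report, and the fitted model would then predict the target station's missing days.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Synthetic stand-in for one homogeneous region: four predictor stations
# (columns of X) and one target station (y), with skewed, mostly-dry values.
rng = np.random.default_rng(7)
X = rng.gamma(0.4, 5.0, (2000, 4))
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.2, 2000).clip(0)

# 75% / 25% random split, as in the study.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=0)

models = {
    "ML-MRM": LinearRegression(),
    "ML-KNN": KNeighborsRegressor(n_neighbors=5),
    "ML-GBT": GradientBoostingRegressor(random_state=0),
    "ML-RF": RandomForestRegressor(random_state=0),
}
# R^2 on the held-out test set for each candidate model.
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
```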
This method has three advantages: first, it does not require the application of a direct physical model; second, it serves as a computationally feasible method of capturing complex non-linear interactions between variables and a response [52,53]; and finally, it presents almost no overfitting problems, which is an important advantage, as many models over- or underestimate results [14,52,53]. RF was proposed by [54]. It is a non-parametric algorithm in the decision tree family that combines a set of uncorrelated trees to produce predictions for classification and regression tasks [55].

Bayesian Optimization
One of the critical aspects of machine learning model efficiency is hyperparameter selection. Establishing the correct values is very important, since performance can change drastically from excellent to very poor. A common practice in the scientific community is a trial-and-error technique in which different values, ranging from tens to thousands of possibilities, are evaluated [23,31]. Therefore, setting the hyperparameter space efficiently is essential: if the space is too ample, the algorithm wastes significant time on non-promising configurations and becomes very slow; on the other hand, when the space is too small, an accurate hyperparameter configuration may be missed, even though the search is fast [23,31].
Bayesian optimization was used to estimate the hyperparameters due to its great popularity in machine learning and its good performance in optimization [56,57]. The procedure consists of four steps, as described by [23]: (1) define the hyperparameter space; (2) considering previous evaluations, choose the next set of values to be evaluated (acquisition function); (3) assess the new hyperparameter configuration using an objective function; and (4) if the optimization process has not yet finished, return to step (2). In this work, this algorithm was implemented in Python.
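The four steps can be sketched with a minimal Bayesian optimization loop: a Gaussian process surrogate is fit to the evaluations made so far, and an expected improvement acquisition picks the next configuration to try. The one-dimensional toy objective is an assumption for illustration; in the study the objective would be a model's validation error over its hyperparameter space.

```python
import math
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Toy objective standing in for a model's validation error as a function of
# one hyperparameter x (e.g., a regularization strength); minimum at x = 2.
def objective(x):
    return (x - 2.0) ** 2

def expected_improvement(mu, sigma, best):
    """EI acquisition for minimization, using only the math module."""
    sigma = np.maximum(sigma, 1e-9)
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (best - mu) * cdf + sigma * pdf

# Step 1: define the hyperparameter space (here a 1-D grid of candidates).
space = np.linspace(-4.0, 8.0, 241).reshape(-1, 1)

rng = np.random.default_rng(3)
X_eval = rng.uniform(-4.0, 8.0, (3, 1))          # a few initial evaluations
y_eval = objective(X_eval).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5),
                              alpha=1e-6, normalize_y=True)
for _ in range(15):
    gp.fit(X_eval, y_eval)                       # surrogate of the objective
    mu, sigma = gp.predict(space, return_std=True)
    # Step 2: pick the next configuration by maximizing the acquisition.
    x_next = space[np.argmax(expected_improvement(mu, sigma, y_eval.min()))]
    # Step 3: evaluate the objective at the chosen configuration.
    X_eval = np.vstack([X_eval, x_next])
    y_eval = np.append(y_eval, objective(x_next[0]))
    # Step 4: loop until the evaluation budget is exhausted.

x_best = X_eval[np.argmin(y_eval), 0]
```

Because the acquisition balances the surrogate's predicted value against its uncertainty, the loop concentrates evaluations near the promising region instead of sampling the space uniformly.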

Evaluation Metrics
To assess the efficiency of the developed models, the coefficient of determination (R²), root mean square error (RMSE), Nash-Sutcliffe efficiency (NSE), and percentage bias (PBIAS) were used [8,51,58]. They are expressed mathematically as Equations (10)-(13), respectively:

R^2 = \left[ \frac{\sum_{i=1}^{n} \left(P_{obs,i} - \overline{P}_{obs}\right)\left(P_{pred,i} - \overline{P}_{pred}\right)}{\sqrt{\sum_{i=1}^{n} \left(P_{obs,i} - \overline{P}_{obs}\right)^2 \sum_{i=1}^{n} \left(P_{pred,i} - \overline{P}_{pred}\right)^2}} \right]^2    (10)

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(P_{obs,i} - P_{pred,i}\right)^2}    (11)

NSE = 1 - \frac{\sum_{i=1}^{n} \left(P_{obs,i} - P_{pred,i}\right)^2}{\sum_{i=1}^{n} \left(P_{obs,i} - \overline{P}_{obs}\right)^2}    (12)

PBIAS = 100 \times \frac{\sum_{i=1}^{n} \left(P_{obs,i} - P_{pred,i}\right)}{\sum_{i=1}^{n} P_{obs,i}}    (13)

where n represents the number of prediction days; P_{obs,i} is the measured value for a specific day i; P_{pred,i} is the corresponding predicted value; and \overline{P}_{obs} and \overline{P}_{pred} correspond to the average measured and predicted values, respectively.
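The four metrics follow directly from Equations (10)-(13); a small sketch, where the sample series are illustrative:

```python
import numpy as np

def r2(obs, pred):
    """Squared Pearson correlation between observed and predicted values."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.corrcoef(obs, pred)[0, 1] ** 2)

def rmse(obs, pred):
    """Root mean square error of the predictions."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((obs - pred) ** 2)))

def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 is perfect; values <= 0 mean the
    observed mean predicts no worse than the model."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(1.0 - np.sum((obs - pred) ** 2)
                 / np.sum((obs - obs.mean()) ** 2))

def pbias(obs, pred):
    """Percentage bias; with this sign convention, positive values indicate
    overall underestimation by the model."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(100.0 * np.sum(obs - pred) / np.sum(obs))

obs = np.array([0.0, 2.0, 5.0, 1.0, 0.0, 8.0])    # illustrative daily values
pred = np.array([0.2, 1.8, 4.5, 1.1, 0.0, 7.6])
```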

Analysis of Missing Data, Outliers, and Homogenization
In Figure 3, the bar graph shows the quantity of unavailable precipitation data by station; three stations have more than 10% missing data (Cañete, Socsi, and Pacaran), while the remaining stations present less than 10% missing data. The Cañete, Socsi, and Pacaran rainfall stations are located in the lower part of the basin, which is characterized by being dry almost all year round (less than 20 mm/year).

Figure 4a shows the boxplots for the daily precipitation series of each station; these contain a large number of scattered values, which could initially be considered outliers. However, it should be taken into account that daily precipitation shows high temporal and spatial variability patterns. Figure 4b shows the boxplots at a monthly scale, showing smaller dispersions and probably fewer outliers, reflecting less spatial and temporal variability.

Figure 5a shows the results of the Pearson correlation coefficients; high spatial and temporal variability on a daily scale is observed, complicating the detection of inhomogeneities. The high variability of daily records compared to that of monthly or annual values makes it very difficult to directly apply methods for identifying inhomogeneities at the daily scale; in accordance with the recommendations of [34], the homogenization process was performed at a monthly scale, at which it is possible to detect cutoffs or breakpoints. Once the breakpoints were identified, the homogenization process was carried out on a daily scale using the Climatol package in R (https://cran.r-project.org/web/packages/climatol/index.html accessed on: 5 May 2020) [34,59]. The results in Figure 5a,b show the correlation between the original normalized series and the reference series obtained based on the other stations. The reference series was constructed based on the average value of the nearest stations, weighted by the inverse of the distance from the analysis station [34,39,41]. The daily-scale correlation results present a maximum value of 0.40 and a minimum below zero (Figure 5a); the monthly-scale results reach values close to 1.0 (Figure 5b). This analysis was carried out for all the stations.

Climatol provided overall absolute maximum autocorrelation (ACmx), SNHT, root mean square error (RMSE), and percentage of original data (POD) results. The ACmx values are not significant until the third quartile of the series (0.34); the values are below 60% autocorrelation, which indicates that the series are non-seasonal (Figure 6a). The series present anomalies in SNHT values between the original and homogenized series; the values range from 9.10 to 80.90, with the exception of the Cañete station, which reaches a maximum of 228, creating a rather wide variation spectrum (Figure 6b). RMSE presents high variation, with a minimum value of 1.26 and a maximum of 4.79 (Figure 6c). Finally, POD, which compares the original and homogenized data series, presents high values, meaning that the original data available are of good quality (Figure 6d). In addition, results of the analysis of homogeneity by station were obtained (Table 2). The Cañete, Socsi, and Pacaran stations presented ACmx values above 0.60, SNHT values above 90.0, and POD values above 10%; only RMSE presented low values.

Regionalization Analysis
The results provided by the Ward method showed four groups of regions for the 17 stations (Figure 7). This process, implemented in R code, allowed the initial clustering to be ascertained. Clustering analysis with KM is a method that creates clusters that are as distinct from one another as possible; that is, the objects in the k clusters must be as similar as possible to the other objects in their own cluster and as unlike as possible the objects in other clusters [11]. A fundamental point in the application of KM is ascertaining the optimum number of clusters. There are many criteria for choosing the optimal number of clusters; for this study, the elbow method (EM) was used due to its extensive application, with good results, in diverse hydrological studies. The optimal cluster or region value is shown in Figure 8. According to the EM analysis, the optimal number of regions was four. In addition, KM was used to define the stations belonging to each homogenous region (Table 3).

The results obtained with Ward and KM indicate that precipitation during the evaluated period was not similar at every station throughout the watersheds. The application of the Ward and KM methods was performed using code written in R, and EM was implemented using code written in Python.

Finally, the RVM method was applied to validate the results obtained based on the described models. The Hydraccess program was used to apply RVM (https://hybam.obs-mip.fr/es/hydraccess-3 accessed on: 5 May 2020). The results show clustering of stations with similar behaviors in terms of interannual precipitation variation, taking the standard deviation and correlation coefficient/vector as indicators. The regions are considered homogenous if the values of the standard deviation (SD) are lower than 0.4 and the correlation coefficient/vector values are above 0.7 [11]. The final results show the clustering of rainfall stations into homogenous regions.

The RVM method was used to obtain three final clusters that, in accordance with their statistics and analysis of the results, included the stations shown in Table 3, Tables S1 and S2 (Supplementary Material), Figure 9, and Figures S1 and S2 (Supplementary Material). It was not possible to analyze cluster 3, as its stations presented a high percentage of missing data. The results obtained with the Ward and KM methods are consistent in the number of homogenous regions. However, the number of stations in Regions 1 and 2 presented a slight discrepancy between the results obtained with Ward and KM; therefore, the results obtained with RVM were used for verification, showing accord between the KM and RVM results. Table S3 (Supplementary Material) shows the final results of the homogenous region clustering. In addition, Figure 10 shows the regionalization of rain gauge stations based on the KM and RVM results.

Analysis of the Series Gap-Filling Process
In Table 4 and Tables S4 and S5 (Supplementary Material), the correlation values corresponding to the stations clustered by homogenous region are shown. The correlations were between 0.38 and 0.60 in Region 1, between 0.32 and 0.58 in Region 2, and between 0.37 and 0.45 in Region 4. These coefficients are considered acceptable given the dry conditions, with more than 90% of the rain gauge records close to zero throughout the year due to the hydroclimatic conditions, and with any value greater than zero causing high variability [11]. Therefore, this analysis allowed the level of representation by Pearson correlation coefficients within a region to be highlighted.

In the application of RM for filling missing precipitation data, the models were generated based on the homogenous regions. For the LRM algorithm, the Ayaviri station was designated as variable Y, and the Huancata station was designated as variable X based on the greater Pearson correlation coefficient value. The other stations with missing data were selected in a similar manner. Table 5 shows the Y and X variables for each region. Meanwhile, for MRM, the procedure was similar to that of the previous case; the Ayaviri station was identified as the Y variable, and all the remaining stations (Huancata, Langa, San Lazaro de Escomarca, Huañec, Huarochiri, and Carania) were identified as X_m (see Table 5).

Table 5. Identification of target stations (Y) and predictor stations by homogenous region.

For the application of the different gap-filling algorithms, this process was carried out independently in each homogenous region. The Y stations with missing data in each homogenous region were identified, as were the X_m stations corresponding to each target station. The Y and X_m stations for each homogenous region are shown in Table 5.

[Table 5 header: Regions | Target Station (Y) | Predictor Station (X) | Multiple Predictor Stations (X_m)]
For the filling of missing precipitation data with ML, the analysis was also carried out independently in each homogenous region. First, the available data were divided randomly, with one portion for training and another for testing (75% and 25%, respectively). Then, the ML-MRM, ML-KNN, ML-GBT, and ML-RF models were selected along with their respective parameters (see Table 6). In addition, considering that many models contain parameters that cannot be learned from the training data, it was necessary to carry out an optimization process. To this end, the hyperparameter values were ascertained using the Bayesian optimization method. The results of a model can depend largely on the values taken by its hyperparameters; however, it cannot be known beforehand which values are suitable. The most common means of finding optimal values is testing different possibilities; in this study, optimization processes were carried out for the OML-MRM, OML-KNN, OML-GBT, and OML-RF algorithms (Table 7).

Based on the Y and X_m variables, ML was first applied with default parameter values using the ML-MRM, ML-KNN, ML-GBT, and ML-RF models. It was then applied with tuned hyperparameters, generating the OML-MRM, OML-KNN, OML-GBT, and OML-RF models. This process allowed the model parameters to be optimized. The parameter and hyperparameter values used in the algorithms created in Python are shown in Table 6. It can be observed that only one parameter value was assigned when using the default algorithm; for the algorithm optimization process, however, a wide range of values was defined, and the optimal hyperparameters were estimated using the Bayesian optimization method.

Assessment of Model Performance
To assess the performance of the models, different statistical metrics (R2, RMSE, NSE, and PBIAS) were used for both datasets (training and test). The obtained results are presented in Table 7 and Tables S6 and S7 (Supplementary Material). These statistics were calculated for the 2001-2019 period; periods with missing data were not considered.
Many models, among them LRM, contain parameters that cannot be learned from the training data, making it necessary for the modeler to establish them. In addition, assessing the predictive capacity of ML, that is, how close its predictions are to the actual values of the response variable, requires a set of observations, with their corresponding response values, that the model has not "seen", i.e., that did not participate in its initial fitting. Statistical metrics comparing predicted and actual precipitation values are therefore essential for assessing model performance. The R2 values for the training and test datasets reflect the correlation between the Y and X variables in each model. For the Ayaviri station, which belongs to homogeneous region 1 (see Table 7), the ML-RF model gives the best R2 value (0.89) for the training data; however, for the test dataset, this value is reduced by nearly half (R2 = 0.45), suggesting overfitting. For the optimized ML models, the training and test R2 values are close to each other and, in some cases, the R2 values are better for the test datasets than for the training datasets. RMSE is a measure of the variance of the residuals that quantifies the magnitude of the deviation of simulated values from observed values; the LRM model presents the greatest RMSE (3.15) for the Ayaviri station. The test dataset generally presents a lower RMSE, particularly with the optimized ML models (OML-GBT and OML-RF).
The NSE measures the predictive capacity of a model and can take values between −∞ and 1.0, with 1.0 being the optimal value. Values between 0.0 and 1.0 are generally seen as acceptable performance levels, while values equal to or less than 0.0 indicate that the mean of the observed values is a better predictor than the simulated values, i.e., inadequate performance [60]. In accordance with the results shown in Table 7, the values for the Ayaviri station are between 0.36 and 0.88 across both datasets (training and test). The ML-RF model presents a value close to 1.0 (NSE = 0.88) for the training dataset, indicating an acceptable level of performance.
PBIAS measures the tendency of simulated data to be larger or smaller than their observed counterparts; its optimal value is 0. Positive values indicate an underestimation bias and negative values an overestimation bias. For the Ayaviri station, the OML-KNN model presents high underestimation (PBIAS = 21.64), while the ML-RF model presents high overestimation (PBIAS = −10.65). However, the LRM, MRM, ML-MRM, OML-MRM, and OML-GBT models present optimal PBIAS values for both datasets (training and test).
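The four metrics used above can be computed directly from paired observed and simulated series. A minimal sketch with illustrative synthetic values (not the study's data), following the sign convention stated for PBIAS:

```python
import numpy as np

def r2(obs, sim):
    """Squared Pearson correlation between observed and simulated values."""
    return np.corrcoef(obs, sim)[0, 1] ** 2

def rmse(obs, sim):
    """Root-mean-square error: magnitude of deviation of sim from obs."""
    return np.sqrt(np.mean((obs - sim) ** 2))

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1.0 is optimal; values <= 0.0 mean the
    observed mean predicts better than the model."""
    return 1 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def pbias(obs, sim):
    """Percent bias: positive -> underestimation, negative -> overestimation."""
    return 100 * np.sum(obs - sim) / np.sum(obs)

# Illustrative daily values with many dry (zero) days
obs = np.array([0.0, 5.2, 0.0, 12.4, 0.0, 3.1])
sim = np.array([0.0, 4.8, 0.3, 11.9, 0.0, 3.5])

scores = {"R2": r2(obs, sim), "RMSE": rmse(obs, sim),
          "NSE": nse(obs, sim), "PBIAS": pbias(obs, sim)}
```

With this convention, a simulation that is uniformly 10% too low yields PBIAS = +10, matching the interpretation of positive values as underestimation.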

Discussion
In this study, the RM and ML methods were used to fill gaps in daily precipitation series at stations located in the MOC watersheds on the Peruvian Pacific Slope and coast. The procedure was carried out in three stages: collection of the daily precipitation series, exploratory analysis, and homogenization. Based on the results of the exploratory analysis, the Cañete, Socsi, and Pacaran stations presented large quantities of missing values (over 10%) and also failed the homogeneity test; the initial number of stations (17) was therefore reduced to 14. This underscores that it is essential to implement quality control procedures for raw rainfall data to ensure their reliability. In addition, a preliminary Ward cluster analysis, followed by KM and RVM analyses, identified three homogeneous regions that concisely represent the relationship between precipitation variability and altitude; finally, RM and ML were applied to fill the gaps in the precipitation series.
RM and ML are customizable and easy-to-implement techniques that seek the best performance for a given problem among numerous algorithms. The ML analyses with optimized hyperparameter values (OML-MRM, OML-KNN, OML-GBT, and OML-RF) presented the best data recovery performance, demonstrating that ML models can extract additional information from data that are inherently noisy due to their high spatial and temporal variability [8,34,39]. In general terms, the decision tree methods (OML-GBT and OML-RF) perform the regression task well; however, some variations are observed, showing that ML algorithms are not interchangeable across superficially similar datasets and can vary widely in prediction power. This also underlines the variation in the mechanisms of the ML algorithms, even though all of them are capable of extracting information from non-linear and noisy datasets. Figure 11a shows the Taylor diagram for the Ayaviri station for the training dataset; the ML-RF model presents the best results, with predicted precipitation the most consistent with observed precipitation (R2 = 0.89, RMSE = 1.36, NSE = 0.88, and PBIAS = −1.73). Figure 11b shows the Taylor diagram for the Ayaviri station for the test dataset; the OML-GBT and OML-RF models present the best results (R2 = 0.71, RMSE = 2.05, NSE = 0.71, and PBIAS = 0.00; R2 = 0.70, RMSE = 2.14, NSE = 0.68, and PBIAS = 1.01, respectively). The analyses of the other stations (Huarochiri, San Lazaro de Escomarca, and Langa) are shown in Figure 11c-h; all these stations are located in homogeneous region 1, and their results are similar to those of the Ayaviri station. In terms of the statistical metrics for the training and test datasets, the optimized ML models present the best results, particularly the OML-GBT and OML-RF models.
For the Ayaviri station, the OML-RF model presents a slight underestimation, while the results of the OML-GBT model are more efficient. Finally, in Regions 2 and 4 (Figures S3 and S4, respectively; Supplementary Material), the OML-GBT and OML-RF models present the best results in terms of the statistical metrics.

Conclusions
This study has demonstrated the performance advantages of ML techniques for filling gaps in daily precipitation series, as well as the potential of ML models optimized with hyperparameter values, for the training (75%) and test (25%) datasets, based on the efficiencies of the statistical metrics. However, it is important to note that quality control of the raw rainfall data and a regionalization process, which allows homogeneous regions to be identified, are necessary. Precipitation along the Peruvian Pacific Slope is highly influenced by El Niño, with a marked positive asymmetry of strong events, and by La Niña, giving the precipitation data a non-Gaussian distribution that limits to a certain extent the linear analysis approach [9]. Finally, the results obtained in this study showed that the OML-GBT and OML-RF models presented the least variability in estimation errors and the best approximation to the actual data, efficiently interpreting the spatiotemporal variability of precipitation, as demonstrated by the analyzed statistical metrics.
Supplementary Materials: The following supporting information can be downloaded at: www.mdpi.com/xxx/s1, Figure S1: Annual indices of the regional vector and stations in Region 2;
