Applying Cascade-Correlation Neural Networks to In-Fill Gaps in Mediterranean Daily Flow Data Series

: The analyses of water resources availability and impacts are based on the study over time of meteorological and hydrological data trends. In order to perform those analyses properly, long records of continuous and reliable data are needed, but they are seldom available. Lack of records as in gaps or discontinuities in data series and quality issues are two of the main problems more often found in databases used for climate studies and water resources management. Flow data series from gauging stations are not an exception. Over the last 20 years, forecasting models based on artiﬁcial neural networks (ANNs) have been increasingly applied in many ﬁelds of natural resources, including hydrology. This paper discusses results obtained on the application of cascade-correlation ANN models to predict daily water ﬂow using Julian day and rainfall data provided by nearby weather stations in the Ebro river watershed (Northeast Spain). Five unaltered gauging stations showing a rainfall-dominated hydrological regime were selected for the study. Daily ﬂow and weather data series covered 30 years to encompass the high variability of Mediterranean environments. Models were then applied to the in-ﬁlling of existing gaps under di ﬀ erent conditions related to the characteristics of the gaps (6 scenarios). Results showed that when short periods before and after the gap are considered, this is a useful approach, although no general rule applied to all stations and gaps investigated. Models for low-water-ﬂow periods provided better results ( r = 0.76–0.8). Author Contributions: Conceptualization, C.V.-G. and J.A.; Data curation, M.D.; Formal analysis, M.D.; Funding acquisition, J.A.; Methodology, C.V.-G. Supervision, C.V.-G. Writing—original draft, C.V.-G., J.A.


Introduction
Water resources management is based on the study over time of meteorological and hydrological data trends. In order to evaluate resource availability and possible impacts, long records of continuous and reliable data are needed, but they are seldom available. Lack of records (gaps) or discontinuities in data series and quality issues are two of the main problems more often found in databases used for climate studies and water resources management [1][2][3][4][5][6], especially in mountain regions with limited meteorological monitoring and abundant precipitation often associated to extreme events [7]. Flow data series from gauging stations are also affected by these problems. Unusual flood events may cause breakdowns and failures in the gauging stations which usually result in gaps in the daily flow data series, for instance. Moreover, even very short gaps may compromise the calculation of statistics and data utility [8].
The problem of gaps in data series may be solved theoretically by completing daily flow records from existing data at gauging stations nearby, either upstream or downstream of the same watercourse (e.g., interpolation techniques), although the election of the donor station may be a critical factor affecting the results [8]. Several methods have been used during the last decade for infilling missing

Study Area
The domain of study was the Ebro river basin in Spain which is located in the northeast of the country ( Figure 1) and has an approximate area of 85,550 km 2 [21]. It has a total length of 910 km, and it is the most important river in Spain in terms of flow, with an average water discharge of 430 m 3 /s. The Ebro river is of major importance for ecological and human purposes, being subject to substantial demands from hydropower generation, irrigation of agricultural fields, and recreation and urban uses [1].
Water 2019, 11, x FOR PEER REVIEW 3 of 12 The domain of study was the Ebro river basin in Spain which is located in the northeast of the country ( Figure 1) and has an approximate area of 85,550 km 2 [21]. It has a total length of 910 km, and it is the most important river in Spain in terms of flow, with an average water discharge of 430 m 3 /s. The Ebro river is of major importance for ecological and human purposes, being subject to substantial demands from hydropower generation, irrigation of agricultural fields, and recreation and urban uses [1]. Figure 1. Location of the gauging stations, watersheds, and climate stations used for this study at the Ebro river basin (Spain).

Database
Five gauging stations (GS) out of the 240 existing within the watershed were selected for this study ( Figure 1, Table 1). They were selected because all of them presented unimpaired, natural flow regimes with a reliable data range of 30 years (1976-2005) of daily weather and flow records (average daily water discharge in m 3 /s), and no more than three gaps. In the context of this study, we define a gap as a discontinuity in the daily flow data series due to missing data of at least one or more consecutive days. The continuous period of years used for the study was selected based on data reliability and availability, but also to account for the variability of the Mediterranean climate in this area. The study period included dry spells in the 1980s and 1990s, but also periods in the 1970s and 2000s that were humid in Spain. The selected gauging stations showed a mainly rainfall-dominated hydrological regime. Table 1 and Figure 1 show the location and main characteristics of the weather and gauging stations.

Database
Five gauging stations (GS) out of the 240 existing within the watershed were selected for this study ( Figure 1, Table 1). They were selected because all of them presented unimpaired, natural flow regimes with a reliable data range of 30 years (1976-2005) of daily weather and flow records (average daily water discharge in m 3 /s), and no more than three gaps. In the context of this study, we define a gap as a discontinuity in the daily flow data series due to missing data of at least one or more consecutive days. The continuous period of years used for the study was selected based on data reliability and availability, but also to account for the variability of the Mediterranean climate in this area. The study period included dry spells in the 1980s and 1990s, but also periods in the 1970s and 2000s that were humid in Spain. The selected gauging stations showed a mainly rainfall-dominated hydrological regime. Table 1 and Figure 1 show the location and main characteristics of the weather and gauging stations. The GS databases presented a total of 11 gaps (Table 2) in flow records ranging from 6 to 272 days (nine months) out of a possible maximum of 11,950 records (30 years × 365 days). Missing daily flow data values for each gap would be modeled from meteorological data provided by four nearby weather stations. The variables used for completing gaps in the database were Julian day (JD), the precipitation (mm) on the day we had the missing value (PP), and the precipitation on the five days before the date of the missing value (PP-1 to PP-5). The calculated time of concentration using Kirpich's equation [39] for the different watersheds ranged from 1 to 9 h. Furthermore, from the analysis of the precipitation and streamflow data series, we found that the response of watersheds to rainfall events translates in the streamflow data series as peak flows always within a period of 5 days after the weather event. So, given the size of the watersheds and giving time enough for water from the watershed divide to reach the watershed outlet (time of concentration), a period of 5 days was considered appropriate for the particularities of shape, topography, vegetation, and soil characteristics of the watersheds studied.

ANN Models
Besides the fact that ANN models have proved highly accurate in many previous hydrological applications [6,[21][22][23], in this case, the use of ANN had additional advantages since weather variables would be expected to be highly correlated spatially and temporally, violating assumptions required for traditional statistical model building [21].
There are many types of artificial neural networks, but a specific multilayered feed-forward type of network was used in this study, the cascade-correlation model defined by Fahlman and Lebiere [37] (CCANN). This algorithm was successfully used by Alcázar et al. [21] for the estimation of environmental flows, and a similar model building procedure was followed here. This algorithm has the advantage of optimizing network architecture in the so-called 'training' or 'learning' process, so it does not rely on trial-and-error for final architecture like back-propagation.
Any net model was initially built with an input layer (with nodes for JD and PP variables) and an output layer (with one output node for mean daily flow), and no nodes in the hidden layer. Learning proceeded by testing nodes for the hidden layer, altering their weights iteratively, and adding these new nodes when they produced an improved net performance. We tested up to three new nodes (instead of just one) for addition at each step of the iterative process, until no improvement in performance was gained. An important difference with the previous environmental flow work (with only 46 training cases) was that in the current gaps problem the number of cases was on the order of magnitude of the thousands, so they did not limit network architecture and size. No constraints were imposed for restricting the number of weights in the model, though this algorithm was designed for optimal structure, and resulting models are usually parsimonious. As the number of weights (degrees of freedom) was not an issue, nodes of our models were always fully connected by weights in three layers (input-to-hidden, hidden-to-output, input-to-output).
We randomly split the databases for analysis in training (56%), test (24%), and validation (20%) groups, as usual in neural network modeling (e.g., [12,27]). To avoid effects of this type of hold-out method of cross-validation in model performance and testing [40], the random splitting was repeated at least three times for each model, and 5 initial replicas were built with different sets of random weights for each group at the beginning of training. Convergence of the 15 trials for each model to a same or similar structure was considered a trait of robustness of the solution.
The iterative learning algorithm was based on an adaptative gradient learning rule [37,41], a variant of the general algorithm of back-propagation [42,43]. Training performance was set to optimize the Pearson product-moment correlation (r, Equation (1)) between observed and predicted outcomes: the known flow values (average daily water discharge in m 3 /s) and the output of the net. The test dataset was used to prevent overtraining; learning from the training set was periodically stopped to compute r for the test set and did not continue updating weights once test r started to decrease and diverge from training r. Once training was concluded, the validation dataset was run through the network and its r computed. Predictive model performance was evaluated based on balanced r's for the three datasets, as r is the best known quantitative measure of performance among the group of measures that preserve the pattern of data [40]. Additional criteria used to evaluate the models were based on model residuals, or differences between observed and predicted mean daily flows, specifically the root mean square error (RMSE), mean absolute error (MAE), and absolute maximum error (AME) (Equations (1)-(4)) [40].
Absolute Maximum Error (AME) max y i − y i where y is the mean of observed values (y i observations) and y i are the predicted values. Several transformations were applied to the independent variables and tested through a genetic algorithm (GA, [38,44]) provided by Predict ® 3.24 software [45] to determine the optimal inputs to the models, previously to model building. In this case, the GA evolved a population of variable sets that were selected, mated, and mutated for a maximum number of 50 generations (patience = 7) in order to find the fittest combination of independent variables for each network model. Reproduction parameters applied were: cross-over probability = 0.7, mutation = 1.0, and elitist fraction = 0.05. Fitness was evaluated by a logistic multiple regression function for each individual (or subset of the model's input variables, between 2-30 transformed variables) in the population. Once a best model was obtained, a sensitivity analysis through partial derivatives [45] of the variables included as inputs was used to rate their relative importance in the models [21].

Scenarios
Six different approaches were defined to optimize the predictive models of daily water flow for all and each gap in any gauging station (Table 3). They were required because the natural variability of flow regime (inter and intra-annual variability) is a key aspect in defining the functioning and structure of a river (e.g., [46][47][48][49]), especially in Mediterranean ecosystems. However, this high variability of the streamflow database could induce large errors when developing prediction models from large time spans (30 years, Scenario 1). Consequently, several scenarios were proposed to reduce errors induced by a high inter-and intra-annual daily flow variability while keeping the observations for model building representative (Scenarios 2, 3, 4, 6).
We also considered that the hydrological response of watersheds is influenced by the basin characteristics that regulate runoff, such as geomorphology, geology, and vegetation cover. Changes in land cover over time due to either natural causes or human activities may vary the hydrological behavior of the watershed, i.e., the relationship between precipitation and runoff. Then, the accuracy of streamflow prediction models based only on precipitation data over a long period of time may be lowered by errors related to vegetation cover changes, for instance. Scenario 5 tried to reduce this possible source of error by reducing the time span of the observations used for modeling to the short-term. Table 3. Definition of the scenarios considered in the predictive models.

Scenario 1
General model, the whole range of 30 years of weather and flow data was used for all gaps independently of year or season.

Scenario 2
Extreme years (outliers), if present, were removed, only when the gap was not located within those years.

Scenario 3
Inter-annual variability of flow regimes was considered. All available years were classified into three types, wet 1 , normal 2 , and dry 3 year, based on the characterization of the regime's inter-annual variability, and according to the following criteria [50]. Only years of the same type as those where the gaps were occurring were used for the corresponding analysis.

Scenario 4
Intra-annual variability of flow regimes was considered. Annual flow regimes were divided in low, medium, and high flow periods based on an analysis of seasonal flow variability (similar criteria than in scenario 3 was followed but applied to average monthly values instead of annual values), and only data of the same seasonal period as the gaps were used.

Scenario 5
Short periods of time we selected where the basin characteristics could be considered invariable and therefore the hydrological response of the watershed did not change.
Only data of the two years before and after the gap were used for the analysis.

Scenario 6
Combined scenarios 4 and 5. Only data of the same seasonal flow period of the two years before and after the gap were used for the analysis.
Consequently, we built models for each GA (5 GS models), for each gap in any GS (11 gap models), and for these six scenarios, which raised the number of models built, validated, and analyzed for best performance to 54, with 15 replicas each.

Results
Best models for each gauging station (GS), scenario, and gap are presented in Tables 4-8, if the models were judged robust and adequate according to the performance criteria (r and RMSE in tables, AME and MAE data not shown).
The most important results for the three gaps in GS-7 Jalón-Cetina can be examined in Table 4. The results of GS-7 were based on climate stations 354 and 350A, but in some models one whole station's variables were left out. Models for scenarios S1, S2, and S3 all had r < 0.50, with S3 < 0.2. The scenario S2 was the best for the longer 88' spring gap (87 days), but performance was not very good in the validation dataset, probably for the difficulty in modeling variable high water flows. The seasonal scenario S4 was best for the small 1991 summer gap (17 days). Best results were achieved for the 55-days gap in summer 1984 in which the short-term S5 model reached r = 0.76 for the training dataset and r = 0.82 for the test and validation datasets. Station GS-57 Deza-Embid de Ariza was nearby and we used the same climate stations as for GS-7. Analysis of the water flow curves over all the years based on the average monthly volume showed no extreme years. No homogenous period within the year could be identified (heterogeneous data across all months), so scenarios S2, S3, S4, and S6 could not be applied. The other scenarios gave unsatisfying correlation values r < 0.50 for all replicas. Best results can be seen in Table 5. Station GS-78 Cemborain-Garinoain water flow did not show any extreme years. The two existing gaps overlapped the periods with homogenous very low water flow and the period with high heterogeneous water flow so also here S2, S3, and S6 could not be applied. S1 and S4 gave low r values. The S5 model produced the best results for the 28-day 1992 winter gap with an r = 0.60 for the trained dataset and an r = 0.69 for the validation dataset ( Table 6). The 2004 spring-summer gap (134 days) was not successfully modeled under any of the scenarios.

GS-86
Zidacos-Barasoaínis shared climate station with GS-78. In this gauging station, all scenarios could be applied for modeling but none of them gave good results (r values ≤ 0.50). The best model results can be seen in Table 7, under S1. GS-150 Salado-Estenozis was modeled with independent weather variables from climate station 279. S1 showed r values close to 0.56 for training, test, and validation data for all three gaps. S2, S3, and S6 could not be applied to the longer gaps (49-days and 77-days) because the gaps overlapped both high and low flow periods. Models for the gap in the high period (32-days) gave bad correlations in general (r values ≤ 0.50), except for S5. S5 provided reasonably good results for all the gaps (Table 8). In most models, and in all best models, architectures were parsimonious and solutions converged to similar nets ( Table 9). The differences between the r values of the training, test, and validation datasets were well balanced (similar values for training, test, and validation groups) indicating good reliability in the best models, and RMSEs were low (metric in the same units as the flow data).
The sensitivity analysis of the variables in the best models allowed to identify the most relevant in predicting daily flows. Variables excluded from most models or with partial derivatives that did not indicate relevant contribution were usually the precipitation values 4 and 5 days before the gaps, but there were no clear trends across gauging stations. PP-3 was present in many models as quite influential. Julian day was always in the models, but usually not as the most influential variable. Table 9. Main characteristics of the models selected for the different gaps.

Discussion
Natural and rainfall-dominated flow regime watersheds with a reliable data range of 30 years of daily weather and flow records were not abundant in the Ebro river watershed. These conditions were not easily met and consequently, the data used in this study was limited to five gauging stations and the period 1976-2005. Nevertheless, the selected gauging stations were located north and south of the main stream and include paired, nearby gauging stations. Models to fill eleven gaps representative of different conditions of inter and intra-annual flow variability were built.
The variables used to build the models were purposely few, because Julian day and precipitation data are considered the most important variables to make predictions on streamflow [2,23,51] and they are usually available or easy to gather. Data availability is always an issue in this type of studies. Existing data at gauging stations nearby, either upstream or downstream of the same watercourse, are rarely available. Complex rainfall-runoff models can be built but they also require an abundance of data, and watershed characteristics (such as soil and vegetation influence on water cycle) are not always easily acquired.
Instead, we tested simple models based on generally available weather data and Julian day. In many instances the cascade-correlation network models further reduced the number of variables in order to improve training results; mainly precipitation values delayed 4 and 5 days from the gaps. This fact backed our assessment of the influence of the relatively small size of most of the watersheds and our selection of precipitation variables (previous 1-5 days) based on time of concentration and trends in the data flow series.
Like in other hydrological problems [6,[21][22][23]51], ANNs have proven their potential value for modeling complex hydrological processes with limited data (variables), but the variability of the Pearson r correlation values between observed and predicted outcomes under different scenarios and gauging stations indicate that procedures cannot be generalized. Not all CCANN models performed well enough for their intended in-filling gaps application, even with the advantages provided by this algorithm compared to backpropagation, and by the GA used for selecting the best combination of inputs as a preprocessing technique.
The results of the study suggested that there was not one single scenario suitable for filling up gaps in all gauging stations, but the short-term S5 (two years before and after the gap) gave the best results. Different scenarios would have to be tested, if applicable, but our approach seemed promising if seasonal variability is accounted for and short periods before and after the gap are considered. Using the full 30 years of data (S1) did not give satisfactory results which probably were related to changes in water flow over the years, with higher variability more likely within longer time spans. Gaps in low water flow periods apparently gave better modeling results, probably caused by a lower variability in the data typical of these periods. High heterogeneity in the water flow data negatively influenced the training of suitable models, like in case of GS-57 and GS-86 gauging stations, where suitable models were not found for any scenario or gap. Future work may have to look into neural network algorithms better suited to identify extreme values instead of general trends.

Conclusions
Our purpose was to complete discontinuous streamflow data series using a simple filling procedure based on readily available variables. Cascade-correlation neural network models (CCANNs) were built for the estimation of daily water flow in five gauging stations with rainfall-dominated natural hydrological regime located in watersheds of the Ebro river. Models were based on Julian day and precipitation variables from weather stations nearby. We explored the use of a 30-year database under different conditions related to existing gaps in the five gauging stations. Scenarios were defined in order to analyze the performance of the models in different conditions related to intra-and inter-annual natural variability of flow regime as well as database length and characteristics. We concluded that when seasonal variability is accounted for and short periods before and after the gap are considered, CCANNs models can be a very useful predicting tool for filling gaps in streamflow series. No general rule applied to all stations and gaps investigated, on the contrary, individual models had to be built for individual gauging stations using the most appropriate scenario (database length and characteristics, and variables included in the model) to provide best results for each flow series and gap. Models for low water flow periods apparently performed better, probably because of the lower variability in the data typical of these periods. Funding: This study was partially funded by a research contract between ENDESA S.A. and the University of Lleida.