Machine Learning Modeling of Climate Variability Impact on River Runoff

Abstract: The hypothesis of this study was that of the existence of spatially organized links between the time series of river runoff and climate variability indices describing the oscillations in the atmosphere–ocean system: ENSO (El Niño–Southern Oscillation), PDO (Pacific Decadal Oscillation), AMO (Atlantic Multidecadal Oscillation), and NAO (North Atlantic Oscillation). The global river flow reconstructions (ERA-20CM-R) for 18 study areas on six continents and climate variability indices for the period 1901–2010 were used. The split-sample approach was applied, with the period 1901–2000 used for training and 2001–2010 used for testing. The quality measures used in this paper were mean absolute error, dynamic time warping, and top extreme events error. We demonstrated that a machine learning approach (convolution neural network, CNN) trained on climate variability indices can model river runoff better than the long-term monthly mean baseline, both in univariate (per-cell) and multivariate (multi-cell, regionalized) settings. We compared the models to the baseline in the form of heatmaps and presented the results of ablation experiments (test-time ablation, i.e., jackknifing, and training-time ablation), which suggested that ENSO is the primary determinant among the considered indices.


Introduction
Destructive floods kill, on average, thousands of people worldwide per year, and cause material loss of the order of tens of billions to hundreds of billions of USD [1]. There have been many reports on dramatic deluges, particularly in less developed countries. Very high flood risk, in terms of both the percentage of national population and national GDP affected, has been found to exist in Southeast Asia (Vietnam, Cambodia, Bangladesh) [2]. However, large floods have also been reported in recent years in large countries of Asia (China, India, and Pakistan), as well as in Africa, the Americas, and Australia. Moreover, in Europe, many destructive floods have been recorded in recent decades, with the costliest one, in August 2002, affecting Germany, Austria, and the Czech Republic in particular. Despite all the progress made in monitoring, forecasting, and safeguarding, there is no place on Earth where this challenge has been definitively solved [3].
It is well known that flood risk has been increasing with time at virtually all spatial scales [1,2], mainly due to the increasing exposure and vulnerability of people and property. However, there are huge gaps in the understanding of changes in flood hazard and flood risk. An increase in the frequency and amplitude of intense precipitation is expected in a warmer climate. However, no general, ubiquitous, and statistically significant trend in the observed maximum river flows has been found yet, and model-based projections of changes in flood hazard are highly uncertain [4]. Clearly, there are several non-climatic factors influencing flood risk of terrestrial/hydrological, as well as social/economic nature, e.g., changes in catchments and rivers as well as changes in flood damage potential [5].
In this study, we considered climate variability as one of the prime drivers inducing high river discharges. Freshwater transport in the drainage basins shows pronounced inter-annual and multi-decadal oscillations and trends. Perhaps part of this variability is random or chaotic, but there are serious hints that climate variability can play an important role. In the "Interpretation of Change in Flood-related Indices based on Climate Variability" (FloVar) project, we examined the variability related to high river runoff and floods and sought relationships between it and the indices of climate variability. The present study was focused on the climate variability track that is likely to play a dominant role in the interpretation of the variability of river runoff globally. The research hypothesis was that of the existence of such spatially organized links between recorded time series of river runoff and climate variability indices. In this study, we considered the following indices of climate variability describing the oscillations in the atmosphere-ocean system: three indices of ENSO (El Niño-Southern Oscillation), and one each of PDO (Pacific Decadal Oscillation), AMO (Atlantic Multidecadal Oscillation), and NAO (North Atlantic Oscillation).
Emerton et al. [6] demonstrated that ENSO climate variability influences river flow and flooding at the global scale. Examining the ERA-20CM-R river flow reconstruction for the 20th century, they showed that the likelihood of increased or decreased flood hazard differs for El Niño and La Niña events. In this study, we extended the use of the ERA-20CM-R dataset for multiple indices of ocean-atmosphere oscillation.
The aim of the present study was to analyze the linkages between selected climatic indices and river runoff by means of machine learning models. Six climate variability indices were assumed to be independent variables, while the deviations from the monthly averages of the river runoff values arranged in a spatio-temporal grid were treated as dependent variables. While it is possible to train a model with only a single time series for a specific output cell (univariate setting), we focused on models that handle multiple cells in a region simultaneously (multivariate, regionalized setting). The working hypothesis was that grid cells in close proximity may share regional characteristics, and thus models that collect data from regions (groups of neighboring cells) rather than single cells may model runoff more accurately. The empirical analysis summarized in Sections 3 and 4 confirmed this hypothesis, showing that the multivariate approach offered better accuracy of models and improved significantly upon the baseline.
A number of studies have engaged machine learning (ML) in water sciences, thus complementing numerous other works that provide more analytic and often more domain- and knowledge-driven perspectives (in contrast to the predominantly data-driven stance adopted in mainstream ML). This applies in particular to neural networks (NN), which were covered in dedicated books (e.g., [67]) already 20 years ago. In [68], it was noted that viewing forced climate patterns through an AI lens demonstrates the potential power of ML. This approach augurs well for studying climate and water interfaces. However, to our knowledge, there have been relatively few attempts to use ML for determination of a link between climate variability and river runoff. In one of the relevant studies of that type [69], a recurrent NN (of a kind hardly used in ML nowadays) was trained to model low-frequency climatic variability and regional annual runoff in Quebec and Labrador. The authors attempted to link the runoff to selected climatic indices (ENSO, NAO, Pacific-North American (PNA), Baffin Island-West Atlantic (BWA), and sea-level pressure (SLP) at Iceland), and concluded that using BWA, PNA, and ENSO improved the quality of forecasts more than SLP or NAO.
Among numerous types of NNs used in water science, the convolution neural networks (CNN) architectures have become increasingly popular in recent years, perhaps due to their impressive record of achievements in image analysis, then corroborated for various types of time series (e.g., speech recognition and synthesis), and due to the quickly growing repertoire of extensions (e.g., batch normalization, dropout, modern self-tuning training algorithms) and software libraries that made them effective at larger scales (more complex, deeper models, and efficient learning from large data volumes) and deployable on highly efficient hardware architectures (GPUs and dedicated platforms). Only within the last two years have CNNs been used for estimation of discharge for an ungauged watershed [70]; for estimation of pollutant loads (biochemical oxygen demand and total phosphorus) in ungauged watersheds [71]; for statistical downscaling of precipitation from multi-model ensembles [72]; and, in combination with the popular variant of recurrent NN, long short-term memory (LSTM), to simulate water quality (including total nitrogen, total phosphorus, and total organic carbon) in Korea [73]. This list is by no means exhaustive. It merely serves as a small set of examples.
Given the overwhelming capability of NNs to model complex dependencies (and their theoretically proven capability of universal approximation), they are nowadays the primary tool of choice in water sciences. One should, however, not neglect many valuable contributions that have examined other ML techniques, which can be considered more appropriate in certain contexts by being, e.g., easier to interpret or less time-consuming to train or query. One of them, still in extensive use nowadays, is the support vector machine (SVM), which maps the learning task to a high-dimensional space and approaches it as an optimization problem. A review of many such contributions in hydrology, published in 2014 [74], shows the breadth of research efforts conducted for just one type of ML model. While this and many other studies engage ML to analyze and model hydrological phenomena at the global scale, many contributions focus on specific regions and/or smaller time scales where more detailed and multi-aspect data can be available. For instance, an analysis for a small watershed in British Columbia, comparing a Bayesian NN, support vector regression (a variant of SVMs capable of handling regression tasks), and Gaussian processes, was provided in [75]. All these methods outperformed the baseline in terms of accuracy, with the NN slightly outperforming the other methods. Their analysis took into account the Pacific-North American teleconnection (PNA), the Arctic Oscillation (AO), and the North Atlantic Oscillation (NAO), and showed that their combination with local observations worked best for longer lead times.
The class of NN models used in this study is widely considered non-interpretable for humans, or at least not easily interpretable, given the number of model parameters. Many research projects have aimed to address this limitation and to understand the pros and cons of various types of ML models. In [76], the focus was on the tension between interpretable and non-interpretable ML models, examined through the predictive capabilities of three models of the former type (linear regression, random forest (RF; ensembles of decision trees), and extreme gradient boosting) and three models of the latter type (SVM and two NN architectures: an LSTM and a dense, fully connected neural network, also known as a perceptron). The hydro-climatological problem they tackled concerned the modeling of crop evapotranspiration. The conducted empirical evaluation indicated that the interpretable models (RFs in particular) do not yield significantly to the more complex, hard-to-interpret models.
Last but not least, there is a large number of studies that, rather than proposing overarching models that link conceptually distant concepts, investigate specific, focused aspects of water-related phenomena and attempt to link them to the large-scale circulation of water using, among others, ML techniques. For instance, SVMs were used [77] to model the crop coefficient evapotranspiration in a sub-tropical environment. Interestingly, the authors found the SVM to be superior in this particular setting to artificial neural networks. In a more recent study [78], the authors coupled a base flow separation method to three machine learning models and showed that separating streamflow into base flow and surface flow improves the quality of forecasts for a range of ML models (SVR, NN, and RF) for four rivers in the United States. This last study is an interesting example of how combining domain knowledge (on flow separation in this case) with contemporary, domain-agnostic ML models can prove beneficial for the quality of modeling.
Studies on applications of ML in water science should be seen as a part of the research agenda of using ML in climate sciences in general. This includes, among others, works on earth system models on various time scales. As an example of those efforts, a range of state-of-the-art NNs was applied [79] to model the dynamics of the non-seasonal changes in the data collected in the Community Earth System Model (CESM2) that covers records on monthly air temperature evolution in the 1200 years of the pre-industrial era. The authors compared a range of multi-layer perceptrons and instances of LSTMs combined with principal component analysis and convolutions. Of the compared models, the one combining LSTM with convolutions fared the best in terms of the prediction error for a broad range of prediction lead times. In terms of spatial distribution of errors, that model offered best predictive capability at low latitudes and over oceans.
In summary, the related works clearly suggest the usefulness of ML for modeling of phenomena studied in water science, and in particular the usefulness of NNs as a "model substrate", the importance of global climatic indices for the modeling, and the capability of modeling of river runoff. In this context, our submission remains original, among others, by involving an NN-based multivariate, regionalized approach.

Data
The data used in this study embrace a long time series of climate variability indices and globally gridded river runoff records, with a monthly resolution.
The sources of climate variability indices used in this study are listed in Table 1. Three indices in the top rows of Table 1, i.e., Nino 3.4, EMI Modoki, and SOI, belong to the group of ENSO (El Niño-Southern Oscillation) indices. Nino 3.4 is defined as the sea surface temperature (SST) anomaly averaged over the area extending from 5° S to 5° N and from 170° W to 120° W. This is indeed the primary index used for the analysis of El Niño phenomena. The ENSO Modoki Index (EMI), also called EMI Modoki, is a more complex index defined as a combination of sea surface temperature anomalies from three different regions of the tropical Pacific. Detailed descriptions of the indices (cf. Table 1) can be found in [80]. The global runoff data used in this project (ERA-20CM-R) form a grid of variables representing the estimated runoff, covering the period 1901-2010, with daily temporal resolution and a spatial resolution of 0.5° × 0.5°. The data were made available to scientists collaborating in this study by Dr Rebecca Emerton and co-workers, who prepared this dataset [6]. The data were produced by the Global Flood Awareness System (GloFAS). Each estimated daily river runoff in a cell is originally represented by 10 values that reconstruct the past river runoff, each produced by one model of a 10-member ensemble. While Emerton et al. [6] used these data to assess the variance for each cell in order to estimate the uncertainty, in the present study, we aggregated them with the arithmetic mean into a single time series per grid cell. While we acknowledge the fact that the global river runoff data used in this paper are model results themselves, we still claim that this is probably the best global-scale runoff data collection available to the scientific community at the time of writing.
To sum up, while our time series of indices of climate variability formed one-dimensional time-ordered sequences, river runoff records formed a 3D array, with the first 2 dimensions indexed by longitude and latitude of grid cells, and the third dimension indexed by time.
In order to match the monthly time resolution of climate indices, we integrated the daily runoff data values into monthly buckets for each cell independently.
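As a minimal sketch of this preprocessing (using synthetic data in place of the ERA-20CM-R files, and a two-year span rather than 1901-2010), the two aggregation steps for a single grid cell can be expressed with pandas:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one grid cell: 10 ensemble members of daily runoff.
rng = np.random.default_rng(0)
dates = pd.date_range("1901-01-01", "1902-12-31", freq="D")
ensemble = rng.gamma(2.0, 1.0, size=(10, len(dates)))  # 10 members x days

# Step 1: collapse the 10-member ensemble into one series via arithmetic mean.
daily = pd.Series(ensemble.mean(axis=0), index=dates)

# Step 2: integrate the daily values into monthly buckets.
monthly = daily.resample("MS").sum()

print(len(monthly))  # 24 months for two years of daily data
```

In the actual pipeline, this aggregation is repeated independently for every valid cell of the 0.5° × 0.5° grid.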

Modelling
Our ultimate goal was to model the impact of global climate variability indices (6 independent input variables) on river runoff for each grid cell of selected areas. The target value is the deviation from the monthly mean of river runoff, where the latter is calculated by averaging river flow values over time for each calendar month separately. Therefore, we trained the model to calculate an adjustment to the expected value for a particular month.
In the basic variant, we designed univariate models that directly match this problem statement, i.e., a dedicated local model was trained for each grid cell. Concerning time alignment, a model has access to independent variables in a time horizon [t − k + 1, t], where t is the "current" time period (month) and k is the length of the accessible time window. On the basis of the 6 input variables observed in this window, we expect the model to produce the hindcast of river runoff in time period t + 1 for a given cell. Therefore, our univariate models have the mathematical signature R^(6×k) → R.
As signaled in the Introduction, we hypothesized that taking into account the spatial nature of the runoff data may bring some benefits. Runoff in neighboring cells may share some similarities, reacting to similar fluctuations in the climate system, and may thus be affected in a similar way. This observation motivates proposing a family of multivariate models, which can learn simultaneously from multiple cells in a selected region. For a region comprising n grid cells, this translates into building multivariate models of the mathematical form R^(6×k) → R^n. In other words, the target value is a vector of the deviations from monthly means for each cell in a region. It is worth mentioning here that multivariate models do not perceive those cells as a sequence; they treat them as a collection, without concern for spatial neighborhood relationships.
Our expectation is that multivariate models will tend to learn internal representations (features) that capture regional patterns of dependency and thus facilitate more accurate modeling. A multivariate model, by observing all n variables in the training process, also obtains more statistical evidence for the estimation of its parameters, and should in principle be less likely to overfit to the training data. It is important to emphasize that a multivariate model is here far more than a mere collection of univariate models: all models considered here are implemented as neural networks, and thus parts of a multivariate model are explicitly shared between the predicted variables (corresponding here to cells).
Another advantage of multivariate models concerns computational cost: training n univariate models can be expensive, especially for large n. We show in the following sections that our multivariate models are only slightly more complex than univariate ones; therefore, training a single multivariate model for an n-cell region is approximately n times less costly in terms of computational overhead. Similarly, a single query to such a model produces hindcasts for all cells in a region, bringing similar computational savings.
As signaled earlier, we implement our models as deep neural networks. This choice is motivated by the recent deluge of successful applications of neural models in a broad range of areas involving time series processing, including speech synthesis; natural language processing; interpretation of video sequences; and, last but not least, meteorology, climatology, and hydrology. In preliminary experiments, we considered 2 major deep-learning neural network architectures, namely, recurrent neural networks (RNNs) and convolution neural networks (CNNs). As CNNs turned out to provide better accuracy of models, and are also conceptually simpler, we decided to limit the following presentation to this category of models.

Convolution Neural Networks (CNNs)
The convolution neural network (CNN) is a class of artificial neural networks that uses the concept of convolution as its foundation. The essence of processing is convolving the input data with a suite of kernel functions and repeating this process at multiple levels of temporal resolution, with the levels separated by nonlinear processing (activation functions).
The technical implementation of convolution in CNNs is by necessity discrete (due to discrete time) and controlled by a number of hyperparameters: receptive field size (the width of the kernel on the time axis), depth (the number of kernels applied in parallel to the same input data), stride (the step at which the kernel moves along the input signal), and padding (the way one deals with the border cases when the kernel extends beyond the available data). For more details on these aspects, see, e.g., [81]. Crucial, however, is the fact that kernels are trainable (in contrast to classical signal processing, where they are typically assumed to be given, e.g., derived from first principles). In a discrete realization, a kernel is simply a vector of m real numbers, where m is the kernel width. These parameters are initialized at random and then optimized with a variant of stochastic gradient descent (SGD) to minimize a task-specific loss function that expresses the discrepancy between the actual output of the model and the target values.
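The mechanics of such a discrete convolution can be illustrated with a plain NumPy sketch (a "valid", no-padding cross-correlation, as used in CNN layers); the kernel values are fixed by hand here, whereas in a CNN they would be learned:

```python
import numpy as np

def conv1d(signal, kernel, stride=1):
    """Valid (no-padding) 1D cross-correlation, as used in CNN layers."""
    m = len(kernel)  # kernel width = receptive field size
    out = []
    for start in range(0, len(signal) - m + 1, stride):
        out.append(float(np.dot(signal[start:start + m], kernel)))
    return np.array(out)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
k = np.array([0.5, 0.5])           # width-2 averaging kernel

print(conv1d(x, k, stride=1))      # 7 outputs: overlapping pairwise means
print(conv1d(x, k, stride=2))      # 4 outputs: non-overlapping pairs
```

With the stride equal to the kernel width (the second call), each input element contributes to exactly one output, which is the regime used by the architecture described below.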
Convolutions prove universally useful in various domains because they can model local patterns and interdependencies. While traversing the input signal, a kernel interacts with a small number of elements that form a neighborhood, which for the models considered here is a time window of predefined length.
The number of concrete neural architectures that can be composed from convolutional building blocks is combinatorially vast, and the rich literature on this topic has sampled this design space intensely in the past decade. The CNN architecture we propose is inspired by the work of Van den Oord et al. [82], where it proved very effective for speech synthesis. The main idea is to use many one-dimensional convolutional layers with a receptive field size of 2, stacked on top of each other, with the stride parameter (the shift along the axis of the kernel's movement) set to 2 (see Figure 1). As a result, each convolutional layer doubles the effective receptive field (window) compared to the previous layer, i.e., the temporal extent of the input data it depends on; moreover, with the stride equal to the kernel width, each element is visited only once in a single convolution (though it is visited by every filter). As each kernel has a size of 2, the first element of the kernel always perceives the earlier observation, while the second one sees the later one. This allows the architecture to access a long range of input data (exponentially long in the number of layers) while maintaining a low total number of parameters. The latter aspect is particularly important, as overparameterized models are in general more likely to overfit and thus fail to generalize well.
Importantly, the successive convolutional layers feature an increasing number of kernels, each of them giving rise to a separate channel. The complexity of the model (the number of filters) grows linearly with each layer. Therefore, for the model trained on the records for 8 months (as can be seen in Figure 1), with the complexity hyperparameter set to 30, the first layer (the closest to the raw features) performs the convolution with 10 filters, the next layer with 20, and the last one with 30 filters. This allows the model to form a rich repertoire of diversified features, which in turn facilitates making more accurate hindcasts. The stack of convolutions is followed by a fully connected layer (also known as the dense layer). Contrary to the convolution, it considers signals as a set, disregarding their relative positions. The responsibility of the dense layer is to weigh those values into a mixture of signals, which results in the final value returned by the model.
All neural models are trained to minimize the loss measured as the mean absolute error (MAE) with respect to the target value. We used the RMSProp variant of SGD, which improves the SGD algorithm with the mechanism of the guided gradient. During the consecutive training steps, the gradient is not immediately applied to improve the model (setting new weights) but is added to accumulator values, which act as an exponentially weighted moving average over the past gradients. This accumulator is used to set a new direction for the gradients, considering the previous training steps.
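A minimal Keras sketch of such a model may read as follows. It assumes an 8-month input window of the 6 climate indices, kernel size and stride of 2, a linearly growing filter count (base_filters=10 corresponds to the complexity hyperparameter of 30), and ReLU activations; the activation function is our assumption, as it is not specified above, and this is a simplified sketch rather than the exact architecture used in the study.

```python
import math
import tensorflow as tf

def build_model(window=8, n_indices=6, n_cells=1, base_filters=10):
    """Strided Conv1D stack: each layer halves the time axis (kernel 2,
    stride 2) while the filter count grows linearly; a dense head mixes
    the resulting channels into one output per grid cell."""
    inputs = tf.keras.Input(shape=(window, n_indices))
    x = inputs
    for i in range(1, int(math.log2(window)) + 1):   # 8-month window -> 3 layers
        x = tf.keras.layers.Conv1D(filters=base_filters * i, kernel_size=2,
                                   strides=2, activation="relu")(x)
    x = tf.keras.layers.Flatten()(x)
    outputs = tf.keras.layers.Dense(n_cells)(x)      # n_cells = 1: univariate
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop", loss="mae")   # MAE loss, RMSProp variant of SGD
    return model
```

Setting n_cells to the number of valid cells in a region yields the multivariate variant: the convolutional stack is shared, and only the dense head widens.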

Training and Testing Regime
The goal of a good machine learning solution is to generalize to unseen data. A trained model has to perform well both on the data available at the training stage and on data that arrive in the future. These future data have to be simulated in some way. The usual practice is to follow a split-sample approach, where a larger subset of the existing records is used for learning and a smaller subset for testing (in the hindcast mode). These are usually called the training and test sets, respectively.
In this project, 2 ways of dividing data into training and testing sets were used. In order to find good hyperparameters for the model, we tested different values in a process of cross-validation. This technique was used to assess the performance of the model on the whole dataset. As the first step, the data were divided into small subsets. Next, in an iterative process, each subset was used as a test set, while all the other subsets were used to train the model. This approach was used for the analysis of the models: by performing the analysis on the whole dataset, we obtained a better approximation of the relations present in the data than an analysis performed on a shorter period would give. The other way of dividing is simply to designate the period of interest in the research as the test set; in our case, it was the last decade of the available data, i.e., the years 2001-2010. Both techniques were subject to an additional constraint: the climate data used in the training set could not be used in the test set, as this could be seen as a data leak. For this reason, there were always some data that were not used in the process.
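The leak-avoiding split can be sketched as follows on synthetic data (array sizes and the windowing helper are illustrative, not the study's actual code): samples whose k-month input window straddles the train/test boundary are simply discarded.

```python
import numpy as np

def make_windows(indices, runoff_dev, k=8):
    """Sample t: the k months of climate indices [t-k, t) predict the runoff
    deviation in month t (one grid cell)."""
    X = np.stack([indices[t - k:t] for t in range(k, len(runoff_dev))])
    y = runoff_dev[k:]
    return X, y

T, k, split = 1320, 8, 1200          # 110 years of months; first 100 for training
rng = np.random.default_rng(1)
indices = rng.normal(size=(T, 6))    # 6 climate indices
runoff_dev = rng.normal(size=T)      # deviations from monthly means

X, y = make_windows(indices, runoff_dev, k)
# Sample i targets month i + k and reads index months [i, i + k).  Keep it for
# training only if the target falls before the split, and for testing only if
# the whole input window lies after the split; the k samples in between are
# dropped so that no climate record appears in both sets.
train_idx = [i for i in range(len(y)) if i + k < split]
test_idx = [i for i in range(len(y)) if i >= split]
print(len(train_idx), len(test_idx))  # 1192 112 -> 8 boundary samples dropped
```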

Evaluation and Metrics
We assess the quality of models with 3 measures: mean absolute error (MAE), dynamic time warping (DTW), and top extreme events error (TEEE).
Mean absolute error (MAE) is measured in the same way as the loss function used to train the models. This measure has the advantage of better interpretability, compared to, among others, mean square error.
Dynamic time warping (DTW) is a technique for finding the best alignment between elements of the 2 time series. In contrast to classical measures that compare only the observations from the same time period, DTW is able to match points in sequences that reside at different positions. Because of this, DTW is more capable of detecting similarities between time series with some shift in their pattern, compared to methods that accumulate errors element-wise (e.g., MAE). A more detailed description can be found in [83]. In this project, we used a tool developed by Gorgino [84].
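For illustration, a textbook dynamic-programming implementation of the DTW distance (the tool of [84] offers a far richer interface) can be sketched as:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW with absolute-difference local cost and unit steps."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

x = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0]
y = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0]   # same pattern, lagged by one step
print(dtw_distance(x, y))            # 0.0: DTW aligns the shifted peaks
```

An element-wise metric such as MAE would instead accumulate the full shift error for the same pair of series.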
The last measure, top extreme events error (TEEE), indicates the fraction of erroneously located top values in a series. It was developed in order to measure how bad a model is at predicting extreme/high runoff events. TEEE_n(x1, x2) is defined as follows: first, the set S1 of locations of the top n extreme values in x1 is calculated, and analogously S2 for x2. Then, the accordance of these sets is calculated as the size of the intersection of S1 and S2, divided by n. Finally, TEEE is defined as 1 minus accordance. In this study, we used n = 20, and thus TEEE considers roughly 17% of the 120 months that form the test set (2001-2010) as extreme events.
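A direct NumPy sketch of TEEE_n, assuming unique values (no ties among the top n), could read:

```python
import numpy as np

def teee(actual, predicted, n=20):
    """Top extreme events error: 1 minus the overlap fraction between the
    locations of the n largest values in the two series."""
    s1 = set(np.argsort(actual)[-n:])      # locations of top-n actual values
    s2 = set(np.argsort(predicted)[-n:])   # locations of top-n predictions
    return 1.0 - len(s1 & s2) / n

rng = np.random.default_rng(2)
actual = rng.normal(size=120)              # 120 test months (2001-2010)
print(teee(actual, actual, n=20))          # 0.0: extremes perfectly localized
print(teee(actual, -actual, n=20))         # 1.0: extremes completely misplaced
```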
To sum up, MAE is a commonly used metric for measuring model performance in regression tasks. The DTW indicates how much of the target signal characteristic is present in the predicted sequence, even if patterns are shifted or stretched in time. The TEEE has a complementary characteristic, as it penalizes predicting extreme values at wrong locations in time while not caring at all about the actual magnitude of values. As an illustration, a model that makes perfect hindcasts but lags them by only 1 time step will attain a very low DTW, yet a high TEEE, since the locations of its top n events are shifted with respect to the actual ones (the latter assuming that all values in the actual and predicted sequences are unique, and therefore there are no ties when determining the n most extreme events).
The above measures are absolute, and their values can be biased by the actual difficulty of the modeling task, which may significantly vary depending on the geographical region, time period, and other characteristics. In order to assess how much a model is able to learn about the characteristics of the observed phenomenon, in the following, we focus on reporting relative improvements on those metrics compared to a baseline. We express this using the percentage error reduction, defined as PER(M) = (Err(baseline) − Err(M)) / Err(baseline) × 100%, where Err(·) denotes the accumulated error of a model under a given metric. Error accumulation consists of summing the error over the selected part of the available data, e.g., across a time period or space (e.g., a region). It is worth noting that PER(M) can be negative, indicating that the model fares worse than the baseline.
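A minimal sketch of this relative metric, applied to hypothetical per-cell errors accumulated over a three-cell region:

```python
import numpy as np

def per(model_errors, baseline_errors):
    """Percentage error reduction: how much the accumulated model error
    improves on the accumulated baseline error (negative = worse)."""
    e_model = np.sum(model_errors)
    e_base = np.sum(baseline_errors)
    return 100.0 * (e_base - e_model) / e_base

# Per-cell MAE of a model vs. the baseline over a hypothetical 3-cell region:
print(round(per([0.8, 0.9, 1.0], [1.0, 1.0, 1.0]), 6))   # 10.0 -> a 10 pp gain
print(per([2.0], [1.0]))                                  # -100.0 -> worse than baseline
```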

Baseline Models
Our baseline is a naïve model: strictly periodic, stationary, with no trend. For each of the 12 months, the respective mean value is calculated from the training set. When asked for a prediction for a given month, the model returns the respective mean value, disregarding any other features. The predictions of this model are thus identical for all years (for a given training period). Notice that this is the only model considered in this study that uses the historical values of river runoff to determine the value of the dependent variable; all the remaining methods use the climate indices only.
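On toy data with a purely seasonal signal, this monthly climatology baseline can be sketched as:

```python
import numpy as np

def fit_baseline(train_runoff, train_months):
    """Long-term monthly mean baseline: one mean runoff value per calendar
    month (1..12), estimated from the training period only."""
    return {m: np.mean([r for r, mo in zip(train_runoff, train_months)
                        if mo == m]) for m in range(1, 13)}

def predict_baseline(climatology, months):
    # The same 12 values are replayed for every year of the test period.
    return np.array([climatology[m] for m in months])

months = [(t % 12) + 1 for t in range(1200)]        # 100 training years
runoff = [10.0 + m for m in months]                 # toy seasonal signal
clim = fit_baseline(runoff, months)
print(predict_baseline(clim, [1, 2, 12]))           # [11. 12. 22.]
```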

Software
The software framework for this study was implemented in Python with the help of general-purpose libraries (NumPy, Pandas, Scikit-learn, Matplotlib), a popular deep learning framework TensorFlow, and domain-specific tools: Ncview (a NetCDF file visualizer) and GeoPandas (a map visualization library). While TensorFlow makes it possible to delegate computation to GPUs, this option was not exploited in this study due to the nature of the used computer infrastructure.
The river flow information, as used by Dr Rebecca Emerton [6], was received from the European Centre for Medium-Range Weather Forecasts (ECMWF). It came in the NetCDF format (Network Common Data Form), which is a part of a broad framework of software libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.
In order to provide our models with access to river flow data, we developed an independent software layer. This involved solving several technical challenges, part of them stemming from the sheer volume of data, which was too large to fit to the RAM (raw data have the size of roughly 416 GB).

Comparison to the Baseline
We compared the performance of the two deep-learning models across 18 different regions, as shown in Figure 2. The four leftmost columns of Table 2 present the ranges of latitude and longitude and the number of valid cells for each studied region, where a cell is considered valid if data are available for it. The regions were defined so as to provide a representative sample of various climatic environments and locations on the planet, scattered over six continents: Europe, Asia, North America, South America, Africa, and Australia. For each region independently, we first ran a series of preliminary fivefold cross-validation experiments to tune the number of convolutional filters used by the neural models (this hyperparameter was found to have the strongest impact on the accuracy of models). The obtained numbers of filters were subsequently used in the experiments that follow. All models took as input the six climate indices presented in Section 3.1, observing them in an 8-month period (window size k = 8, cf. Section 3.2) preceding the predicted/modeled month.
Training of a neural network starts with random initialization of its parameters. To accommodate this stochastic element, the main experiment consisted of repeating the training process five times. In what follows, we report the averages of the metrics achieved by those models in a hindcasting regime when trained on the first 100 years of data and evaluated on the last decade of the studied data (2001–2010). Therefore, the prevailing part of the original data was used to fit the model, while the last decade served exclusively for scoring; the metrics reported in the following thus reflect the generalization capability of the models, not the goodness of fit.
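The experimental protocol can be summarized in a few lines. This is a hedged sketch under stated assumptions: `fit_and_score` is a placeholder standing in for training the CNN and computing its test-set MAE, and the data are dummy random values.

```python
import numpy as np

# Sketch of the protocol: 100 training years, a 10-year test period, and
# five training repetitions to average out the effect of random weight
# initialization. The "model" below is a placeholder, not the actual CNN.

rng = np.random.default_rng(42)
months = 110 * 12
y = rng.random(months)  # monthly runoff of one cell (dummy data)
train, test = np.arange(100 * 12), np.arange(100 * 12, months)

def fit_and_score(seed):
    # Placeholder: predicts the training mean plus seed-dependent noise,
    # mimicking five differently initialized training runs.
    pred = y[train].mean() + np.random.default_rng(seed).normal(0, 0.01, len(test))
    return float(np.mean(np.abs(y[test] - pred)))  # test-set MAE

scores = [fit_and_score(seed) for seed in range(5)]
print(len(scores))  # 5 -- one score per repetition; their mean is reported
```

Only the averaged test-decade scores enter the tables, so the reported figures measure generalization rather than fit to the training period.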
The six rightmost columns of Table 2 present the quality of models assessed using the three relative metrics defined in Section 3.4, i.e., PER (MAE), PER (DTW), and PER (TEEE), averaged over each region. Recall that, for the multivariate models, the averages report the quality of five independent training and testing rounds mentioned earlier (cf. Section 3.2). However, for the univariate models, the averages are taken over those five rounds and over the models trained for individual (valid) cells; for instance, 5 × 140 = 700 models for region #1.
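Of the three underlying quality measures, dynamic time warping is the least standard; a minimal textbook implementation of the DTW distance (not necessarily the exact variant used in the study) reads:

```python
import numpy as np

def dtw_distance(a, b):
    """Plain O(len(a) * len(b)) dynamic time warping distance with an
    absolute-difference local cost; a minimal sketch for illustration."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)  # accumulated-cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Best of: insertion, deletion, or diagonal match.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

print(dtw_distance([0, 1, 2], [0, 1, 2]))     # 0.0
print(dtw_distance([0, 0, 1, 2], [0, 1, 2]))  # 0.0 -- warping absorbs the shift
```

Unlike MAE, DTW tolerates small temporal misalignments between hindcast and observation, which is why both are reported side by side.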
For multivariate models, there are three regions (#1 in North America, and #12 as well as #15 in Asia) with PER (MAE) exceeding 10 percentage points (pp). For three regions (#1 in North America, #15 in Asia, and #18 in Australia), PER (DTW) exceeded 10 pp, and four regions (#1 in North America, #12 and #15 in Asia, as well as #17 in Australia) had PER (TEEE) exceeding 10 pp. There were two regions (#1 in North America and #15 in Asia) for which all three metrics exceeded this threshold.
For univariate models, the 10 pp mark was achieved in #1, #12, and #15 for PER (MAE); in #1, #15, and #18 for PER (DTW); and in #1, #8, #15, and #18 for PER (TEEE). There were two regions (#1 and #15) for which all three metrics exceeded this threshold. Interpretation of PER (TEEE) requires special care, as it is a discrete metric based on only the n top (extreme) events in a time series. Here, we used n = 20, and therefore roughly 17% of the 120 test-set months (2001–2010) were treated as extreme events. As argued in Section 3.4, TEEE can achieve the exact value of zero when all of the n most extreme hindcasts of the model coincide in time with the actual extreme events. This causes division by zero in the PER(M) formula, which we prevented by adding an epsilon value to the denominator. Nevertheless, the result of the division can still be very high. This happened for region #5, and hence we replaced the largely meaningless overflow value obtained there with a dash.
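The exact TEEE and PER(M) formulas are given in Section 3.4, which lies outside this excerpt; the definitions below are therefore illustrative assumptions consistent with the text, one plausible variant scoring the overlap in timing between the n most extreme hindcasts and the n most extreme observations:

```python
import numpy as np

def teee(y_true, y_pred, n=20):
    # Assumed variant: fraction of the n most extreme observed months that
    # are missed by the n most extreme hindcasts; exactly 0.0 when all
    # extreme timings coincide, as stated in the text.
    top_true = set(np.argsort(y_true)[-n:])  # months of n largest observations
    top_pred = set(np.argsort(y_pred)[-n:])  # months of n largest hindcasts
    return 1.0 - len(top_true & top_pred) / n

def per(m_baseline, m_model, eps=1e-9):
    # Assumed relative-improvement form: the small eps in the denominator
    # guards against division by zero when the baseline error is exactly 0.
    return (m_baseline - m_model) / (m_baseline + eps)

y = np.sin(np.linspace(0, 12 * np.pi, 120))  # 120 dummy test-set months
print(teee(y, y, n=20))   # 0.0 -- perfect timing of extremes
print(teee(y, -y, n=20))  # 1.0 -- extremes entirely misplaced
```

When the baseline's TEEE is near zero, the epsilon keeps `per` finite but can still yield the very large, largely meaningless values mentioned above.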
As the metrics reported in Table 2 are averages over five randomly initialized models, we may thus confidently claim that the majority of models managed to outperform the baseline on a regular basis.
The pattern of scores over regions was clearly very similar for multivariate and univariate models; indeed, the Pearson correlations between the observed values were 0.975, 0.982, and 0.971 for MAE, DTW, and TEEE, respectively. Nevertheless, the multivariate models were systematically better, as evidenced by the averages in the last row of the table. Out of 18 regions, the univariate approach was better only in 5, 6, and 8 of them, for MAE, DTW, and TEEE, respectively. According to the conservative Wilcoxon signed-rank test, the differences were significant at the 0.1 level for MAE and DTW (p-values of 0.064 and 0.09, respectively). This corroborates our rationale expressed in Section 3.2, i.e., that multivariate neural models can capture the inter-cell patterns and thus offer better quality of hindcasting. Therefore, in the following, we focus exclusively on multivariate models.
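The paired test can be reproduced with SciPy's `wilcoxon`. The 18 paired region scores below are dummy placeholders, not the actual Table 2 values, so the resulting p-value illustrates the procedure rather than the reported 0.064 and 0.09:

```python
from scipy.stats import wilcoxon

# Paired comparison of multivariate vs. univariate PER scores across the
# 18 regions (dummy numbers for illustration only).
multivariate = [12.1, 5.3, 7.8, 3.2, 9.9, 4.1, 6.6, 8.0, 2.5,
                5.9, -1.2, 11.4, 3.3, 4.8, 22.5, 6.1, 7.7, 9.0]
univariate   = [11.0, 5.0, 7.1, 3.5, 9.1, 3.8, 6.9, 7.2, 2.9,
                5.1, -1.5, 10.6, 3.0, 4.2, 21.0, 5.5, 8.1, 8.4]

# Two-sided signed-rank test on the 18 paired differences.
stat, p = wilcoxon(multivariate, univariate)
print(p < 0.1)  # the dummy differences are mostly positive
```

The test is conservative in the sense used above: it makes no normality assumption and operates only on the signed ranks of the paired differences.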
In some regions, some metrics dropped below zero, which indicated a failure: the learning process could not model the dependency better than the baseline model. In other words, taking the long-term mean as the runoff value in the next month was more advantageous at such locations than the hindcasts of the machine learning model. One possible interpretation is that there are regions where the influence of the six selected climate variability indices was very weak or virtually non-existent. However, such cases were relatively rare in the multivariate variant: three regions for PER (MAE), two for PER (DTW), and two for PER (TEEE). Only one region (#11) had all three metrics negative.

Figure 3 illustrates the time series dynamics for the cell in region #15 (Celebes Sea) that scored high on all three metrics. Three time series were plotted: the actual, true value; the baseline (monthly mean of the long-term record); and the hindcast of the multivariate model. It can be observed that the model's hindcasts closely matched the true values and reflected quite accurately the quasi-periodic characteristic of the phenomenon. The alignment of predicted local extremes with their actual counterparts was also quite good, in contrast to the baseline. We found this closeness remarkable, given that the model had no access to the historic values of the river runoff: all the variability it produced was based solely on the six input variables (climate variability indices) observed in the 8-month period directly preceding the hindcast time period.
The heatmaps suggest that, at least for some regions, there seems to be a meaningful link between the independent variables and the river runoff. A particularly strong signal model was synthesized for region #15 (Figure 4f), where the average value of PER (MAE) was 22.5, the largest among the considered regions (Table 2). We hypothesized that this may have stemmed from this region being located in the zone of potential strong influence of ENSO.

Ablation Experiments
With the proposed models surpassing the baseline by substantial margins in most cases, it becomes interesting to ask about the underlying causes of this capability. Artificial neural networks are notorious for their inherent inability to explain their hindcasts. Nevertheless, in the usage scenario we adhered to in this study, some degree of insight can still be achieved by ablation experiments, i.e., studying reduced data sources and/or models and comparing them to the original models that have full access to the available data. We considered two types of such approaches here. In both of them, the starting point was the original model M, which had been fit to the training data on all available input variables. We performed this analysis for multivariate models only, for the reasons outlined in Section 4.1.
In the test time ablation (a.k.a. jackknifing), we kept the original model M and replaced the value of a chosen input variable (a climate index) with a fixed estimate; here, that estimate was the variable's average over the entire training set. We then queried M on the test data using that average and the true values of all remaining variables, and assessed its performance. Apart from being computationally cheap, this technique has the advantage of using the same model M in the reduced setting, which warrants direct comparison. On the other hand, however, replacing a variable with its expected value tends to seriously disrupt the distribution of data observed at test time (compared to the distribution observed at training/fitting time), and may thus impact the hindcast substantially.
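Jackknifing as described above amounts to a one-line modification of the test inputs. `jackknife_inputs` is an illustrative helper assuming a hypothetical (samples, k = 8, 6 indices) input layout, not the study's actual code:

```python
import numpy as np

# Test time ablation (jackknifing) sketch: the trained model is kept fixed;
# one input variable is replaced, across the whole test set, with its mean
# over the training period, and the model is then re-evaluated as usual.

def jackknife_inputs(X_test, X_train, var_idx):
    X_abl = X_test.copy()
    X_abl[..., var_idx] = X_train[..., var_idx].mean()  # freeze one index
    return X_abl

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1200, 8, 6))  # 100 years of 8-month windows
X_test = rng.normal(size=(120, 8, 6))    # the test decade

X_abl = jackknife_inputs(X_test, X_train, var_idx=5)       # e.g., ablate one index
print(np.allclose(X_abl[..., 5], X_train[..., 5].mean()))  # True
print(np.allclose(X_abl[..., :5], X_test[..., :5]))        # True
```

Because the model itself is untouched, any change in its score is attributable purely to the information removed from the frozen variable.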
In the training time ablation, we also hid one of the input variables, but did so already at the time of training/fitting. Thus, both training and testing were conducted on the reduced dataset. The advantage of this technique is that the absence of the concealed variable is known to the model already at the stage of training, and therefore it attempts to produce accurate hindcasts using the remaining variables, trying to maximize the "information yield" that can be extracted from them. The downside of this approach (apart from being computationally more expensive than jackknifing) is that it is hard to juxtapose the reduced model M' with the original M. This is particularly true for neural models, in which each change of the input (here, reduction of the number of input variables) requires architectural changes to the model (technically, reducing the number of input units). As M and M' vary in architecture and start their learning with a different, random set of parameters (weights), their final forms cannot be directly juxtaposed, other than by simply comparing their hindcasts.

Given the mutually complementary advantages of these ablation techniques, we conducted both of them. In Figure 5a,b, we present the results for the eight selected regions of interest for test time and training time ablations, respectively. Both figures present the PER (MAE) metric for the original, non-ablated model M in the leftmost column (All), and for the ablated configurations in the successive columns (All-NAO, All-PDO, etc.). Each cell in these heatmaps summarizes the test-set PER (MAE) of 11 models in an 11-fold cross-validation experiment conducted on 110 years of available data partitioned into 11 decades, as described in Section 3.3.
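The training time ablation with decade-wise 11-fold cross-validation can be sketched as follows. `fit_and_score` is a stand-in for training the reduced CNN (the actual models used TensorFlow), and the data are dummy values:

```python
import numpy as np

# Training time ablation sketch: drop one variable before fitting, then
# evaluate with 11-fold cross-validation where each fold is one decade of
# the 110-year monthly record.

def decade_folds(n_months=110 * 12, fold_len=10 * 12):
    for start in range(0, n_months, fold_len):
        test = np.arange(start, start + fold_len)
        train = np.setdiff1d(np.arange(n_months), test)
        yield train, test

def ablate_and_validate(X, y, drop_var, fit_and_score):
    X_red = np.delete(X, drop_var, axis=-1)  # remove the variable entirely
    return [fit_and_score(X_red[tr], y[tr], X_red[te], y[te])
            for tr, te in decade_folds(len(y))]

rng = np.random.default_rng(1)
X, y = rng.normal(size=(1320, 8, 6)), rng.normal(size=1320)
# Placeholder model: long-term-mean predictor scored by test-set MAE.
dummy = lambda Xtr, ytr, Xte, yte: float(np.mean(np.abs(yte - ytr.mean())))
scores = ablate_and_validate(X, y, drop_var=0, fit_and_score=dummy)
print(len(scores))  # 11 folds, one per decade
```

Each heatmap cell in Figure 5 then reports the average of the 11 fold scores for one (region, ablated variable) pair.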
For most input variables, their removal had a limited effect on the accuracy of the models, i.e., the values remain roughly consistent across rows. Nevertheless, most ablations were detrimental. For the test time ablation (Figure 5a), out of the 48 considered configurations (6 variables × 8 regions), we observed deterioration in 31 cases (65%) and improvements in 17 cases (35%), the latter specifically for NAO in five out of eight regions, PDO in four, AMO in three, EMI in two, SOI in three, and Nino in zero. For training time ablations (Figure 5b), the analogous figures were NAO: six, PDO: six, AMO: three, EMI: four, SOI: two, and Nino: zero, which totaled 21 improvements (44%) and 27 deteriorations (56%). The negative effect of variable hiding in the training time scenario was thus smaller, which was expected; the absence of the ablated variable was in part compensated for in training by eliciting useful information (e.g., partial correlates) from the remaining variables.
The Nino index clearly stood out in this analysis: its absence never improved PER (MAE), and the incurred loss in comparison to the all-variables scenario (All) was typically substantial. This suggests that Nino was an essential determinant for prediction of river runoff for all eight regions of interest spread across the globe.

The above observations should be taken with a grain of salt, as the averages in Figure 5a,b were estimated from an 11-fold cross-validation experiment, and were thus burdened by substantial variance. To address this aspect, we proceed now with a more in-depth analysis for individual regions. For brevity of presentation, however, we focused on two regions, the Celebes Sea Region (#15) and Georgia (#8), and show the detailed results for them in Figures 6–9. The analogous plots for the remaining regions can be found in Appendix A, Figures A1–A12.
In Figures 6 and 7, we present the results of this process for the Celebes Sea Region (region #15), for test time and training time ablations, respectively. In analogy to Figure 5a,b, both figures have the same form and present the PER (MAE) metric for the original, non-ablated model M on the entire region as the leftmost box-and-whisker plot (All), and for the ablated configurations in the successive plots (All-NAO, All-PDO, etc.). We kept the same ordering of variables as in Figure 5 to ease comparison.
Expectedly, the models performed best on average when given access to the full set of variables: the expected improvement of MAE compared to the baseline was just short of 20% for this region. Under test time ablations (Figure 6), the jackknifed configurations fared worse, but the effect size varied substantially with the removed variable. For NAO, PDO, AMO, and EMI Modoki, the effect was negligible. Only removal of SOI and Nino 3.4 had a strong observable impact. In particular, removal of Nino 3.4 reduced PER (MAE) by over 8 percentage points, which suggests a very strong dependency of the models on this indicator.
In training time ablations (Figure 7), the effect size was lower overall and did not exceed 4 pp at the median. The reason is that the models can compensate for the missing variable by learning as much as possible from the other variables. Interestingly, the ability to compensate does not correlate in general with the impact of a variable in jackknifing. The most striking example was SOI: while its absence affected the jackknifed models by almost 2 percentage points (Figure 6), it turned out to be less essential here. This suggests that the lack of information provided by SOI could be almost entirely compensated by the other variables. Note, however, that this does not necessarily mean that SOI depends on the remaining variables in any simple way (i.e., via high linear correlation/dependency), as neural networks are capable of modeling and exploiting complex, nonlinear, and noisy relationships between variables, even if they hold only in parts of their domains.
The overall wider variance of PER (MAE) for training time ablations (Figure 7 vs. Figure 6) may reflect the fact that reducing the dimensionality of data (from six to five input variables) imposes fewer constraints on the fitting process and enlarges the space of models that can be fit to them (in training time ablations, we applied only minimal modifications to the architecture of the neural networks, i.e., removed only the corresponding input units, so that the overall number of parameters of the model remained largely unchanged). This may lead to overfitting (and consequently lower PER (MAE) in comparison to the All scenario), but occasionally translates into improvements, which seemed to be the case for All-SOI.
Figures 8 and 9 present the analogous results for region #8 (Georgia, USA) for jackknifing and training time ablations, respectively. In this region, the gain achieved by models trained on the complete set of features (All) was lower than for the Celebes Sea: ≈14 pp on average. Except for Nino 3.4, the impact of jackknifing (Figure 8) seemed less prominent (cf. Figure 6). For training time ablations (Figure 9), the differences were more prominent, and the causes may have been analogous to those discussed above for Figure 7. Interestingly, however, we may observe a similar pattern in both figures, with the models fitted to the All-PDO and All-SOI datasets faring comparably well to the rest, often better than the base models (All). This may suggest that PDO and SOI may be in a sense detrimental for modeling of river runoff in this region, i.e., they may carry certain spurious patterns that deceive the fitting process and cause it to underperform at test time.
Importantly, all results presented in Figure 5a,b and Figures 6–9 indicate the critical impact of Nino 3.4 on the quality of modeling. The effect size at the median when concealing this variable varied from 3.5 pp (Figure 7) to ≈11 pp (Figure 8). In summary, Nino 3.4 is thus clearly the strongest determinant of the quality of the models considered here. On the other hand, the impact of removal of the remaining indices was moderate and did not exceed 2 pp (SOI in the jackknifing experiment, Figure 6). This suggests that, except for Nino 3.4, all subsets of indices considered in this experiment carried broadly a similar amount of information useful for the modeling of river runoff. Results of ablation experiments for the remaining six study areas represented in Figure 4 are provided in Appendix A, Figures A1–A12.

Discussion and Conclusions
The work conducted within the "Interpretation of Change in Flood-related Indices based on Climate Variability" (FloVar) project demonstrated that natural climate variability (i.e., oscillations in the ocean–atmosphere system) alone carries important and useful information relevant to the spatio-temporal field of river runoff, a finding of considerable and broad relevance.
We demonstrated that machine learning (ML) lends itself well to modeling the links between climate variability indices and the process of river runoff. In most of the 18 studied regions scattered over six continents, the hindcast based on the values of the six climate variability indices alone, observed in the preceding 8-month time window, outperformed the baseline, i.e., the naïve hindcast resulting from the long-term mean for a particular month. The superiority of the multivariate models (Section 4.1) signals that the CNNs used here can effectively capture complex patterns of events and dynamics co-occurring in multiple cells of a region.
Even though neural networks are notorious for their inherent inability to explain their predictions, we attempted to provide some interpretations by conducting ablation experiments. Our ablation experiments (Section 4.2) suggest the critical impact of the El Niño phenomenon, as captured by the Nino 3.4 index, on the modelling performance, while the impact of removal of the remaining five indices was found to be moderate. This points to a significant degree of redundancy among these five indices. The sample of heatmaps bodes well for the possibility of more extended hindcasting, which would use not only the past climate variability indices but also the past values of river runoff in every grid cell of concern. However, this was beyond the scope of the project.
Direct comparison of our results of ablation experiments with the existing scientific literature focused on the application of teleconnection indices to the study of regional climatic conditions does not seem possible for at least three reasons: (i) our study areas ("rectangular" sets of cells) did not map directly onto the regions analyzed in the literature; (ii) the existing references do not broadly agree on the links between indices of ocean–atmosphere oscillations and runoff fields, as demonstrated in [65,66]; (iii) no reference known to the authors includes all six climate variability indices considered in our paper.
Our results demonstrate that climate variability indices, representing the state of the ocean–atmosphere oscillation, indeed carry relevant information that can be associated with the river runoff process in many, but not all, studied areas on six continents. This is quite understandable. There is no question that teleconnections may express relations between variables over a very large distance, but a particular climate variability index is likely to have a specific, spatially organized zone of influence. Beyond this zone, the links between a specific climate variability index and the river runoff process are likely to be weak, very weak, or non-existent. A difficult challenge for the future would be to reliably determine the zones of influence of specific climate variability indices.

Author Contributions:
The idea of the work was proposed by K.K. and Z.W.K. The data were acquired and processed by M.N., who performed all the calculations, prepared the artwork, and produced a Master of Engineering thesis. Z.W.K. drafted the initial outline of this paper and, together with M.N. and K.K., prepared a preliminary and incomplete text. Next, M.N. made additional calculations and prepared additional artwork and, together with K.K. and Z.W.K., provided several rounds of amendments to the text. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement:
The sources of the climate variability indices are listed in Table 1. The ERA-20CM-R river runoff data have been obtained upon request from Rebecca Emerton and Ervin Zsoter from the European Centre for Medium-Range Weather Forecasts (ECMWF).

Acknowledgments:
The authors acknowledge the receipt, on request, of ERA-20CM-R river runoff data from Rebecca Emerton and Ervin Zsoter from the European Centre for Medium-Range Weather Forecasts (ECMWF). The authors also thank the academic editor and three anonymous reviewers for their constructive comments that helped us enrich our paper.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Figure A1. Results of test time ablation (jackknifing) for the study area including Germany (region #6).
Figure A2. Results of training time ablation for the study area including Germany (region #6).
Figure A3. Results of test time ablation (jackknifing) for the study area including Croatia (region #7).
Figure A4. Results of training time ablation for the study area including Croatia (region #7).
Figure A5. Results of test time ablation (jackknifing) for the study area including Poland (region #8).
Figure A6. Results of training time ablation for the study area including Poland (region #8).
Figure A7. Results of test time ablation (jackknifing) for the study area including Singapore (Asia, region #12).
Figure A8. Results of training time ablation for the study area including Singapore (Asia, region #12).
Figure A11. Results of test time ablation (jackknifing) for the study area including Brisbane, Queensland (region #18).
Figure A12. Results of training time ablation for the study area including Brisbane, Queensland (region #18).