Short-Term River Flood Forecasting Using Composite Models and Automated Machine Learning: The Case Study of Lena River

Mikhail Sarafanov; Yulia Borisova; Mikhail Maslyaev; Ilia Revin; Gleb Maximov; Nikolay O. Nikitin

doi:10.3390/w13243482

,

and

National Center for Cognitive Research, ITMO University, 49 Kronverksky Pr., 197101 Petersburg, Russia

^*

Authors to whom correspondence should be addressed.

Water2021, 13(24), 3482;https://doi.org/10.3390/w13243482

This article belongs to the Topic Recent Advances in Hydroinformatics: Focusing on Machine Learning and Remote Sensing in Hydrology

Version Notes

Order Reprints

Abstract

The paper presents a hybrid approach for short-term river flood forecasting. It is based on multi-modal data fusion from different sources (weather stations, water height sensors, remote sensing data). To improve the forecasting efficiency, the machine learning methods and the Snowmelt-Runoff physical model are combined in a composite modeling pipeline using automated machine learning techniques. The novelty of the study is based on the application of automated machine learning to identify the individual blocks of a composite pipeline without involving an expert. It makes it possible to adapt the approach to various river basins and different types of floods. Lena River basin was used as a case study since its modeling during spring high water is complicated by the high probability of ice-jam flooding events. Experimental comparison with the existing methods confirms that the proposed approach reduces the error at each analyzed level gauging station. The value of Nash–Sutcliffe model efficiency coefficient for the ten stations chosen for comparison is 0.80. The other approaches based on statistical and physical models could not surpass the threshold of 0.74. Validation for a high-water period also confirms that a composite pipeline designed using automated machine learning is much more efficient than stand-alone models.

Keywords:

flood forecasting; automated machine learning; composite artificial intelligence

1. Introduction

1.1. Modelling of the River Floods

River floods can be considered as a crucial type of dangerous event. The damage caused by a river flood may reach hundreds of thousands of dollars [1,2]. Moreover, natural hazards take lives and affect a large number of people every year [3]. Because it is vital to predicting such floods successfully, there are a lot of scientific works dedicated to flooding operational modeling based on various solutions [4].

To reduce the damage caused by natural disasters, forecasting systems are used. They are based on models that predict water levels and flow rates. Classical methods used in hydrology are based on determining the dependence between meteorological data, basin and substrate characteristics, and modeled target values [5]. There are widely used approaches to modeling nature processes, which were implemented in the form of hydrological models. Such models mainly rely on the deterministic approach, but some models operate with elements of stochastic nature [6]. It allows us to reproduce a wide class of processes with suitable hydrological models. The advantage of such models is that they are can be well interpreted by experts [7]. Most models are not implemented for a specific region but represent the bonds between the components of almost any river system.

Model calibration is used to take into account the specific characteristics of a particular region [8]. Adapting the model to the desired domain allows reducing forecast error. However, for complex models, calibration can be too complicated and computationally expensive. Moreover, complex physical models require a lot of data that may not exist or may not be insufficient quantity for the modeled river basin. Due to the difficulties with calibration, the application of such models becomes inefficient in some cases [9].

Another way of physical modeling for flood prediction is explaining flow patterns through differential equations. The disadvantages of the equation-based models are the unstable solutions caused by errors accumulation [10] and high computational and algorithmic requirements. Moreover, such approaches may be challenging to generalize to other rivers and basins, for which they require the addition of new parameters.

To take local effects into account, data-driven approaches based on machine learning (ML) methods can be used [11]. They are less interpretable but can produce high-quality forecasts [12]. On the other hand, complex machine learning models, such as neural networks, require a lot of data to train.

To account for the advantages of physical models (interpretability) and machine learning models (low forecasting error), ensembling is used [13]. Such hybrid models allow for a refinement of predictions derived from hydrological models [14] using ML techniques. To improve such multi-level forecasting models data fusion methods [15] can be used. Such approaches based on the ensemble of different models (machine learning- and physics-based) are becoming more popular in the design of predictive systems. Also, the perspective approach that allows to train and calibrate models differently for each data source can be proposed. In that case, distortions and errors in one source cannot lead to critical errors in the whole system [16]. Unfortunately, such complex hybrid pipelines are hard to identify and configure.

To simplify the identification task, automated machine learning (AutoML) can be applied. Methods and algorithms of AutoML are devoted to automated identification and tuning of ML models. A main concept of the AutoML approach is the building of ML models without the involvement of an expert or with their minimal involvement. However, most of the existing AutoML tools cannot be used to obtain heterogeneous (composite) pipelines, that combined models of different nature (for example, ML models and hydrological models). At the same time, composite modeling is especially promising for flood forecasting since it allows combining the state-of-the-art data-driven and physics-based methods and models according to the Composite Artificial Intelligence concepts [17].

In the paper, we propose an automated approach for short-term flood forecasting. The approach is based on the application of automated machine learning, composite models (pipelines), and data fusion. Application of the various data sources (meteorological, hydrological, and remote sensing measurements) in combination with automated identification and calibration of the models make it possible to construct composite pipelines. Machine learning methods are used to forecast time series of daily water levels. Short-term prediction of this variable allows us to avoid negative effects from its rapid increase through prompt decision-making. Involvement of the physical models allows improving the interpretability of the final pipeline. This solution can be used for the analysis of the river’s condition.

1.2. Existing Methods and Models for Flood Forecasting

In this subsection we are analyzing the most significant causes of floods and the specific features of the Lena River basin, which do not allow to apply of existing approaches for flood modeling in this area.

Siberia region floods are of interest to researchers because of their destructive power and annual occurrence [18,19]. Regarding the Lena River, its condition has been studied and modeled by researchers as classical methods of expert search for dependencies between the components of the system [20] as numerical methods. Because the river flows in a cold climate, most of the year channels are covered with ice and snow. So numerical models for this region are created to assess both ice conditions [21] and channel changes during the ice-free period [22].

The major difficulty in modeling floods on the Lena river is in a high variety of flood causes. The largest floods occur when the superposition of several factors of water level rise at once: a regular rise associated with melting snow in spring and extreme rises in from ice-jam. It is difficult to predict such floods on the basis of snowmelt or on retrospective statistical information alone.

Based on an extensive literature review [23], we can consider the following causes of river flooding (most relevant for the Lena river):

Floods caused by heavy rainfall [24];
Floods due to melting snow [25];
Surges from the sea [26];
Extreme rises in water levels due to ice jams, debris flows, rocks blocking the riverbed, etc. [27,28];
A combination of several factors.

For the Lena River basin, flooding in the spring period is most typical due to snow melting, precipitation and a high probability of ice-jam formation [28]. The following ways of modeling each type of flood are presented in Table 1.

Table 1. Methods and models for flood forecasting. The “Flood type” column shows the flood factors that the model is capable of incorporating.

As can be seen from the table, there are several approaches that can be suitable for Lena River floods forecasting. Nevertheless, each of them has disadvantages that do not allow to use of only one method. Therefore, we decided to build a composite model (pipeline) based on several blocks. Each of the individual blocks is modeling the desired component of the flood.

It is worth adding that the frequency of occurrence of each cause is different [40]. The rise in water levels associated with melting snow can be attributed to regular (annual) phenomena. At the same time, conditions associated with high precipitation and the formation of ice jams are non-periodic. Such causes are highly dependent on the synoptic situation in the region. In addition, the precipitation field is spatially heterogeneous, which makes it impossible to predict the average increase in water at all hydro gauges. Precipitation may cause localized increases in water levels in some parts of the river due to different rainfall intensities and soil characteristics. The same problem exists for ice-jam floodings. The rise of the level at a specific level gauge not always leads to an increase in the level at the neighboring ones. So, it becomes extremely important to consider local features not only for the basin but for specific sections of the river and watersheds.

The second important factor is the remoteness of the Lena River from major cities. The river basin is very large, and at the same time, the number of hydro gauges is not high enough. For this reason, the models that are based on remote sensing data are often used for flood forecasting on the Lena River [41]. A widespread method is the application of models or manual interpretation to remote sensing data for the tasks of detecting and predicting river spills. The example is presented in article [42], which demonstrates the calculation of a set of river state parameters, such as discharge, propagation speed, and hydraulic geometry with Moderate Resolution Imaging Spectroradiometer (MODIS) satellite imagery. Also, Landsat satellite images were used for monitoring the spatial and temporal extent of spring floods [43] on the Lena River. Also, remote sensing data are calibrated with in situ observations [44,45], which allows supplementing the data. The context of such works implies the detection of the current state of the river. The emphasis on predictive models is usually not made. Numerical models allow reproduction of the state of a watercourse at any period of time, based on some input set of actual data.

We can conclude that the rise in water levels in the Lena River depends on many factors. Each of these causes contributes to the overall increase in the level at all hydro gauges on the river, as well as at only some of them. In the paper, we refer to the floods that are caused by a combination of several causes as multi-causes flooding. For example, when the water level rises due to both snow melting and the ice-jams formation. Successfully managing all of these factors requires a comprehensive approach. The described approaches are (1) too complicated to prepare and calibrate, or (2) do not take all crucial flood causes into account. Therefore, in the proposed approach we design an ensemble of several models to combine the advantages of different methods in a single composite pipeline. The novelty of this approach is the automatic identification of individual blocks in the predictive system, as well as the hybridization of models for multi-causes flood forecasting.

Each of the individual blocks in the pipeline is designed to predict a flood based on one or several of the described causes. Each block is specialized to reproduce the required patterns from the provided data only. The simulation blocks can be physical models as well as ML models. Since the identification and tuning of ML models requires the involvement of an expert and takes a lot of time, special algorithms for automation are used. The proposed AutoML approach is based on the evolutionary algorithm for identifying ML models. It allows to automatically identify models, tune hyperparameters, and combine stand-alone models into ensembles. Moreover, the optimal ML model structure search process is designed in such a way that can simplify the structure of the pipelines. This allows obtaining stand-alone models as the final solution if that is sufficient for successful forecasting. In addition, modern AutoML solutions have already shown their high accuracy compared to stand-alone models [46]. It also makes the approach more flexible, since identifying individual blocks becomes possible without involving an expert. Consequently, the forecasting system is easier to transfer to new basins and rivers if AutoML algorithms are used.

We validated this approach on the Lena River basin. Floods in this territory are an actively and sufficiently studied common occurrence. Such phenomenon is annual and directly depends on runoff factors also including ice-jam formations [18]. For this reason, the analysed area is a suitable for the validation of constructing multi-causes flood system.

The paper is organized as follows. Section 1 contains the description of the existing approach for flood forecasting. Section 2 described the different aspects of the proposed automated approach and validation methods. Section 3 provides the extensive experimental validation of the proposed approach for the Lena River case study; Section 4 described the main limitation of the study and obtained insight. Finally, Section 5 summarizes the achieved results and provides references to the code and data used in the paper.

2. Materials and Methods

2.1. Input Data for Case Study

The input data for modeling and validation are provided by The Russian Federal Service for Hydrometeorology and Environmental Monitoring and Federal Agency of Water Resources [47,48]. For both physical modeling and the machine learning approach, data from hydro gauges and weather stations were used. Locations of stations and zone of interest are shown in Figure 1.

Figure 1. Spatial position of observation stations and Lena River basin.

The Lena River is located in Eastern Russia. The area of the basin is 2,430,000 km

^{2}

[49]. This region can be characterized by a Siberian climate with extremely low temperatures during winter. The river flows entirely in permafrost conditions [50] from south to north, which is increasing the probability of ice-jam flooding during spring periods [28,51]. As was mentioned earlier, the following causes of spring floodwater on the Lena River include: [28]:

Large amounts of precipitation during autumn;
Low temperatures during winter with extensive accumulation of snow;
Early prosperous spring with rapid increasing of daily temperature;
Heavy precipitation during snowmelt.

The rise in the water level, as well as the increase in flow, typically occurs during the beginning of May to mid-July (Figure 2).

Figure 2. Flow graph for “Tabaga” hydro gauge on the Lena River. The blue line shows the average values for the period from 1985 to 2011. The black lines show the values of river flow for specific times each year.

The successful prediction of extreme water level rises requires using not only the prehistory of water level but also information from weather stations and date and time values. It is also vital to aggregate the features for the previous periods which show the conditions in which the snow cover formed in the river basin, etc. To get the most complete and operational information about the modeled basin, various data sources (tables, time series, images) were used:

Water levels at gauging stations in the form of time series with daily time resolution;
Meteorological parameters measured near the level gauging stations and additional information about the events at the river;
Meteorological parameters from weather stations. Time resolution for parameters may differ from three hours to daily. This data refers to points located at some distance from level gauging stations;
Additional information, derived from open sources, such as remote sensing data from MODIS sensor.

Parameters that were recorded at the weather stations and hydro gauges are shown in Table 2.

Table 2. Data sources and parameters that were used for fitting and validation of the models.

Remote sensing data from the MODIS sensor was used as an additional data source to train the physical model. If necessary, interpolation and gap-filling were applied to the values of the meteorological parameters from tables.

2.2. Composite Modelling Approach

Composite models (pipelines) show strong predictive performance when processing different types of data [52]. Therefore, to build a precise and at the same time robust model, it was decided to use a composite approach with automatic machine learning techniques and evolutionary computing. Each of the composite pipeline blocks was compiled either completely using automatic model design generation or using optimization algorithms for block (model) calibration. We propose to build a composite flood forecasting model consisting of three blocks:

Machine learning model for time series forecasting. Predictions of such a model based on time series of water levels. The number of such models equals the number of gauging stations;
Machine learning model for multi-output regression. Predictions of such a model based on meteorological conditions. The number of such models equals the number of gauging stations;
Snowmelt-Runoff Model. The physical model is implemented and configured for the most critical level gauges;

The target variable for such a system to predict is the maximum daily water level. The configuration of the final model together with the model blocks can be seen in Figure 3.

Figure 3. Data processing scheme in the composite pipeline. The ensemble consists of three models, each of which is trained and calibrated on separate data source.

The pipeline presented in Figure 3 allows using as much data as possible. The primary reason to use an ensemble in a composite pipeline is a reduction of forecast error compared to stand-alone models [53]. Also, it allows specialization of individual blocks of the composite ensemble to solve the utilitarian task.

As an example, the multi-output regression model predicts water levels based on mostly meteorological data. The time series forecasting model predicts only on the basis of previous level values using autoregressive dependencies. Then, if incorrect values of meteorological parameters are obtained and the multi-output regression model is producing incorrect forecasts, the time series forecasting model still can preserve the stability of the ensemble. The use of the physical model makes it possible to interpret the forecast.

Therefore, if any of the models produce an unrealistic result, the expert can refer to the snow-melt runoff model on the critical (main) gauging station and check the adequacy of the system output.

Another advantage of the composite approach is its robustness - not only to distorted measurements, but also to the lack of data sources. If multiple data sources exist, it is possible to re-configure the pipeline and use the blocks with available input data only.

Individual blocks based on machine learning models are identified using the evolutionary algorithm for automatic machine learning described in the paper [52]. Then, the ensembling was used to combine the predictions of the individual blocks in the composite pipelines. Most solutions with using model ensembling are aimed at ensuring that the ensemble is “sufficiently diverse”. Therefore, the errors of individual algorithms on particular objects will be compensated for by other algorithms. When building an ensemble, the main task is to improve the quality of modeling of basic algorithms and increasing the diversity of basic algorithms. This effect is achieved through the following methods: changing the volume of the training data set, target vector, or set of basic models. In this paper, the Random Forest model was used as meta-algorithm for ensembling. The predictors were the following: the predictions of the individual blocks, month and day. Next, a general scheme for optimizing the resulting ensemble is applied. The rationale for using the meta-algorithm is the fact that an ensemble can represent a significantly more complex function than any basic algorithm in many cases. So, ensembles allow us to solve complex problems with simple models. The effective adjustment of these models to the data occurs at the level of a meta-algorithm.

To find the optimal hyperparameters values for the ensemble model we apply Bayesian optimization [54]. The Bayesian optimization is a modification of random search optimization. The main idea is to choose the best area of the hyperparameter space, based on taking into account the history of the points at which the models were trained and the values of the objective function were obtained. This hyperparameter optimization method contains two key blocks: a probabilistic surrogate model and an acquisition function for the next point. We can consider this as an ordinary usual machine learning task, where the selection function is a quality function for the meta-algorithm used in Bayesian hyperparameter optimization, and the surrogate model is the model that we get at the output of this meta-algorithm. The widely-used functions are Expected Improvement, Probability of Improvement, and, for surrogate models, Gaussian Processes (GP) and Tree-Structured Parzen Estimators (TPE) [55]. At each iteration, the surrogate model is trained for all previously obtained outputs and tries to get a simpler approximation of the objective function. Next, the selection function evaluates the “benefit” of the various following points using the predictive distribution of the surrogate model. It balances between the use of already available information and the “exploration” of a new area of space.

2.3. Data Preprocessing

Qualitative preparation of data before fitting the machine learning models or calculating physical models provides the greatest possibility to improve the result of prediction.

As was mentioned before, the primary data set contains observation from two sources: weather stations and hydro gauges, which are located at some distance from each other. The problem can be illustrated by Figure 4. Thereby, some parameters require interpolation to neighboring points for forming the complete dataset for each gauging station.

Figure 4. The task of the meteorological parameters values interpolation to the level gauges. The example of mean daily air temperature is presented. Sign “?” means that the value of the parameter in the station is unknown.

Even a simple interpolation algorithm can be used to resolve this problem. But there is a need to keep the approach flexible: some weather stations that are close (by longitude and latitude coordinates) to the gauging station can represent the real situation in a non-precise way. It can be caused by topography inhomogeneities.

For example, the points are close to each other, but the nearest weather or gauging station is located in the lowlands, and the two neighboring ones are located in the hollows among the ridges. In this case, the best solution would be to take values only from the nearest weather station and not to interpolate at all. In the other case, a good solution would be to average the values from five neighboring weather stations.

Bilinear interpolation and nearest-neighbor interpolation are not suitable for such tasks. Therefore, it was decided to implement an approach where the K-nn (K-nearest neighbor) machine learning model would interpolate the values. The prediction of the new value of the meteorological parameter is based on two predictors in this case: latitude and longitude. The number of neighboring weather stations for the required gauging stations can be changed. The categorical indicators can also be “interpolated” with this approach. In this case, it is necessary to replace the K-nn regression model with the classification model.

The other problem that can be found during data preprocessing is the presence of gaps. Various ways of gap-filling can be applied to time series of meteorological, hydrological, or other types of environmental data. Such gap-filling methods include relatively simple ones, based on linear or spline interpolation. In more advanced approaches linear regression such as ARIMA [56], ensemble methods, gradient boosting [57] or even neural networks approach [58,59] can be used.

Also, more complicated gap-filling approaches exist. the paper [60] shows that one of the most accurate ways is a bi-directional time series forecast adapted for gap-filling. So, it was decided to fill the gaps based on the available preceding and succeeding parts of the time series. Also, there are ‘native’ gaps in the indicators related to ice conditions and snow cover—for the warm season it was not filled for obvious reasons.

For this research, two methods of gap-filling were used depending on the size of the gap. There were two general types of data: with pronounced seasonal component and possessing some stationarity. While the size of the gap is small and calculations are in values of up to ten or the time series especialty is stationarity, the autoregressive algorithm was used. It provides less evaluation time and sufficient quality in such cases. But for lengthy gaps and for time series with seasonality this approach as any other linear approaches lead to inadequate results due to its specificity. This case is quite common in the analyzed data of river monitoring-for some time series gaps reached several years.

For solving this problem, the following method was used. First of all, time series with gaps was decomposed into three components: trend, seasonality, and residuals. Decomposition was provided by the LOESS (locally estimated scatterplot smoothing) method, which is common to use with natural data [61,62].

The several years of recordings were lost in data, but all time-series include thirty years of observations, seasonal and trend components, which is the most significant to keep. The combination of trend, seasonality and median value of the data in the range of the gap fills it preserving the general peculiarities of time series. The results of the approach described above are presented in Figure 5.

Figure 5. Filling in the gaps by extracting the seasonal component. (1) seasonal component of time-series; (2) reconstructed time-series. The red boxes show the reconstructed sections of the time series.

2.4. Time Series Forecasting

Currently, an approach based on the analysis of signals produced by the system is widely used to study the properties of complex systems, including in experimental studies. This is very relevant in cases where it is almost impossible to mathematically describe the process under study, but we have some characteristic observable values at our disposal. Therefore, the analysis of systems, especially in experimental studies, is often implemented by processing the recorded signals—time series. For example, in meteorology—time series from meteorological observations, etc.

It should be noted that the time series differs significantly from tabular data since its analysis takes into account not only the statistical characteristics of the sample [63], but also the temporal relationship. There are two main directions for the time series analysis: statistical (probabilistic models, autoregressive models) and machine learning approaches.

Autoregressive models with a moving average (ARIMA) are used for statistical processing of the time series. When using such models, it is necessary to bring the initial series to a stationary form, which can be done using operations of trend removal application of statistical estimates and noise filtering.

The set of models and methods that should be developed to apply ML to real data should be selected according to the properties of the data and the modeling task. Automated machine learning (AutoML) technologies can be used to simplify the design process of such models. The automatic creation of ensemble models can improve the quality of modeling of multiscale processes. The specifics of these data require a specific approach to their processing.

This section discusses the application of composite pipelines obtained using automatic machine learning algorithms to better predict multiscale processes represented by sensor data. The proposed approach is based on the decomposition of the data stream and correction of the modeling error achieved due to the specific structure of the composite pipeline. However, simply adding or mixing multiple models may not be enough to handle real data sets.

We can analyze a model created using the AutoML is details as shown in Figure 6. It can be divided into two sub-pipelines: “smoothing-ridge-ridge regression” and “ridge-ridge regression”, which are combined using the “ridge regression” node. The nodes of the algorithm contain various operations. Using the described approach, the pipeline can be configured in the following way: one of the models analyzes the periodic component, and the other analyzes the components associated with trends.

Figure 6. Example of generated pipeline with operation names in the nodes.

The smoothing operation is a Gaussian filter. Mathematically, the Gaussian filter modifies the input signal by convolution with the Gaussian function; this transformation is also known as the Weierstrass transform. It is considered an ideal filter in the time domain.
The lagged operation is a comparison of a time series with a sequence of multidimensional lagged vectors. An integer L (window length) is selected such that. These vectors form the trajectory matrix of the original time series. This type of matrix is known as the Hankel matrix elements of the anti-diagonals (that is, the diagonals going from bottom to left to right) are equal.
Ridge regression is a variation of linear regression, specially adapted for data that demonstrate strong multicollinearity (that is, a strong correlation of features with each other). The consequence of this is the instability of estimates of regression coefficients. Estimates, for example, may have incorrect signs or values that far exceed those that are acceptable for physical or practical reasons.

The pipeline “lagged-ridge regression” transforms the maximum daily water level one-dimensional time-series into a trajectory matrix using a one-parameter shift procedure. This makes it possible to use lagged vectors as features for a machine learning model without involving exogenous data. Furthermore, ridge regression is applied to the obtained trajectory matrix, since it is more likely that some columns of the matrix are correlated with each other. It can lead to unstable estimates of regression coefficients. Adding a filtering operation allows “smoothing out” the original time series, reducing the variance and increasing the conditionality of the trajectory matrix at the stage of lagged transformation. Visualization of the composite pipeline and operations in the nodes of this model is shown in the Figure 7.

Figure 7. The detailed representation of the composite pipeline structure and operations in the nodes.

The approaches described above were implemented in the AutoML framework FEDOT. This allowed us to quickly identify models when forecasting time series. Based on this, we can say that the composite pipeline obtained using AutoML contains an interpretation of physical processes described by a time series, since lagged vectors are used as features that reflect the cyclicity and variability of processes. And at the same time, this model takes into account the features of the data from the point of view of machine learning.

2.5. Multi-Output Regression

Also, we implemented a multi-regression model as an additional block of the composite pipeline. We used the data historical data from weather stations and level gauges (Table 2) to find the dependence between the target variables and the features.

Figure 8 presents the dynamics of the daily water level and the snow height simultaneously. The peaks of the orange curve can be interpreted as periods of snow accumulation and extremes below zero can be called periods of snow melting. When the snow reaches a minimum, the water level begins to rise.

Figure 8. The dependence of the amount of water on the snow height.

The target variable is a water level that depends on the features. In our case, it is important to use not only the values at a current time of these features but also to take into account the values over a period of time.

As an example, the correlation between the target variable and snow height can be analyzed. To take into account whether the amount of snow decreased or increased over the past few days, data aggregation can be used. We use several aggregation functions: average, amplitude, most frequent, and summation value. The features in Table 3 are calculated as an aggregation state for the past seven days, and the columns (“1”, “2” etc.) are water level predictions for the seven days ahead.

Table 3. Example of aggregated data for forecasting of seven days ahead.

It can be useful to take a closer look at the example with the amplitude of snow cover height. We want to calculate the amount of melted snow over the past few days, as it strongly correlates with the water level. In the case of snow height, this is not exactly the amplitude, where the value is calculated by the formula max–min. For snow height, we count the amount of melted precipitation over a certain time. The considered example covers the period from 10 April 2020, to 30 April 2020, and the time interval for aggregation is equal to seven days. In this case, we can find the difference between the current and past day observations and accumulate difference if it is less than zero and continue accumulating for the aggregating interval. So we can know “how many cm of snow melted during the seven days”.

Figure 9 can be used to explain the process of features aggregation. The red line shows the actual values of the snow height and the green line shows the aggregation state of the accumulated melted snow over the past seven days. Let’s pay attention to the time period indicated by the orange line, when the actual amount of snow decreases, the amount of melted snow increases, which well describes the water level. In this case, accumulated value allows estimating the amount of snow that has melted over a period of time and using it as a feature for prediction.

Figure 9. The amount and aggregated state of snow from 10 April 2020 to 30 April 2020. Green is the real amount of snow, red is aggregated state for past 7 days.

We have made similar transformations for other meteorological parameters. However, there is a categorical feature describing the events on the river, and they are recorded in the form of phrases including “Ice drifts”, “Excitement”, “Incomplete freeze-up”, “Jam above the station”, etc. We decided to rank the events (Table 4) according to the degree of influence on the water level in the river and have already applied aggregation-sum to the ranked estimates. If the sum of the ranks is large, then we should expect a noticeable change in the water level in the river and vice versa.

Table 4. Some events and importance ranks for them.

The set of additional features were obtained during the process of data preprocessing and aggregation. Regular features are:

temp (FLOAT)—average daily water temperature.
ice-thickness (INT)—thickness of ice.
snow-height (INT)—the height of snow on ice.
discharge (FLOAT)—average daily water consumption, m $^{3}$ /s.
water-level (INT)—is the target variable indicating the maximum water level.

Aggregated features are:

events—(TEXT)—the sum of events ranked by the degree for 30 days, correlated to the water level.
discharge-mean (FLOAT)—average for five days water consumption, m $^{3}$ /s.
temp-min (FLOAT)—minimum temperature on water level for 7 days. At low temperatures, ice formation, slowing of the river flow and, as a result, an increase in the water level is possible.
ice-thickness-amplitude (INT)—amplitude thickness of ice for seven days shows an increase or decrease in the thickness of the ice.
snow-height-amplitude (INT)—amplitude of snow height. Directly proportional to the water level.
water-level-amplitude (INT)—the prediction of the water level depends heavily on the data for the past few days. By aggregating the values for the past seven days, we are essentially trying to predict a time series.

The prepared set of features allow us to build a multi-target regression model to predict the water level for several days ahead. We decided to use the AutoML-based approach to generate the multi-target regression pipelines that consist of models and data operations. The best-found pipelines, based on validation data for flood forecasting, are presented in Figure 10.

Figure 10. Several modelling pipelines generated by AutoML-based approach and used for the flood forecasting.

The presented pipeline consists of two types of nodes: primary (blue) and secondary (green); raw data is received in primary nodes. Each node is a model or an operation on data. In the upper left corner, there is a pipeline, where the primary node is the principal component analysis block. It is an operation on data that reduces the feature space while not losing useful information for prediction. The secondary node in the pipeline is a random forest regression, which is a model to give a final prediction. The combining models and operations make it possible to build the sub-optimal pipeline that is effective in terms of the predictive accuracy for the water level for seven days ahead. The identification and optimization of the structure were conducted using the composite AutoML approach described above.

2.6. Physical Modelling

In addition to the presented data-driven approaches, we chose to include a physics-based element into the ensemble. As the base for our model we have utilized SRM (Snowmelt-Runoff Model) [33], which is designed to calculate water discharge for the mountainous river basins, where water, incoming into the river, originates from melting snow and rain. The following assumption is stated: all discharge is generated by sources inside the processed basin. It is limiting the model scope to the minor rivers. However, in the analyzed case, the Lena river has a large water catchment area, that has varying conditions and additional gauges located upstream which can be used for discharge evaluations. We tried to alleviate the scope restriction by introducing a transfer term, that connects the discharge in the downstream gauge with the discharge of the upstream one in the previous day.

The main input variables, used in the SRM model, include temperature, from which the number of degree-days

T_{n} (^{\circ} C \cdot d)

is calculated, snow cover

S_{n}

(%)

, and rainfall

R_{n} (cm)

. To initialize the modeling, a number of basin parameters shall be determined: snowmelt and rain runoff coefficients (

α_{1}

and

α_{3}

correspondingly, dimensionless parameters), degree-day factor (

cm \cdot^{\circ}

C

^{- 1} \cdot d^{- 1}

), and area A (

{km}^{2}

). Additionally, we have to compute the recession coefficient

α_{4} = \frac{Q_{k}}{Q_{k - 1}}

, and

α_{5}

defining the transfer of the river water between gauges (dimensionless), where k and

k - 1

have to be evaluated during the periods with no additional water intake into the river. The final equation for the discharge calculation took form (1), where we make the prediction for the m-th gauge at n-th time point.

Q_{n + 1}^{m} = [α_{1} α_{2} (T_{n} + Δ T_{n}) S_{n} + α_{3} R_{n}] A \frac{10000}{86400} (1 - α_{4}) + α_{4} Q_{n}^{m} + α_{5} (Q_{n}^{m} - Q_{n}^{m - 1})

(1)

To calibrate the model we have utilized the differential evolution algorithm, that searched for the model parameters, that produce the minimal error between model predictions and observed values on the historical observations. The optimized function took form of discrepancy between prediction

Q_{n p r e d}^{m}

, calculated by the Equation (1), and observed value

Q_{n o b s}^{m}

, that has to be optimized in response to the parameter vector

α = (α_{1}, . . ., α_{5})

. The specifics of the problem have introduced a number of constraints, caused by reasonable boundaries of parameters.

| Q_{n p r e d}^{m} - Q_{n o b s}^{m} | ⟶ min_{α}

(2)

The calibration was performed in two stages. At first, the data from the processed period was split into two datasets by the presence of water input from the catchment. The following assumption was stated: in periods with no snow cover and no rainfall, the river discharge is determined only by the recession coefficient and the transfer. The stage of the calibration was performed on this partial data for the reduced Equation (3). On the data, that describes periods of water intake from the part of catchment, corresponding to the studied gauge, the search of the remaining parameters was initialized.

Q_{n + 1}^{m} = α_{4} Q_{n}^{m} + α_{5} (Q_{n}^{m} - Q_{n}^{m - 1})

(3)

Due to the simplicity of the proposed model, it can be relatively easy to interpret the forecasts. By evaluating the terms of the Equation (1), model’s users can analyze which factor between rainfall or snow melt contributes to the alteration of river discharge in the particular moment. The variation of the river discharge, caused by these factors, can be fully interpreted.

The final stage of the SRM-based approach is the transition from river discharge to the water levels. To discover the dependency between these variables, the decision tree model has been trained with the date and the discharge as the independent variables and the maximum daily water level as the dependent one. Generally, the dependency is one-to-one correspondence with the exception of cases, when the river flow is blocked by ice jams.

The input variables, used in the simulation are taken from the field observations and from satellite data. Temperature and precipitation are measured at the meteorological stations and then interpolated to the hydrological gauge in the manner, described in the Section 2.3.

To assess the amount of snow cover, it was decided not to be limited only to measurement data due to imperfect equipment, as well as the low quality of the results of values interpolation from a point to an area. Therefore for calculating snow cover fraction, remote sensing data from MODIS MOD10A1 product [64] was used. It provides spatial data, which is more qualitative and complete than information from the meteorological station (Figure 11).

Figure 11. NDSI spatial field obtained from the MODIS sensor for 28 February 2010. Base map source: Bing Satellite. The date of capture for the base map is different.

Accumulation of runoff in each section of the watercourse due to melting of snow cover occurs from a certain territory due to the relief. Extracting watershed by expert manual solution is a laborious process, therefore the segmentation of territory on catchments for each watercourse tributary was implemented with a capacity of GRASS geographic information system (GIS). The algorithm r.watershed works on digital elevation model (DEM) as input data, during processing, calculates water streams and returns the labels for each catchment. Its core is an A^T least-cost search algorithm [65] designed to minimize the impact of DEM data errors. In this work open-source DEM - Shuttle Radar Topography Mission (SRTM) [66] was used. Steps of calculating the watersheds for two hydro gauges are presented in Figure 12.

Figure 12. Steps of watershed segmentation. (1) streamlines, generated on relief model; (2) catchment segmentation for each tributary; (3) watershed segmentation for hydro gauges 3036 and 4045.

The Normalized-Difference Snow Index (NDSI), that was obtained from the MODIS remote sensing and used for evaluation of the snow covered areas in the catchment, has reliability issues, when applied to the evergreen coniferous forests, that are common in the Eastern Siberia [67]. Additionally, the remote sensing data had to be filtered from the measurements, taken in the conditions, when the view from the satellite is obstructed: over the clouds, in nights, etc.

2.7. Quality Metrics for Forecasts

The following metrics were used:

NSE—Nash–Sutcliffe model efficiency coefficient. The metric changes from $- \infty$ to 1. The closer the metric is to one, the better;
MAE—Mean Absolute Error. The metric changes from 0 to ∞. The closer the metric is to zero, the better. The units of this metric are the same as the target variable—centimeters;
SMAPE—Symmetric Mean Absolute Percentage Error. Varies from 0 to ∞. The closer the metric is to zero, the better. Measured as a percentage.

The equations for NSE (Equation (4)), MAE (Equation (5)) and SMAPE (Equation (6)) are provided below. The following notation is used: n = number of elements in the sample,

y_{i}

is the prediction,

x_{i}

the true value and

\bar{x}

is averaged observed water level.

N S E = 1 - \frac{\sum_{i = 1}^{n} {(x_{i} - y_{i})}^{2}}{\sum_{i = 1}^{n} {(x_{i} - \bar{x})}^{2}}

(4)

M A E = \frac{\sum_{i = 1}^{n} | y_{i} - x_{i} |}{n}

(5)

S M A P E = \frac{100 %}{n} \sum_{i = 1}^{n} \frac{| y_{i} - x_{i} |}{(| x_{i} | + | y_{i} |) / 2}

(6)

The specific approach was used to validate the algorithm using the metrics described above. The composite pipeline is able to provide forecast

y_{n}

for n days ahead with the following time steps for forecast

{t + 1, . . ., t + n}

. In this case, the retrospective data X is used to train the models in the pipeline. To prepare a forecast for the interval from

t + 1

to

t + n

the m values of the predictor

X_{m}

for time steps

{t, . . ., t - m}

are required. Respectively, the sub-sample of retrospective data

{t + n, . . ., t + n - m}

is required to obtain the forecast from

t + n

to

t + 2 \times n

. The trained pipeline can predict the values for n next elements if the set of predictors with the specified length is provided. So, the length of the validation part of the time series can be arbitrary. In the paper, the length of the validation horizon was equal to 805 elements (days).

The distribution of error across forecast horizons was also additionally investigated. For this purpose Metrics were measured only on forecasts of 1, 2, 3, 4, 5 and 7 days in advance. In addition to calculating metrics, an extensive analysis of graphical materials: predictions and error plots are presented.

3. Results

3.1. Validation

Validation of the results was carried out by comparing the predicted water level values and the actual values recorded at the hydro gauges. Below all models have been configured to give a forecast for seven days ahead. Since preparing and tuning a composite pipeline requires a multi-step approach, the following strategy was used. Models identification (search for optimal structure and hyperparameters values) in an automated way was performed on data for the period from 1985 to 2006. From 2006 to 2010, the ensemble model was prepared and tuned. Finally, validation was performed with daily data collected from 2010 to 2012. These years were chosen for validation because this period was relatively challenging spring floodwater on Lena River [51].

Hydro gauges with the following identifiers were used for validation: 3019, 3027, 3028, 3029, 3030, 3035, 3041, 3045, 3050, 3230—ten posts in total. The validation of individual models and the ensemble is presented below.

Several examples of the forecasts obtained from a model based on time series are shown in Figure 13. For prediction on each subsequent validation block, the actual values of the series were used. It is possible to estimate the model quality for 115 validation blocks of seven elements—a total of 805 elements for each post.

Figure 13. Validation of time series forecasting model for several hydro gauges. Forecasting horizon—seven days.

As can be seen from Figure 13, the proposed model can efficiently predict the level rise during floods (e.g., beginning of summer when rapid raise of water level is observed). The average value of the metrics confirms that the time series model is relatively accurate even without including it in the ensemble: NSE—0.74.

For the multi-target regression model, the forecast efficiency was also estimated: NSE—0.72. “Actual vs. predicted” plots for comparison forecasts obtained from different models provided in Figure 14.

Figure 14. Actual values vs. predicted for the two hydro gauges. Time series and multi-target models forecasts are shown for validation part. Forecast horizon is seven days.

As can be seen from Figure 14, there is no model that is significantly more accurate than the others. The same is confirmed by the metric values. So, it was decided to use ensemble to remove drawbacks from individual blocks predictions (Figure 15).

Figure 15. Forecasts of individual blocks and ensemble on the validation part. The most relevant parts of time series are shown in the sub-figures. Forecast horizon is seven days.

As can be seen from the figure, the most accurate forecast is given by the ensemble of models. Also, validation was performed not only on the entire validation period but only on the parts when the extreme water level rises occur (April–July). This period is referred to below as “Spring/Summer”. Results of validation composite pipeline and individual blocks are shown in Table 5.

Table 5. Validation metrics for ensemble and individual machine learning models obtained for ten hydro gauges.

Additionally, experiments were conducted to determine how different the values of the metrics on the training (Train), the test (Test) and the validation sample (Validation). The results for all model blocks for station 3045 are shown in Table 6.

Table 6. The value of the NSE and MAE metrics for stand-alone models and for ensemble.

To better appreciate the robustness of the forecast, the distribution of error across forecast horizons was investigated. For that purpose, the following forecast horizons (days) were tested: 1, 2, 3, 4, 5, 6, 7 (Figure 16).

Figure 16. Error metrics (NSE and SMAPE) of ensemble model for several forecasting horizons.

As can be seen from the Figure 16, as the forecast horizon increases, the NSE value falls the most. At the same time, the MAE metric increases slightly. In can be concluded that the predictive efficiency of the model decreases in the domain of level variance prediction when the forecast horizon increases.

Finally, experiments with different gap sizes in training sample were conducted to find out how robust the model is to the presence of gaps. For example, the water level data without gaps were prepared for station 3045. In these series, the gaps were generated and reconstructed using the methods described in the current paper. Then, the ensemble trained on recovered data. The results of error estimation presented in Table 7.

Table 7. Dependence of the forecast error on the gap size in the training sample for ensemble model. Metrics are shown for test part.

It is clear from the table, that as the size of the gaps increases, the ensemble error on the test sample increases. However, this increase can be considered acceptable. Note that successful identification of machine learning models usually requires a lot of data.

3.2. Comparison

The composite pipeline obtained with the AutoML-based approach can be quite complex. Therefore, for a more objective assessment, we compared the quality of forecasting results with the most common autoregressive models and physical-based model.

It should be noted that the seasonal component clearly exists in the data. For there reason, using only autoregression methods can not bring a quality result inherently. So seasonal and trend decomposition described above was used to separate seasonal components predetermined by frequency estimation period. Two models were used for the prediction of series without seasonality: AR and ARIMA. After the evaluation, the seasonal component was returned to prediction.

Water level prediction and time-series analysis via ARIMA is common practice [39,68]. AR is more often found as part of ensembles [38], due to its naivety to use separately. The hyperparameters for the models were chosen according to the lowest value of the Akaike criterion (AIC) on the training sample. So, a prediction for the validation sample was made for each station with the calculation of the previously described metrics. Examples of the forecast by two models are presented in Figure 17. The average values of metrics for AR: NSE—0.59, ARIMA: NSE—0.56.

Figure 17. Forecast of statistical models (AR and ARIMA) on the validation part of water level time series.

The results of validation for all models are summarized in a common Table 8 for comparison.

Table 8. Summarized table for models for all period and spring/summer period validation.

4. Discussion

The proposed approach for the automated design of composite forecasting pipelines was validated on ten hydro gauges and two years of observations. In total, more than 8000 points were involved in the validation and comparison experiments. Figure 13 and Figure 14 demonstrate that even individual models can produce forecasts with small errors (SMAPE less than 32%) and relatively good efficiency metric NSE—bigger than 0.72. The use of ensembling made it possible to obtain an even better value: NSE 0.80. At the same time, comparison between predictions of the individual models and ensemble highlights that the composite pipeline produces a higher forecast error on the validation set by MAE and SMAPE. Note that the values of the metrics differ insignificantly. This may be because the ensemble has become more robust to distortions in the data, but is worse at reproducing absolute values in peaks during high water. So, NSE increased for ensemble, but error metrics such as MAE and MAPE became a little worse or stayed almost the same.

Metrics were also calculated for the most important period: from April to July. At this time, the highest water levels are typically observed on each hydro gauge. And the results showed that the ensemble metrics at these time sites sections are noticeably better (Table 5): NSE for time series model—0.60, for multi-target model—0.61, and for ensemble—0.71. So, the ensemble model allows achieving a significant increase of forecast precision exactly during the most difficult periods for forecasting.

The robustness of the algorithm to the presence of gaps in the initial data has been investigated (Table 7). According to the results of the experiments, it is shown that an increase in the gaps in the original data (up to 40%), leads to a slight decrease in NSE—12%. At the same time, other error metrics increased compared to training the algorithm on data without gaps: MAE increased by 42% and SMAPE by 54%. It is worth noting that the gap ratio of 40% is quite a large number. So, it is worth using ML and AutoML algorithms on data where the proportion of omissions is much smaller (up to 20%).

Additionally, the dependence of ensemble and individual blocks error on different parts of the data was investigated: on the training sample, the test sample, and the validation sample (Table 6). It is shown that the use of an ensemble of models avoids a significant increase in the forecast error on new data. So, the NSE for the composite pipelines decreased from 0.94 on the training sample to 0.84 on the validation sample (0.10 units). At the same time, the average decrease in NSE for individual blocks was 0.15.

A comparison with competing algorithms showed that the proposed approach is more efficient than the expert-configured AR (NSE 0.67) and ARIMA (NSE 0.69) models. Moreover, the physical SRM (NSE 0.74) is also inferior in all considered metrics to the composite approach.

In the tasks related to the modeling of hazardous events, the ability to interpret the forecast and understand how it was obtained from the model by an expert is highly desired. While the majority of the introduced ensemble elements lack the interpretability of their modeling results, the forecasts of the physics-based block (e.g., SRM), can be explained. Therefore, by obtaining the separate forecasts of the introduced models and the combined ensemble output, the expert will be able to analyze the nature of the river level changes by comparing tendencies with the SRM block.

Disadvantages of the proposed approach include the difficulty of combining forecasts between blocks. Other limitations in the applicability of the modeling approach can be traced back to the general limitations of data-driven methods. The insufficient or inaccurate measurements, prevalent in sparsely-populated regions, can result in the necessity to use limited training data or apply gap-filling or filtering algorithms. Any alterations in the river flow, e.g., damming of the river or melting of the glaciers in the catchment, can disturb the observed hydrological patterns and make the development of data-driven models impossible.

The snowmelt-runoff model can be introduced into the ensemble to model the gauges, that have both discharge and water level measurements. Such gauges are relatively uncommon in the available data due to the difficulties of measuring the discharge of the large rivers.

On the other hand, the advantages are low forecasting error, native interpretability due to the presence of a physical model in its structure, and robustness to distortions in data and lack of information. Moreover, individual blocks in the composite pipeline are proposed to be identified using an evolutionary algorithm. This allows automating the system deployment process. Another important feature of the composite approach is modularity. It is possible to use only subset of suitable models and to combine them into an ensemble if necessary.

The proposed approach can be improved. To increase the quality of the ensemble results, two opposite approaches can be used. The first approach is aimed at the enhancement of the individual forecasts of the ensemble blocks. Deep learning techniques, e.g., LSTM networks, can be used to improve the quality of the time-series forecasts. Moreover, different architectures of neural networks and different ML models can replace the multi-output regression. Furthermore, more sophisticated physics-based models that take into consideration the water dynamics can be added. Other data sources can be involved as the features for the data-driven elements.

The second approach is based on the development of better forecast integration techniques to form ensemble predictions. For example, the meta-algorithm, that is currently used, may be replaced by the softmax techniques, heuristic methods of combining the answers of basic algorithms (using visualizations, learning in special subspaces, etc.). Other special encodings of target values and reduction of the problem solution to the solution of several tasks also can improve ensemble. One of the most popular techniques here is ECOC (encoding output with error correction), the other is tuning for different transformations of the target feature.

5. Conclusions

In the paper, the automated approach for short-term river flood forecasting is proposed. It allows (1) identifying the composite pipelines that include machine learning methods and domain-specific models; and (2) using it for the prediction of future floods or the reconstruction of historical events. The implemented combination of data-driven and physical models in a single pipeline makes it possible to enrich the models by existing knowledge (as it is proposed in concepts of composite AI).

The validation of this approach is conducted using the Lena River case study. As a result of the application of the proposed approach, a composite pipeline for flood forecasting was identified. The presented model consists of three blocks, two of which are based on machine learning methods and one on the physical Snowmelt-Runoff-Model. The described approach relies on AutoML algorithms to identify the structure of machine learning models. So, this allows fast construction of composite pipelines on different data sources and at the same time identifies robust models. A Random Forest model was used to ensemble the predictions, which hyperparameters were tuned using a Bayesian approach.

The obtained pipeline was validated on ten hydro gauges and two years of daily observations of maximum water levels. The composite pipeline showed the best metrics on the validation set: NSE 0.80. At the same time, individual blocks of such ensemble got the following metrics: 0.74 for time series and 0.72 for the multi-target regression model. Validation was also performed for the high water period (April–July) to ensure that the model predicts spring floods well enough. It was shown that the quality of the forecast is slightly lower in this case, but still remains at a fairly high level—0.71. Comparative analysis with algorithms-competitors proves that the proposed ensemble is more effective than expert-configured AR (NSE 0.67), ARIMA (NSE 0.69), or stand-alone SRM (NSE 0.84).

Also, we can conclude that the large size of the retrospective data makes it is possible to build a sufficiently effective time series forecasting model based on only autoregressive dependencies. The high value of NSE—0.74 and low values for other metrics (MAE—45.02 and SMAPE—28.44) confirm this hypothesis. However, results for the ensemble of models prove its ability to significantly reduce forecast error.

In this study, we applied an approach to predict water levels for both the high water period during spring and the entire year. Although it was validated on the Lena River, this approach can be generalized for other rivers as well. For example, Canadian rivers such as the Mackenzie, Peace-Athabasca and Slave River have common regimes with significant floodings due to ice jams [69]. It was achieved by automating the process of building composite data-driven models with the help of AutoML techniques (implemented as a part of the open-source FEDOT framework available in https://github.com/nccr-itmo/FEDOT (accessed on 6 December 2021)).

Author Contributions

Conceptualization, M.S. and Y.B.; methodology, Y.B.; software, G.M. and M.S.; validation, M.S., Y.B. and M.M.; formal analysis, M.M. and Y.B.; investigation, M.M.; resources, M.S.; data curation, G.M.; writing—original draft preparation, I.R., M.S. and Y.B.; writing—review and editing, M.S., M.M. and N.O.N.; visualization, M.S. and Y.B.; supervision, N.O.N.; project administration, N.O.N.; funding acquisition, N.O.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research is financially supported by the Ministry of Science and Higher Education, Agreement FSER-2021-0012.

Institutional Review Board Statement

The study did not involve humans or animals.

Informed Consent Statement

The study did not involve humans.

Data Availability Statement

The source code used to build the composite modeling pipeline and the final pipeline itself is available in the open repository https://github.com/ITMO-NSS-team/composite-flood-forecast (accessed on 6 December 2021). Due to license restrictions, there is no opportunity for us to share full data in a repository (only synthetic samples are presented in it). The full dataset can be requested from the owners of the data [47,48].

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ML	Machine learning
AutoML	Automated machine learning
AIC	Akaike criterion
SRM	Snowmelt-Runoff Model
NSE	Nash–Sutcliffe model efficiency coefficient
MAE	Mean Absolute Error
SMAPE	Symmetric Mean Absolute Percentage Error
MODIS	Moderate Resolution Imaging Spectroradiometer
AR	Autoregression model
ARIMA	Autoregressive Integrated Moving Average
DEM	Digital elevation model
SRTM	Shuttle Radar Topography Mission
NDSI	Normalized-Difference Snow Index
LOESS	locally estimated scatterplot smoothing
GIS	geographic information system.

References

Davies, J.B. Economic analysis of the costs of flooding. Can. Water Resour. J. Rev. Can. Des Ressources Hydriques 2016, 41, 204–219. [Google Scholar] [CrossRef]
Liu, J.; Hertel, T.W.; Diffenbaugh, N.S.; Delgado, M.S.; Ashfaq, M. Future property damage from flooding: Sensitivities to economy and climate change. Clim. Chang. 2015, 132, 741–749. [Google Scholar] [CrossRef]
Otto, K.; Boos, A.; Dalbert, C.; Schöps, D.; Hoyer, J. Posttraumatic symptoms, depression, and anxiety of flood victims: The impact of the belief in a just world. Personal. Individ. Differ. 2006, 40, 1075–1084. [Google Scholar] [CrossRef]
Speight, L.J.; Cranston, M.D.; White, C.J.; Kelly, L. Operational and emerging capabilities for surface water flood forecasting. Wiley Interdiscip. Rev. Water 2021, 8, e1517. [Google Scholar] [CrossRef]
Ramírez, J.A. Prediction and modeling of flood hydrology and hydraulics. In Inland Flood Hazards: Human, Riparian and Aquatic Communities; Cambridge University Press: Cambridge, UK, 2000; p. 498. [Google Scholar]
Jain, S.K.; Mani, P.; Jain, S.K.; Prakash, P.; Singh, V.P.; Tullos, D.; Kumar, S.; Agarwal, S.; Dimri, A. A Brief review of flood forecasting techniques and their applications. Int. J. River Basin Manag. 2018, 16, 329–344. [Google Scholar] [CrossRef]
Agudelo-Otálora, L.M.; Moscoso-Barrera, W.D.; Paipa-Galeano, L.A.; Mesa-Sciarrotta, C. Comparison of physical models and artificial intelligence for prediction of flood levels. Tecnol. Cienc. Agua 2018, 9, 209–235. [Google Scholar] [CrossRef]
Sahraei, S.; Asadzadeh, M.; Unduche, F. Signature-based multi-modelling and multi-objective calibration of hydrologic models: Application in flood forecasting for Canadian Prairies. J. Hydrol. 2020, 588, 125095. [Google Scholar] [CrossRef]
Aqil, M.; Kita, I.; Yano, A.; Nishiyama, S. Analysis and prediction of flow from local source in a river basin using a Neuro-fuzzy modeling tool. J. Environ. Manag. 2007, 85, 215–223. [Google Scholar] [CrossRef]
Amarís Castro, G.E.; Guerrero Barbosa, T.E.; Sánchez Ortiz, E.A. Comportamiento de las ecuaciones de Saint-Venant en 1D y aproximaciones para diferentes condiciones en régimen permanente y variable. Tecnura 2015, 19, 75–87. [Google Scholar] [CrossRef][Green Version]
Bui, D.T.; Panahi, M.; Shahabi, H.; Singh, V.P.; Shirzadi, A.; Chapi, K.; Khosravi, K.; Chen, W.; Panahi, S.; Li, S.; et al. Novel hybrid evolutionary algorithms for spatial prediction of floods. Sci. Rep. 2018, 8, 15364. [Google Scholar] [CrossRef]
Noymanee, J.; Nikitin, N.O.; Kalyuzhnaya, A.V. Urban pluvial flood forecasting using open data with machine learning techniques in pattani basin. Procedia Comput. Sci. 2017, 119, 288–297. [Google Scholar] [CrossRef]
Wu, W.; Emerton, R.; Duan, Q.; Wood, A.W.; Wetterhall, F.; Robertson, D.E. Ensemble flood forecasting: Current status and future opportunities. Wiley Interdiscip. Rev. Water 2020, 7, e1432. [Google Scholar] [CrossRef]
Siam, Z.S.; Hasan, R.T.; Anik, S.S.; Noor, F.; Adnan, M.S.G.; Rahman, R.M. Study of Hybridized Support Vector Regression Based Flood Susceptibility Mapping for Bangladesh. In Proceedings of the International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, Kuala Lumpur, Malaysia, 26–29 July 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 59–71. [Google Scholar]
Muñoz, D.F.; Muñoz, P.; Moftakhari, H.; Moradkhani, H. From local to regional compound flood mapping with deep learning and data fusion techniques. Sci. Total Environ. 2021, 782, 146927. [Google Scholar] [CrossRef]
Zhou, Z.H. Ensemble learning. In Machine Learning; Springer: Berlin/Heidelberg, Germany, 2021; pp. 181–210. [Google Scholar]
Kalyuzhnaya, A.V.; Nikitin, N.O.; Vychuzhanin, P.; Hvatov, A.; Boukhanovsky, A. Automatic evolutionary learning of composite models with knowledge enrichment. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference Companion, Cancún, Mexico, 8–12 July 2020; pp. 43–44. [Google Scholar]
Korytny, L.M.; Kichigina, N.V. Geographical analysis of river floods and their causes in southern East Siberia. Hydrol. Sci. J. 2006, 51, 450–464. [Google Scholar] [CrossRef]
Tei, S.; Morozumi, T.; Nagai, S.; Takano, S.; Sugimoto, A.; Shingubara, R.; Fan, R.; Fedorov, A.; Gavrilyeva, T.; Tananaev, N.; et al. An extreme flood caused by a heavy snowfall over the Indigirka River basin in Northeastern Siberia. Hydrol. Process. 2020, 34, 522–537. [Google Scholar] [CrossRef]
Gautier, E.; Brunstein, D.; Costard, F.; Lodina, R. Fluvial dynamics in a deep permafrost zone—the case of the middle Lena river (Central Siberia). In Permafrost; Phillips, M., Springman, S.M., Arenson, L.U., Eds.; Swets & Zeitlinger: Lisse, The Netherlands, 2003; pp. 271–275. [Google Scholar]
Ma, X.; Fukushima, Y. A numerical model of the river freezing process and its application to the Lena River. Hydrol. Process. 2002, 16, 2131–2140. [Google Scholar] [CrossRef]
Golovlyov, P.; Kornilova, E.; Krylenko, I.; Belikov, V.; Zavadskii, A.; Fingert, E.; Borisova, N.; Morozova, E. Numerical modeling and forecast of channel changes on the river Lena near city Yakutsk. Proc. Int. Assoc. Hydrol. Sci. 2019, 381, 65–71. [Google Scholar] [CrossRef]
Tarasova, L.; Merz, R.; Kiss, A.; Basso, S.; Blöschl, G.; Merz, B.; Viglione, A.; Plötner, S.; Guse, B.; Schumann, A.; et al. Causative classification of river flood events. Wiley Interdiscip. Rev. Water 2019, 6, e1353. [Google Scholar] [CrossRef]
Korytny, L.; Kichigina, N.; Gartsman, B.; Gubareva, T. Rain Floods of The Far East and East Siberia. In Extreme Hydrological Events: New Concepts for Security; Vasiliev, O., van Gelder, P., Plate, E., Bolgov, M., Eds.; Springer: Dordrecht, The Netherlands, 2007; pp. 125–135. [Google Scholar]
Struchkova, G.; Lebedev, M.; Timofeeva, V.; Kapitonova, T.; Gavrilieva, A. Neural Network Approaches to Modeling of Natural. In Emergencies. Prediction of Lena River Spring High Waters. In Proceedings of the IOP Conference Series: Earth and Environmental Science, Vladivostok, Russian, 8–10 December 2020; IOP Publishing: Bristol, UK, 2021; Volume 666, p. 032084. [Google Scholar]
Chen, W.B.; Liu, W.C. Modeling flood inundation induced by river flow and storm surges over a river basin. Water 2014, 6, 3182–3199. [Google Scholar] [CrossRef]
Carling, P.; Villanueva, I.; Herget, J.; Wright, N.; Borodavko, P.; Morvan, H. Unsteady 1D and 2D hydraulic models with ice dam break for Quaternary megaflood, Altai Mountains, southern Siberia. Glob. Planet. Chang. 2010, 70, 24–34. [Google Scholar] [CrossRef]
Buzin, V.; Klaven, A.; Kopaliani, Z. Laboratory modelling of ice jam floods on the lena river. In Extreme Hydrological Events: New Concepts for Security; Vasiliev, O., van Gelder, P., Plate, E., Bolgov, M., Eds.; Springer: Dordrecht, The Netherlands, 2007; pp. 269–277. [Google Scholar]
Sankaranarayanan, S.; Prabhakar, M.; Satish, S.; Jain, P.; Ramprasad, A.; Krishnan, A. Flood prediction based on weather parameters using deep learning. J. Water Clim. Chang. 2020, 11, 1766–1783. [Google Scholar] [CrossRef]
Mitra, P.; Ray, R.; Chatterjee, R.; Basu, R.; Saha, P.; Raha, S.; Barman, R.; Patra, S.; Biswas, S.S.; Saha, S. Flood forecasting using Internet of things and artificial neural networks. In Proceedings of the 2016 IEEE 7th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, BC, Canada, 13–15 October 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1–5. [Google Scholar]
Kan, G.; Liang, K.; Yu, H.; Sun, B.; Ding, L.; Li, J.; He, X.; Shen, C. Hybrid machine learning hydrological model for flood forecast purpose. Open Geosci. 2020, 12, 813–820. [Google Scholar] [CrossRef]
Adamowski, J.F. Development of a short-term river flood forecasting method for snowmelt driven floods based on wavelet and cross-wavelet analysis. J. Hydrol. 2008, 353, 247–266. [Google Scholar] [CrossRef]
Martinec, J.; Rango, A.; Major, E. The Snowmelt-Runoff Model (SRM) User’s Manual; NASA: Washington, DC, USA, 1983. [Google Scholar]
World Meteorological Organization. Intercomparison of Models of Snowmelt Runoff; WMO: Geneva, Switzerland, 1986. [Google Scholar]
Li, L.; Simonovic, S. System dynamics model for predicting floods from snowmelt in North American prairie watersheds. Hydrol. Process. 2002, 16, 2645–2666. [Google Scholar] [CrossRef]
Leavesley, G.; Lichty, R.; Troutman, B.; Saindon, L. Precipitation-runoff modeling system: User’s manual. Water-Resour. Investig. Rep. 1983, 83, 207. [Google Scholar]
Lindenschmidt, K.E.; Das, A.; Rokaya, P.; Chu, T. Ice-jam flood risk assessment and mapping. Hydrol. Process. 2016, 30, 3754–3769. [Google Scholar] [CrossRef]
Chen, Y.; Gan, M.; Pan, S.; Pan, H.; Zhu, X.; Tao, Z. Application of auto-regressive (AR) analysis to improve short-term prediction of water levels in the Yangtze estuary. J. Hydrol. 2020, 590, 125386. [Google Scholar] [CrossRef]
Yu, Z.; Lei, G.; Jiang, Z.; Liu, F. ARIMA modelling and forecasting of water level in the middle reach of the Yangtze River. In Proceedings of the 2017 4th International Conference on Transportation Information and Safety (ICTIS), Banff, AB, Canada, 8–10 August 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 172–177. [Google Scholar]
Takakura, H. Local Perception of River Thaw and Spring Flooding of the Lena River. In Global Warming and Human—Nature Dimension in Northern Eurasia; Hiyama, T., Takakura, H., Eds.; Springer: Singapore, 2018; pp. 29–51. [Google Scholar] [CrossRef]
Kapitonova, T.; Lebedev, M.; Timofeeva, V.; Nogovitsyn, D.; Struchkova, G. Flood forecasting on river lena during spring high water in area of location of potentially dangerous objects. In Proceedings of the International Multidisciplinary Scientific GeoConference: SGEM, Albena, Bulgaria, 30 June–6 July 2016; Volume 1, pp. 329–334. [Google Scholar]
Smith, L.C.; Pavelsky, T.M. Estimation of river discharge, propagation speed, and hydraulic geometry from space: Lena River, Siberia. Water Resour. Res. 2008, 44. [Google Scholar] [CrossRef]
Sakai, T.; Hatta, S.; Okumura, M.; Hiyama, T.; Yamaguchi, Y.; Inoue, G. Use of Landsat TM/ETM+ to monitor the spatial and temporal extent of spring breakup floods in the Lena River, Siberia. Int. J. Remote Sens. 2015, 36, 719–733. [Google Scholar] [CrossRef]
Horritt, M.; Bates, P. Evaluation of 1D and 2D numerical models for predicting river flood inundation. J. Hydrol. 2002, 268, 87–99. [Google Scholar] [CrossRef]
Kornilova, E.; Krylenko, I.; Golovlyov, P.; Sazonov, A.; Nikitsky, A. Verification of the two-dimensional hydrodynamic model of the Lena River near Yakutsk by time-varying satellite data. Sovrem. Probl. Distantsionnogo Zondirovaniya Zemli Kosmosa 2018, 15, 169–178. [Google Scholar] [CrossRef]
Ferreira, L.; Pilastri, A.; Martins, C.M.; Pires, P.M.; Cortez, P. A Comparison of AutoML Tools for Machine Learning, Deep Learning and XGBoost. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–8. [Google Scholar]
Roshydromet; Rosvodresursy. Daily Hydrological Data and Ice Weakening Activities for the Lena River and Tributaries, data-processing: Research Development Infrastructure, CAG. 2021. Available online: http://data-in.ru/data-catalog/datasets/172/ (accessed on 9 November 2021).
Roshydromet. Meteorology of the Lena River and Tributaries Area: Monthly, Daily and Three Hourly Weather Characteristics for 1985–2020, data-Processing: Research Development Infrastructure, CAG. 2021. Available online: http://data-in.ru/data-catalog/datasets/173/ (accessed on 9 November 2021).
Yang, D.; Kane, D.L.; Hinzman, L.D.; Zhang, X.; Zhang, T.; Ye, H. Siberian Lena River hydrologic regime and recent change. J. Geophys. Res. Atmos. 2002, 107, ACL 14–1–ACL 14–10. [Google Scholar] [CrossRef]
Zhang, T.; Barry, R.G.; Knowles, K.; Heginbottom, J.; Brown, J. Statistics and characteristics of permafrost and ground-ice distribution in the Northern Hemisphere. Polar Geogr. 1999, 23, 132–154. [Google Scholar] [CrossRef]
Kil’myaninov, V. The effect of meteorological conditions prior to ice run on the extent of ice jam floods on the Lena River. Russ. Meteorol. Hydrol. 2012, 37, 276–278. [Google Scholar] [CrossRef]
Nikitin, N.O.; Vychuzhanin, P.; Sarafanov, M.; Polonskaia, I.S.; Revin, I.; Barabanova, I.V.; Maximov, G.; Kalyuzhnaya, A.V.; Boukhanovsky, A. Automated evolutionary approach for the design of composite machine learning pipelines. Future Gener. Comput. Syst. 2022, 127, 109–125. [Google Scholar] [CrossRef]
Sagi, O.; Rokach, L. Ensemble learning: A survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2018, 8, e1249. [Google Scholar] [CrossRef]
Bergstra, J.; Komer, B.; Eliasmith, C.; Yamins, D.; Cox, D.D. Hyperopt: A python library for model selection and hyperparameter optimization. Comput. Sci. Discov. 2015, 8, 014008. [Google Scholar] [CrossRef]
Bergstra, J.; Bardenet, R.; Bengio, Y.; Kégl, B. Algorithms for hyper-parameter optimization. In Proceedings of the International Conference on Advances in Neural Information Processing Systems, Granada, Spain, 12–15 December 2011; Volume 24. [Google Scholar]
Ponkina, E.; Illiger, P.; Krotova, O.; Bondarovich, A. Do ARMA Models Provide Better Gap Filling in Time Series of Soil Temperature and Soil Moisture? The Case of Arable Land in the Kulunda Steppe, Russia. Land 2021, 10, 579. [Google Scholar] [CrossRef]
Körner, P.; Kronenberg, R.; Genzel, S.; Bernhofer, C. Introducing Gradient Boosting as a universal gap filling tool for meteorological time series. Meteorol. Z. 2018, 27, 369–376. [Google Scholar] [CrossRef]
Moffat, A.M.; Papale, D.; Reichstein, M.; Hollinger, D.Y.; Richardson, A.D.; Barr, A.G.; Beckstein, C.; Braswell, B.H.; Churkina, G.; Desai, A.R.; et al. Comprehensive comparison of gap-filling techniques for eddy covariance net carbon fluxes. Agric. For. Meteorol. 2007, 147, 209–232. [Google Scholar] [CrossRef]
Ren, H.; Cromwell, E.; Kravitz, B.; Chen, X. Using deep learning to fill spatio-temporal data gaps in hydrological monitoring networks. Hydrol. Earth Syst. Sci. Discuss. 2019, 1–20. [Google Scholar] [CrossRef]
Sarafanov, M.; Nikitin, N.O.; Kalyuzhnaya, A.V. Automated data-driven approach for gap filling in the time series using evolutionary learning. arXiv 2021, arXiv:2103.01124. [Google Scholar]
Rojo, J.; Rivero, R.; Romero-Morte, J.; Fernández-González, F.; Pérez-Badia, R. Modeling pollen time series using seasonal-trend decomposition procedure based on LOESS smoothing. Int. J. Biometeorol. 2017, 61, 335–348. [Google Scholar] [CrossRef] [PubMed]
Chen, D.; Zhang, J.; Jiang, S. Forecasting the short-term metro ridership with seasonal and trend decomposition using loess and LSTM neural networks. IEEE Access 2020, 8, 91181–91187. [Google Scholar] [CrossRef]
Golyandina, N.; Korobeynikov, A.; Zhigljavsky, A. SSA for Multivariate Time Series. In Singular Spectrum Analysis with R; Springer: Berlin/Heidelberg, Germany, 2018; pp. 189–229. [Google Scholar] [CrossRef]
Hall, D.K.; Riggs, G.A. MODIS/Terra Snow Cover Daily L3 Global 500 m SIN Grid, Version 61. MOD10A1; NASA: Washington, DC, USA, 2021. [Google Scholar]
Ehlschlaeger, C.R. Using the Aˆ T search algorithm to develop hydrologic models from digital elevation data. In Proceedings of the International Geographic Information System (IGIS) Symposium, Baltimore, MD, USA, 18–19 March 1989; pp. 275–281. [Google Scholar]
Rodriguez, E.; Morris, C.S.; Belz, J.E. A global assessment of the SRTM performance. Photogramm. Eng. Remote Sens. 2006, 72, 249–260. [Google Scholar] [CrossRef]
Wang, X.Y.; Wang, J.; Jiang, Z.Y.; Li, H.Y.; Hao, X.H. An Effective Method for Snow-Cover Mapping of Dense Coniferous Forests in the Upper Heihe River Basin Using Landsat Operational Land Imager Data. Remote Sens. 2015, 7, 17246–17257. [Google Scholar] [CrossRef]
Ghashghaie, M.; Nozari, H. Effect of dam construction on Lake Urmia: Time series analysis of water level via ARIMA. J. Agric. Sci. Technol. 2018, 20, 1541–1553. [Google Scholar]
Beltaos, S. Effects of climate on mid-winter ice jams. Hydrol. Process. 2002, 16, 789–804. [Google Scholar] [CrossRef]