Application of Hybrid Prediction Methods in Spatial Assessment of Inland Excess Water Hazard

Inland excess water is temporary water inundation that occurs in flat-lands due to both precipitation and groundwater emerging on the surface as substantial sources. Inland excess water is an interrelated natural and human induced land degradation phenomenon, which causes several problems in the flat-land regions of Hungary covering nearly half of the country. Identification of areas with high risk requires spatial modelling, that is mapping of the specific natural hazard. Various external environmental factors determine the behavior of the occurrence, frequency of inland excess water. Spatial auxiliary information representing inland excess water forming environmental factors were taken into account to support the spatial inference of the locally experienced inland excess water frequency observations. Two hybrid spatial prediction approaches were tested to construct reliable maps, namely Regression Kriging (RK) and Random Forest with Ordinary Kriging (RFK) using spatially exhaustive auxiliary data on soil, geology, topography, land use, and climate. Comparing the results of the two approaches, we did not find significant differences in their accuracy. Although both methods are appropriate for predicting inland excess water hazard, we suggest the usage of RFK, since (i) it is more suitable for revealing non-linear and more complex relations than RK, (ii) it requires less presupposition on and preprocessing of the applied data, (iii) and keeps the range of the reference data, while RK tends more heavily to smooth the estimations, while (iv) it provides a variable rank, providing explicit information on the importance of the used predictors.


Introduction
Inland excess water (IEW) is temporary water inundation, a form of surplus surface water, which occurs in flatlands due to both precipitation and groundwater emerging on the surface as substantial sources. It occurs most frequently in local depressions of large flat areas, irrespective of river floods. A complex interaction of natural (e.g., meteorological, hydrogeological, pedological, topographical), and anthropogenic (e.g., land use, agricultural engineering) factors contribute to the occurrence of IEW [1,2]. It causes several social, economic, and environmental problems in the flat-land regions of Hungary, covering nearly half of the country [3]. Although IEW has received the most extended scientific attention in Hungary, the phenomenon is not confined to this geographic region [4]. It is observed all over the world, where soils are characterized by low water permeability/infiltration, and surface runoff is limited. Literature frequently uses the phrase 'waterlogging' as IEW, and they mostly examine crop responses for waterlogging. Occurrences were reported from, other European countries (France [5], Romania [6,7], Serbia [8,9]), but also in Africa (Egypt [10], Ethiopia [11,12], Nigeria [13], South Africa [14]), Australia [11,[15][16][17][18], Asia (Bangladesh [19], China [20,21], India [16,22,23], Russia [24,25], Uzbekistan [25]), and South and North America (Argentina, Chile, and the USA [26][27][28][29]). Climate change is having significant impact on the hydrologic cycle, affecting water resource systems [30]. In relation with this impact, the frequency of IEW inundations is likely to be concerned.
IEW inundation data can be achieved from two different sources. Traditionally, in field observations (in Hungary systematically collected since 1935; [31]) coordinated by water management authorities are summarized on (paper) maps. The Hungarian regional Water Management Directorates created IEW maps systematically based on in-situ observations collected in the course of continuous field survey [32] and 1:10,000 and 1:25,000 topographic base maps [33]. These maps are varying in geographical extent, scale, and spatial accuracy. Data collections of the IEW inundations were carried out mainly on county level, which lead to differences in the temporal resolution of the datasets. The majority of the maps are hand-drawn, displaying single inundation events. From the large-scale observations, 1:50,000 and 1:100,000 synergy maps were derived, which are sometimes the only actually available data sources.
With the appearance of publicly available remote sensing data, like aerial and space-borne imagery and the development of image processing techniques, the in situ observations were complemented and IEW could be identified and classified in a more efficient and effective way [34]. There are two drawbacks of space-and/or air-borne data acquisitions. (i) They can provide only a snapshot of the inundation areas, thus most extended phases are not necessarily represented. (ii) Furthermore, the differentiation between natural waters and wetlands or IEW inundation is problematic for image interpretation and processing [35]. Last but not least, datasets compiled by the interpretation of aerial or satellite images originate only from the last two decades. Nevertheless, numerous works were dedicated to the identification and mapping of IEW based on remotely sensed information. Csekő [36] used radar data, Csornai et al. [37] combined radar identification with optical sensors, Rakonczai et al. [38] compared Landsat-based classification of IEW with in situ measurements, Licskó [39] and Szatmári et al. [40,41] used aerial data to identify IEW inundations, Mucsi and Henits [42] used sub-pixel based classification on a Landsat time series, Csendes and Mucsi [43] used hyperspectral imaging combined with the potentials of airborne scanning to monitor environmental processes, Van Leeuwen [44] used a new approach using a combination of an artificial neural network (ANN) and a geographic information system (GIS), Van Leeuwen et al. [45] presented the first results of a system that can monitor IEW over a large area with sufficient detail at a high interval and in a timely matter. The methodology is based on freely available satellite imagery, and a map with known water bodies to train the method to identify inundations.
Due to data quality and reliability of both types of data collections, there have been initiatives on the identification of IEW forming factors together with hazard mapping based on the static and dynamic influential factors. Mezősi et al. [46] investigated potential impacts of climate change on the Great Hungarian Plain based on regional climate models. They found a slight decrease of IEW hazard. However, future prediction had high uncertainty, because IEW is an exceedingly complex phenomenon, and they involved only climatic parameters into the hazard analysis. Barta [47] measured hydro-meteorological and pedological factors that influence the formation of IEW. At his study area, he was able to differentiate the two most frequent types of IEW: (1) the upwelling or vertical type, where groundwater table is increasing, and (2) the accumulative or horizontal type, where the water accumulates under gravity in the lowest areas due to limited infiltration and/or runoff, independent from the groundwater table or communicating by capillary system. Based on his results, in case of the accumulative type, the rate and temporal progress of infiltration, its extreme values in relation to soil saturation were also estimated. The measurements and his evaluation were based on monitoring points, and the results were not spatially assessed. Van Leeuwen et al. developed an approach using a combination of an artificial neural network (ANN) and a geographic information system (GIS) in order to identify IEW, and investigate its forming factors in a study area in southern Hungary [44]. Nađ et al. [8] carried out spatial assessment of IEW risk in a study area in Serbia. The hazard map was derived from analysis of satellite images from a period of four dates, the vulnerability map is based on a land cover reclassification, the risk map was generated by the combination of the hazard and impact maps.
Spatial modelling and mapping of environmental phenomena and variables are a complex task since many of them are results of complex biological, chemical, and physical interactions between the atmosphere, biosphere, lithosphere, pedosphere, hydrosphere, and anthroposphere that may operate on different scales. As Hengl [48] pointed out, for spatial modelling of environmental variables it is better to use hybrid techniques, which combine geostatistics with classical or more advanced statistical techniques. Hatvani et al. [49] examined ice core-derived water stable isotope records in an Antarctic macro region by using multiple linear regression analysis and ordinary kriging. Fehér and Rakonczai [50] used spatio-temporal sequential Gaussian cosimulation for modelling and analyzing shallow groundwater fluctuation and its effect on Hungarian landscapes. Recently, geostatistics applied together with machine learning has gained much more attention in the spatial modelling of environmental variables. For example, Koch et al. [51,52] used random forest regression kriging and random forest combined with residual Gaussian simulation for modelling the shallow groundwater and the depth of redox interface, respectively. Szabó et al. [53] compared the performance of random-forest-based pedotransfer functions and random forest combined with kriging in deriving 3D soil hydraulic properties. Pásztor et al. [54] mapped risk of IEW of a Hungarian county, Bozán et al. [55] mapped relative frequency of IEW inundation on the Great Hungarian Plain. Both mapping processes were carried out by Regression Kriging method, based on the relationship between the occurrence of IEW inundations and its driving factors. The result map originated from the sum of the Multiple Linear Regression model and the interpolated (Ordinary Kriging) residuals.
The present paper aimed at spatial modelling of IEW hazard in a Hungarian study area by two hybrid spatial prediction approaches, which combine multivariate statistics and machine learning with geostatistics. We applied Regression Kriging (RK) and Random Forest combined with Ordinary Kriging (RFK) based on locally experienced IEW frequency observations, involving spatially exhaustive auxiliary data representing IEW forming environmental factors. We also investigated the effect of the applied predictors on the results. We ran the predictive models with two combinations of auxiliary variables to test the effect of the introduction of new predictors (linked to at least one of the determining factors).

Study Area
The investigated area (788.7 km 2 ; Figure 1) which is situated in Jász-Nagykun-Szolnok County in the lowland featured Great Hungarian Plain, is entitled "10.07. Kisújszállás Excess Water Protection Section" (EWPS). It is supervised by the Middle Tisza District Water Directorate. Motivations for selection of the studied area were as follows: • majority of the area is hazarded due to excess water inundations; • poor vertical drainage of its soils due to heavy texture (high amount of expanding clay minerals, low permeability, limited infiltration); • a significant part of the investigated area is traditionally agricultural land used for productive farming, where arable crop production has dominated since the regulation of the rivers of the Great Hungarian Plain; • there are meteorological data series available in the necessary length and quality; • the area represented a pilot for the improvement of integrated management practices for public authorities to mitigate heavy rain risks and excess water hazard in the frame of the RAINMAN project (INTERREG CE968: "RAINMAN"). The area bordered by the Tisza River has a very diverse geomorphology, which covers the middle part of the Great Hungarian Plain. Climate of the "10.07. Kisújszállás EWPS" is moderately warm-dry, the annual precipitation is 500-550 mm. The area is mainly suitable for growing droughttolerant, long growing, and high heat demanding varieties, where the importance of water retention and irrigation is increasing. In addition to the frequent droughts, the excess waters can cause considerable damage due to the lowland featured area.
A significant part (~70-80 %) of the area is covered by Vertisols and Chernozems Reference Soil Groups according to World Reference Base for Soil Resources (WRB, [56]). Solonetz soils also occur in smaller patches. Due to its lowland character, the canal density is well above the national average. These canals generally serve drainage; however, a considerable length of the canals have dual purposes to serve irrigation water. Depth of groundwater varies, generally 0.5-2.5 m with seasonal fluctuation.
Preliminary examination of the 10.07 Kisújszállás EWPS found that a significant part of this area was endangered by excess waters for some natural reasons. Especially in the eastern border areas, where 26 of the examined 39 years were inundated in a different level by excess water. It was typical for the water coverage to stay in the area for 10 to 15 days. A significant proportion of cultivated crops could be fully destroyed by such inundation. Land use is characterized by a high proportion of agricultural land (91%), and within it a very high proportion of arable land (close to 79%), well above the national average. Grassland (11.08%) and forest (7.77%) are also heavily represented.

Reference Data
The responsible Water Management Directorates provided seasonal maps of areas affected by IEW from the period between 1962 and 2014. This legacy information was digitized, vectorized, and then aggregated (Figure 2.). The area bordered by the Tisza River has a very diverse geomorphology, which covers the middle part of the Great Hungarian Plain. Climate of the "10.07. Kisújszállás EWPS" is moderately warm-dry, the annual precipitation is 500-550 mm. The area is mainly suitable for growing drought-tolerant, long growing, and high heat demanding varieties, where the importance of water retention and irrigation is increasing. In addition to the frequent droughts, the excess waters can cause considerable damage due to the lowland featured area.
A significant part (~70-80%) of the area is covered by Vertisols and Chernozems Reference Soil Groups according to World Reference Base for Soil Resources (WRB, [56]). Solonetz soils also occur in smaller patches. Due to its lowland character, the canal density is well above the national average. These canals generally serve drainage; however, a considerable length of the canals have dual purposes to serve irrigation water. Depth of groundwater varies, generally 0.5-2.5 m with seasonal fluctuation.
Preliminary examination of the 10.07 Kisújszállás EWPS found that a significant part of this area was endangered by excess waters for some natural reasons. Especially in the eastern border areas, where 26 of the examined 39 years were inundated in a different level by excess water. It was typical for the water coverage to stay in the area for 10 to 15 days. A significant proportion of cultivated crops could be fully destroyed by such inundation. Land use is characterized by a high proportion of agricultural land (91%), and within it a very high proportion of arable land (close to 79%), well above the national average. Grassland (11.08%) and forest (7.77%) are also heavily represented.

Reference Data
The responsible Water Management Directorates provided seasonal maps of areas affected by IEW from the period between 1962 and 2014. This legacy information was digitized, vectorized, and then aggregated ( Figure 2).
Map of temporally aggregated legacy information on IEW inundation frequency provided reference data as follows. Multiple conditioned random samplings were carried out on the vectorized legacy data. The first condition was to sample each patch of the map by points equal to the square root of the area (in hectares) of the inundation frequency polygons. With the second condition, we made an exception with small, but frequently inundated patches. If the polygon was smaller than 1 ha, but inundation frequency was greater than 5, the polygon got 1 sampling point. The third condition referred to the minimum allowed distance between random points, which was set to 50 m (the pixel size used in the spatial model). With the fourth condition, points lying over settlements and water bodies were omitted. As any map, the used inundation frequency map just models the reality, thus it cannot be used as absolute reference and consequently its point sampling introduces certain inaccuracies into the applied reference data set. To reduce this effect and in order to have a more balanced sampling, ten random point datasets were constructed. The generated sample sets contain~13,000 virtual observation sites. The reference data sets were compiled as relative inundation frequency, the values were extracted for each of the 10 random point data sets. In accordance with this, mapping models were run 10 times. Map of temporally aggregated legacy information on IEW inundation frequency provided reference data as follows. Multiple conditioned random samplings were carried out on the vectorized legacy data. The first condition was to sample each patch of the map by points equal to the square root of the area (in hectares) of the inundation frequency polygons. With the second condition, we made an exception with small, but frequently inundated patches. If the polygon was smaller than 1 ha, but inundation frequency was greater than 5, the polygon got 1 sampling point. The third condition referred to the minimum allowed distance between random points, which was set to 50 m (the pixel size used in the spatial model). With the fourth condition, points lying over settlements and water bodies were omitted. As any map, the used inundation frequency map just models the reality, thus it cannot be used as absolute reference and consequently its point sampling introduces certain inaccuracies into the applied reference data set. To reduce this effect and in order to have a more balanced sampling, ten random point datasets were constructed. The generated sample sets contain ~13,000 virtual observation sites. The reference data sets were compiled as relative inundation frequency, the values were extracted for each of the 10 random point data sets. In accordance with this, mapping models were run 10 times.

Environmental Co-Variables
The predictive models were run with two combinations of auxiliary variables (also called environmental co-variables). Basic set (BS) consists of variables formerly used by Bozán et al. (2018). To test the effect of the introduction of new predictors (linked to at least one of the determining factors), an extended set (ES) was also compiled and used.
The effect of soil on the occurrence of IEW was modelled and spatially represented by the soil physical property layer and the 'landscape management soil type' of the Digital Kreybig Soil Information System (DKSIS; Pásztor et al., 2012). DKSIS is one of the most important nationwide spatial soil databases of Hungary, it is the reambulated and GIS-developed version of the legacy data of the soil survey lead by Kreybig [57,58]. In the legacy data, soil physical categories were elaborated according to water retention capability, permeability, and infiltration rate. Landscape management soil types were defined from the viewpoint of crop production, aggregated from pH, CaCO3 content,

Environmental Co-Variables
The predictive models were run with two combinations of auxiliary variables (also called environmental co-variables). Basic set (BS) consists of variables formerly used by Bozán et al. (2018). To test the effect of the introduction of new predictors (linked to at least one of the determining factors), an extended set (ES) was also compiled and used.
The effect of soil on the occurrence of IEW was modelled and spatially represented by the soil physical property layer and the 'landscape management soil type' of the Digital Kreybig Soil Information System (DKSIS; Pásztor et al., 2012). DKSIS is one of the most important nationwide spatial soil databases of Hungary, it is the reambulated and GIS-developed version of the legacy data of the soil survey lead by Kreybig [57,58]. In the legacy data, soil physical categories were elaborated according to water retention capability, permeability, and infiltration rate. Landscape management soil types were defined from the viewpoint of crop production, aggregated from pH, CaCO 3 content, and soil texture [59].
For a more sophisticated characterization of soils, the extended set was completed with hydrophysical soil properties represented by continuous variables. The 3D Soil Hydraulic Database of Europe (EU-SoilHydroGrids ver1.0, [60]) provides information on the most frequently required soil hydraulic properties at 250 m resolution at 7 soil depths up to 2 m with full European coverage. The following layers were used and clipped from EU-SoilHydroGrids as co-variables: saturated water content (pF = 0), water content at field capacity (pF = 2.5), wilting point (pF = 4.2), and saturated hydraulic conductivity. We converted the available information of the 7 available soil depths (0 cm, 5 cm, 15 cm, 30 cm, 60 cm, 100 cm, 200 cm) for 0-30 cm, 30-60 cm, 60-100 cm, 100-200 cm depth intervals.
Climate was represented by four spatial layers provided by the Hungarian Meteorological Service. Average annual precipitation, average annual temperature, average annual evaporation, and average annual evapotranspiration were compiled using the MISH method elaborated for the spatial interpolation of surface meteorological elements based on 30 year observation with 0.5 resolution [61]. The layers are available at 100 m resolution.
In addition, humidity index (HUMI) layer was also applied in the prediction. HUMI is used for the characterization of water stress periods. It was calculated by a 10% possibility of occurrence of root square of sum of monthly weighted precipitation and sum of monthly weighted potential evapotranspiration ratio [62]. The Hungarian Meteorological Service and the responsible Water Management Directorates provided precipitation and evapotranspiration data of 68 meteorological observation stations, covering the period of 1961-2014.
The effect of land use was characterized and spatially represented by a numeric coefficient based on the National CORINE Land Cover 1:50,000 database (CLC50) [63]. The categories were parameterized with expert-based land use indices characterizing their role in the formation of IEW [1]. According to our method, the lower the values of land use factor (i.e., artificial areas 0. Hydrogeology was spatially represented and was taken into consideration by the depth and the thickness of the uppermost aquitard, for which data was provided by the Geological and Geophysical Institute of Hungary (predecessor of Mining and Geological Survey of Hungary), and the standard depth of groundwater calculated on the average 10 highest values within the last 50 years. The reference groundwater data of well observations were provided by the General Directorate of Water Management. In order to get spatially exhaustive groundwater data, spatial interpolation (co-kriging) had to be carried out, using elevation provided by HydroDEM as a proper spatial co-variable [64].
The movement of groundwater cannot be studied by itself, since the flow systems of the sub-surface waters have a very significant influence on it. In order to clarify the role of sub-surface waters in the formation of excess water inundations, a distinction should be made between recharge and discharge zones. Groundwater flow, i.e., the difference between the amount of water infiltrated into and out of the groundwater for 2 × 2-km cells was calculated. The map prepared and provided by the Hungarian Mining and Geological Service, shows the recharge and discharge areas in the non-productive state, where the recharge cells can be characterized with positive values and the discharge areas with negative values (mm/year). It was used in the extended predictor variable set.
ES also contained a layer displaying distance from surface water bodies. It was calculated using Euclidean distance metrics in ESRI ArcGIS 10.6 software.

Preprocessing of Environmental Co-Variables
The auxiliary data set was preprocessed for spatial analysis. Raster layers were transformed to a common grid system, masked to the study area, and resampled to 50-m spatial resolution. Data of Hungarian Meteorological Service and EU-SoilHydroGrids data were converted by cubic convolution method. Since in the case of saturated hydraulic conductivity, cubic convolution produced artifacts, we applied nearest neighbor method. Categorical maps were also converted into the 50-m grid system by maximum area method.
To reduce the number of predictor variables in RK, and to avoid their multicollinearity, principal component analysis (PCA) was carried out on the continuous variables. Before PCA, all of the auxiliary variables were normalized to a 0-255 scale. In further analysis, the first principal components were used, which together explained 99% of the variance. Categorical variables were considered as indicator variables. Every category got a new layer: in case of occurrence, the grid value was set to 255, while out-of-category areas were coded with 0. In Random Forest combined with Ordinary Kriging (RFK) models, categorical co-variables were handled as factors, therefore there was no need to distinguish them in preprocessing.

The Applied Hybrid Prediction Methods
In this study, two hybrid prediction methods were used for spatial prediction of IEW risk, namely Regression Kriging and Random Forest combined with Ordinary Kriging. Both techniques rest on the same assumption, i.e., the target variable being mapped can be described and modelled in terms of a deterministic component and a stochastic component, which is where m(u) is the deterministic component describing structural variation, ε(u) is the stochastic component consisting of random variation that could be spatially correlated, and u is the vector of the geographical coordinates.
From a practical point of view, the main difference between RK and RFK is that how they describe and model the deterministic part of variation. In case of RK, the assumption made on the linear relationship between the target variable and the environmental co-variables may be too rigorous because this relationship could be more complex and this assumption can be valid just for the first approximation (Malone et al., 2018). That was the reason why digital environmental mapping has been interested in machine learning algorithms, which are able to describe and model the deterministic component via predictive models by applying different principles. Among the machine learning algorithms used in digital environmental mapping, random forest plays an accentuated role.

Regression Kriging (RK)
Regression Kriging (RK), also termed as Universal Kriging or Kriging with an External Drift [65], combines regression of the target variable on environmental co-variables with kriging of the regression residuals [65,66]. In this study, a Multiple Linear Regression (MLR) analysis was carried out to describe and model the relationship between the reference data set (with extracted information of IEW inundation frequency, see '2.2. Reference Data' section) and the environmental co-variables listed in '2.3. Environmental Co-Variables' section. In the course of MLR analysis, 0.05 significance level was applied, furthermore, the stepwise method was used for selecting the relevant environmental co-variables. The model obtained by MLR analysis described the deterministic component of IEW risk. The stochastic component was described by geostatistical modelling, namely Ordinary Kriging, using the regression residuals, which represented the variation that could not be explained by the MLR model [67]. The RK prediction can be obtained by summing the prediction of the MLR model and the prediction of Ordinary Kriging.
Finally, ten realizations were generated with both basic and extended variable sets using the ten random point datasets. Mean of the ten realizations provided the final result map of the IEW indundation probability by RK. Standard deviation and median map were also compiled from the ten realizations to test the robustness and indicate the reliability of the aggregated models.

Random Forest Combined with Ordinary Kriging (RFK)
Random Forest combined with Ordinary Kriging (RFK) is a relatively new hybrid method used in digital environmental mapping [53,68,69], which combines predictive model of Random Forest with kriging of the Random Forest residuals [70]. In RFK, the deterministic component is described by Random Forest (RF) as opposed to RK, where the deterministic component was described by MLR. The RF algorithm generates (depending on its settings and on the type of the dependent variable) a number of regression or classification trees. The model relies on averaging the result of the trees, which are grown independently from each other [71]. In the course of RF modelling, the number of trees was set at 100. As two environmental co-variable packages were applied for the RFK method: for BS mtry was set at 7, for ES mtry was set at 14. The RF part of the RFK models provides a variable rank, reflecting which co-variables play a more important role in the prediction model. The stochastic component was described by geostatistical modelling, namely Ordinary Kriging, using the RF residuals. The RFK prediction can be obtained by summing the prediction of RF and the prediction of Ordinary Kriging.
As in the case of RK, ten realizations were generated with both basic and extended variable sets using the ten random point datasets. Mean of the ten realizations provided the final result map of the IEW indundation probability by RFK. Standard deviation and median map were also compiled from the ten realizations, to test the robustness and indicate the reliability of the aggregated models.

Validation
In the present research, airborne originated data on inundation frequency were used only in the validation of mapping results. The data originated from relative IEW inundancy layer of Lechner Knowledge Centre Nonprofit Ltd. [72], for the period 1998-2016. The dataset (made available by courtesy) consists of 1451 points (Figure 3), inundation frequency is characterized in percentage categorized by ten (0%-10%, 10%-20%, 20%-30%, etc.). ten realizations to test the robustness and indicate the reliability of the aggregated models.

Random Forest Combined with Ordinary Kriging (RFK)
Random Forest combined with Ordinary Kriging (RFK) is a relatively new hybrid method used in digital environmental mapping [53,68,69], which combines predictive model of Random Forest with kriging of the Random Forest residuals [70]. In RFK, the deterministic component is described by Random Forest (RF) as opposed to RK, where the deterministic component was described by MLR. The RF algorithm generates (depending on its settings and on the type of the dependent variable) a number of regression or classification trees. The model relies on averaging the result of the trees, which are grown independently from each other [71]. In the course of RF modelling, the number of trees was set at 100. As two environmental co-variable packages were applied for the RFK method: for BS mtry was set at 7, for ES mtry was set at 14. The RF part of the RFK models provides a variable rank, reflecting which co-variables play a more important role in the prediction model. The stochastic component was described by geostatistical modelling, namely Ordinary Kriging, using the RF residuals. The RFK prediction can be obtained by summing the prediction of RF and the prediction of Ordinary Kriging.
As in the case of RK, ten realizations were generated with both basic and extended variable sets using the ten random point datasets. Mean of the ten realizations provided the final result map of the IEW indundation probability by RFK. Standard deviation and median map were also compiled from the ten realizations, to test the robustness and indicate the reliability of the aggregated models.

2.6.Validation
In the present research, airborne originated data on inundation frequency were used only in the validation of mapping results. The data originated from relative IEW inundancy layer of Lechner Knowledge Centre Nonprofit Ltd. [72], for the period 1998-2016. The dataset (made available by courtesy) consists of 1451 points (Figure 3.), inundation frequency is characterized in percentage categorized by ten (0%-10%, 10%-20%, 20%-30%, etc.).  Another kind of, only partly independent, validation was also carried out. The vectorized legacy layer of inundation, which was also the basis of the randomly sampled reference data points, was converted to raster format. Values of the four result maps (RK BS , RK ES , RFK BS , RFK ES ) were rounded to integer. Then a pixel by pixel comparison was carried out between the legacy data and the four result layers. The differences are depicted in a bar chart.

Software Background
Morphometric derivatives were calculated by SAGA GIS tools (Conrad et al., 2015). RK modelling were run in SAGA GIS environment, while RFK modelling were carried out in R statistical software [73]. Calculation of mean, standard deviation, and median maps, as well as editing the layouts of result maps were compiled in ArcGIS 10.6. Validation queries were run in ArcGIS 10.6, evaluation and depiction of validation results were carried out in MS Excel.

Result Maps
The four final result IEW indundation probability maps (mean of the ten realizations of RK BS , RK ES , RFK BS , RFK ES ) are presented in Figure 4. There are a few white pixels in the maps created with ES, which means 'NoData' pixels. They are actually fishponds, and originate from the EU-SoilHydroGrids co-variate layers, since water bodies are masked out in these originally 250-m resolution layers.   As for visible comparison to the inundation map (Figure 4), patterns of the more frequently inundated areas are similarly noticeable in all of the four result maps. Minimum values are~0, mean values are 1.5, and standard deviations are 0.7 in all of the four result maps; however, maximum values are significantly higher in RFK predictions (9.4) than in RK predictions (8.2). The latter means that RK is capable of narrowing down the range of the reference data in the prediction, while RFK provides more reliable results.
In map RFK ES , structure of the 'distance from surface water bodies' layer become remarkably discernible. This appears in the variable importance rank of RF part in the RFK models (Table 1). RF indicated that the 'distance from surface water bodies' co-variable has the most predictive power in the prediction ten times out of ten cases. The second most important variables according to the RFK ES models (Table 1) are 'groundwater recharge and discharge areas' (eight times) and 'average annual precipitation' (two times). The third place shows not so consistent image: 'groundwater recharge and discharge areas', 'average annual temperature', and 'saturated water content in 0-30 cm soil depth' occurs in two times, 'Closed Depressions', 'Vertical Distance to Channel Network', 'average annual precipitation', and 'average annual evapotranspiration' in one-one times. The fourth place in the ranking are 'average annual precipitation' (four times), 'saturated water content in 0-30 cm soil depth', and 'Vertical Distance to Channel Network' (two times), 'Closed Depressions', and 'average annual evaporation' (one time).

Validation
Results of validation by independent data are summarized in Table 2 and Figure 5. In all four cases (RK BS , RK ES , RFK BS , RFK ES ), difference is not more than one category (−1, 0 or 1) in 72%-73% of the samples. Prediction is greater than observation in 37%-38% of the cases, observed values are greater than predicted values in 34%-36% of the points.
Results of the pixel by pixel comparison between the legacy data and the four result layers are summarized in Table 3 and Figure 6. According to the results, predicted values are significantly greater than observed values. In all four cases, 1 category is the difference in more than half of the pixels. However, in 85%-88% of the pixels, the difference is not more than one (−1, 0 or 1).  Results of the pixel by pixel comparison between the legacy data and the four result layers are summarized in Table 3 and Figure 6. According to the results, predicted values are significantly greater than observed values. In all four cases, 1 category is the difference in more than half of the pixels. However, in 85%-88% of the pixels, the difference is not more than one (-1, 0 or 1). Table 3. Difference between the legacy map's inundation frequency and the categorized inundation result maps expressed in percentage.

Discussion
Our results showed that there is no significant difference between the accuracy of the two methods used in this study suggesting that both RK and RFK could be appropriate for predicting and mapping IEW hazard. Although both methods performed equally well, we suggest the usage of RFK instead of RK. First of all, numerous studies have demonstrated that random forest commonly outperforms classical statistical techniques (e.g., [69,70,74]). This is because random forest is able to explore, describe, and model complex, non-linear relationships between the response and predictor variables, and in addition, it has been elaborated on a different philosophy aiming at giving the most accurate prediction [71]. Besides, not only does less assumption have to be made on RFK than on RK, but also less preprocessing is needed in RFK [69]. One of the main disadvantages of RK is that it can extrapolate in the feature space, whereas random forest keeps the range of the reference data. Last but not least, RFK can list the rank of the applied environmental co-variables providing explicit information on the importance of them in making prediction.
More co-variables were involved for thematic improvement; however, we did not find that involving them into the modelling would significantly increase the accuracy. On the other hand, according to the importance ranks, the variables added to the ES proved to be the most determining predictors. A possible explanation of similar accuracy of modelling with the two co-variate sets can be the relatively poor spatial resolution of the EU-SoilHydroGrids and layer on recharge and discharge areas, which thematically extended the set of the predictor environmental co-variables.
Our results on prediction importance are in accordance with those of Van Leeuwen et al. [44]. They found that relief had a very important influence in the investigations. We also found, that morphometric derivatives show significant importance in variable importance ranking. The influence of the soil was small in [44], which was assigned to the limited variation of soils on their small study area. In our pilot soil properties, neither seemed essentially important co-variables, as only one soil layer occur in the third and fourth place in the rank. Their results improved by including distance to anthropogenic objects in the training and simulation. Similarly to our findings, distance from surface water bodies proved to be the most important co-variable above all and the distance to channels proved to be the most influential anthropogenic factor.
Our results can be the basis of further investigations, it can support authorities and decision makers in water management projects and issues, improving integrated management practices. Prediction accuracy can be increased by more frequently collected reference data. Van Leeuwen et al. [45] presented a promising method that is capable of continuously identifying IEW over large areas for operative purposes. However, they concluded that more scientific research is needed to improve

Discussion
Our results showed that there is no significant difference between the accuracy of the two methods used in this study suggesting that both RK and RFK could be appropriate for predicting and mapping IEW hazard. Although both methods performed equally well, we suggest the usage of RFK instead of RK. First of all, numerous studies have demonstrated that random forest commonly outperforms classical statistical techniques (e.g., [69,70,74]). This is because random forest is able to explore, describe, and model complex, non-linear relationships between the response and predictor variables, and in addition, it has been elaborated on a different philosophy aiming at giving the most accurate prediction [71]. Besides, not only does less assumption have to be made on RFK than on RK, but also less preprocessing is needed in RFK [69]. One of the main disadvantages of RK is that it can extrapolate in the feature space, whereas random forest keeps the range of the reference data. Last but not least, RFK can list the rank of the applied environmental co-variables providing explicit information on the importance of them in making prediction.
More co-variables were involved for thematic improvement; however, we did not find that involving them into the modelling would significantly increase the accuracy. On the other hand, according to the importance ranks, the variables added to the ES proved to be the most determining predictors. A possible explanation of similar accuracy of modelling with the two co-variate sets can be the relatively poor spatial resolution of the EU-SoilHydroGrids and layer on recharge and discharge areas, which thematically extended the set of the predictor environmental co-variables.
Our results on prediction importance are in accordance with those of Van Leeuwen et al. [44]. They found that relief had a very important influence in the investigations. We also found, that morphometric derivatives show significant importance in variable importance ranking. The influence of the soil was small in [44], which was assigned to the limited variation of soils on their small study area. In our pilot soil properties, neither seemed essentially important co-variables, as only one soil layer occur in the third and fourth place in the rank. Their results improved by including distance to anthropogenic objects in the training and simulation. Similarly to our findings, distance from surface water bodies proved to be the most important co-variable above all and the distance to channels proved to be the most influential anthropogenic factor.
Our results can be the basis of further investigations, it can support authorities and decision makers in water management projects and issues, improving integrated management practices. Prediction accuracy can be increased by more frequently collected reference data. Van Leeuwen et al. [45] presented a promising method that is capable of continuously identifying IEW over large areas for operative purposes. However, they concluded that more scientific research is needed to improve the determination of the threshold for the active data processing workflow and to reduce the number of false positives.
We could provide a more specific support, if we could distinguish which type of IEW appeared at the area [47]. Furthermore, it would be useful to involve land use information and agricultural practices into the investigations: not only as co-variable, but as part of an improved risk analysis model. Since inundation of different land use categories implies different economical risks [8]. More precise analyses could be achieved if annual information of land use changes (e.g., crop rotation, agricultural engineering) were involved.
Finally, our results can contribute to climate scenario analyses. Mezősi et al. [46] involved only climatic factors into their investigations, they used neither IEW inundation data, nor other environmental factors. We assume, that involving more data and environmental factors can decrease the uncertainty of future predictions.
According to our former experiences [54,55], the applied hybrid spatial prediction approaches can be suggested to be used not only in relatively small, but in larger study areas as well.

Conclusions
To summarize in brief, our aim was spatial modelling of IEW hazard in a Hungarian study area with two hybrid spatial prediction approaches, which combine multivariate statistics and machine learning respectively with geostatistics. We applied Regression Kriging (RK) and Random Forest combined with Ordinary Kriging (RFK) based on locally experienced IEW frequency observations, involving spatially exhaustive auxiliary data representing IEW forming environmental factors. We also investigated the effect of the applied predictors on the results. We ran the predictive models with two combinations of auxiliary variables to test the effect of the introduction of new predictors. According to the results of the two approaches, we did not find significant differences in their accuracy. Although both methods are appropriate for predicting inland excess water hazard, we suggest the usage of RFK, since (i) it is more suitable for revealing non-linear and more complex relations than RK, (ii) it requires less presupposition on and preprocessing of the applied data (iii) keeps the range of the reference data, while RK tends more heavily to smooth the estimations and (iv) it provides a variable rank, providing explicit information on the importance of the used predictors. Involving more co-variables into the mapping process for thematic extension did not prove to be effective according to the accuracy assessment, presumably due to the poor spatial resolution of soil hydrophysical data and the layer on recharge and discharge areas. We attribute this failure to the inference of the expected improvement in thematic extension with the relatively poor spatial representation of the potential key (inundation forming) factors.
Based on our results, we conclude that area-based conditioned random sampling on vectorized legacy data is appropriate as reference data for IEW indundation probability mapping, if modelling is based on multiple datasets. It would be interesting to make further investigations on the accuracy of the results running the models not only with 10, but 20 or even more randomly sampled reference datasets. In the recently occurring nationwide IEW inundation hazard mapping we have increased this number to 20.
Although we did not find significant difference in accuracy provided by the two co-variable packages, we consider the co-variables in the ES package more than useful. We are planning to make further investigations on IEW hazard mapping with the application of more detailed spatial soil hydrophysical data, when it is available for the territory of Hungary.
Significant improvement in prediction accuracy could be expected from more frequently collected reference data. If IEW inundation events were monitored continuously, at unified spatial and temporal resolution, both inundation probability and hazard could be predicted more accurately. The recently developed National Earth Observation Information System [75] and its services could provide and are also expected to make a significant step forward in this field.