Estimating Rainfall Erosivity from Daily Precipitation Using Generalized Additive Models

: One of the most important natural processes responsible for soil loss is rainfall-induced erosion. The calculation of rainfall erosivity, as defined in the Universal Soil Loss Equation, requires the availability of rainfall data, either continuous breakpoint, or pluviograph, with sampling intervals on the order of minutes. Due to the limited temporal coverage and spatial scarcity of such data, worldwide, alternative equations have been developed that utilize coarser rainfall records, in an effort to estimate erosivity equivalently to that calculated using pluviograph data. This paper presents the application of generalized additive models (GAMs) to estimate erosivity utilizing daily rainfall records. As a case study, pluviograph data with a time step of 30 min from the Water District of Thrace in Greece were used. By applying GAMs, it became possible to model the nonlinear relation between daily rainfall, seasonal periodicity, and rainfall erosivity more effectively, in terms of accuracy, than the application of two well-known nonlinear empirical equations, both on a daily and an annual basis.


Introduction
Rainfall erosivity concerns the ability of rainfall to cause erosion on the surface of the earth [1]. Unsustainable land management, growing human pressure on ecosystems, and possible higher erosion rates due to climate change [2] can lead to soil degradation and decreasing agricultural productivity, setting the ecologic balance in danger, and even reducing the efficiency of adaptation measures [3]. Accelerated soil losses and the menace of desertification, which many countries face [4], pose a danger to the achievement of the Sustainable Development Goals of the United Nations to end poverty [5], because these goals depend on a healthy environment and sustainable agricultural production.
Regarding the calculation of soil erosion rates, it is a necessity to compute rainfall erosivity and simulate the effects of land use and the appliance of different land practices [6]. Due to the difficulty in modeling the distribution, size, and terminal velocity of raindrops in relation to the detachment of soil particles, more tractable rainfall indices have been utilized for the calculation of erosivity [7]. Therefore, in the Universal Soil Loss Equation (USLE) [8], the rainfall erosivity factor R was introduced as the product of the rainfall kinetic energy of a storm, classified as erosive by certain criteria, and its maximum 30 min intensity. USLE with its computerized revisions RUSLE [9] and RUSLE2 [10] are the most widely applied soil erosion prediction models globally [11].
A serious issue in the application of USLE is the computation of R, which requires pluviograph records with a length of at least twenty years, so that wet or dry rainfall periods will be incorporated, eliminating any bias [7]. In order to cope with the difficulties that arise due to the absence of adequate pluviograph data, in terms of spatial and temporal coverage, many models have been developed. These models use empirical equations based on various coarser records of rainfall depths (daily, monthly, annual), under specific spatial parameters and climatological data, in order to estimate R. A comprehensive list of such methods can be found in Vantas et al. [12]. Additionally, these models can be used in conjunction with climate models in order to estimate projected changes in R [3].
The use of more efficient techniques, such as machine or statistical learning algorithms, to estimate erosivity is exiguous in the pertinent literature. Recently, Vantas et al. utilized quantile regression forests to estimate current and future values of R [3], and the same authors applied spatiotemporal clustering to define regions in Greece with similar erosivity density patterns [13]. Vantas et al. [12] compared the results from deep neural networks to empirical equations on the estimation of R, and Vantas and Sidiropoulos [14] evaluated the use of Bayesian neural networks on the estimation of weekly erosivity values. A generalized additive model (GAM) is a statistical learning algorithm [15], in the area of supervised learning, that can be used in classification and regression problems. This algorithm is an extension of generalized linear models (GLMs) that automatically fits a set of smoothing functions for each one of the input variables and adds all of these contributions, in order to estimate the output response [16]. In contrast to the use of ordinary linear regression or GLMs, GAMs provide more accurate results if nonlinear relationships exist [15]. In addition, GAMs are easier to interpret in comparison to other machine learning methods, such as random forests or neural networks, etc. [17]. Additionally, GAMs offer the ability to evaluate the results for each one of the terms of the model, to report the approximate significance of smooth terms and to produce confidence intervals of the output response [18]. Uses of GAM in hydrology are relatively scarce and can be found in regional low-flow frequency analysis [19], modeling extreme droughts [20], and empirical streamflow simulation [21].
This study aims to introduce and evaluate the use of GAMs for the estimation of erosivity using daily rainfall records. As a case study, pluviograph data of twelve stations from the Water District of Thrace in Greece were used. Firstly, erosive storms were extracted and their erosivity was calculated using the procedure described in RUSLE2. Subsequently, the calculated erosivity values and precipitation timeseries were aggregated on a daily basis. Afterward, a GAM model and two empirical exponential equations were fitted utilizing the aggregated values. Finally, the performance of the fitted models on estimating daily and annual erosivity was validated, in relation to the directly calculated values, using an independent test set. This type of analysis is a novel approach and it has not been presented, until now, in the international literature.

Data Acquisition and Processing
The study region, located to the north-east of Greece (Figure 1), extends to an area of 11,243 km 2 that covers the Water District of Thrace. It is delimited by the boundaries of Greece with Bulgaria and Turkey on the north and east, respectively, by the Thracian Sea on the south, and by the watershed of Nestos River on the west. This area was identified: (a) To have common rainfall erosivity spatiotemporal patterns [3]; (b) to form a region with a distinct monthly temporal distribution of erosivity density [13], a parameter strongly related to seasonal rainfall intensity [10]; and (c) to have similar intra-storm temporal distribution intensity patterns of heavy and, consequently, erosive precipitation [22].
Pluviograph data for 12 meteorological stations were taken from the Greek National Bank of Hydrological and Meteorological Information (NBHM) [23]. The time series comprised a total of 372 years of pluviograph records with a time step of 30 min for the time period from 1965 to 1996 (Table  1). The time series rainfall records were checked for consistency and cleared from errors.

Rainfall Erosivity Calculations
The erosivity of a single erosive rainfall event, (MJ·mm/ha/h), was computed using the pluviograph records [1]: where is the kinetic energy per unit of rainfall (MJ/ha/mm), the rainfall depth (mm) for the time interval of the hyetograph, which has been divided into = 1,2, … , time sub-intervals, and 30 is the maximum rainfall intensity for a 30 min duration during that rainfall.
The quantity was calculated for each time sub-interval, , applying the kinetic energy equation used in RUSLE2 [10]: where is the rainfall intensity (mm/h).
An individual rainfall event was extracted from the continuous pluviograph data, if its cumulative depth for a duration of 6 h at a certain location was less than 1.27 mm. A rainfall event was considered to be erosive if it had a cumulative rainfall depth greater than 12.7 mm. Given that the use of coarser fixed time intervals as 30 min to a finer one can lead to an underestimation of the value of erosivity, the appropriate monthly conversion factors, previously computed for Greece [3], were applied.
The daily erosivity (MJ·mm/ha/h/d) was either set equal to the sum of all daily events or when a storm was crossed temporally by two or more consecutive days; it was assigned to the first calendar day: where = 1, . . . , are the erosive events that initiate during calendar day d.

Empirical Equations for the Estimation of Erosivity
The empirical equations utilized were the Richardson et al. [24] and the Yu and Rosewell [25] models, which are given in Equations (4) and (5), respectively: where , is the daily erosivity, is the daily precipitation, and the parameters , and , correspond to station s and month m.
where , is the daily erosivity; the parameters , , and are different for every station s; is the month; and ω is the month with the highest average daily erosivity values. The parameters in the previous equations were determined using nonlinear least squares as implemented in the R language package stats [26].

Generalized Aditive Models
As previously mentioned, GAMs provide a generic framework for the extension of GLM and were introduced by Hastie and Tibshirani [16]. Α GLM has the form [27]: where g(.) is a link function, a priori known, that transforms the output response in order to make the model linear; is expected to follow a distribution from the exponential family; is a vector of unknown parameters; and X is the matrix of input variables. A GAM usually has the form [16]: where is the intercept and is the smooth function, called the "smooth," of the j-th input variable . Smooths consist of a set of basis functions [28]: where is the basis function and are unknown parameters. Some indicative basis functions that are used are polynomial, cubic, or penalized regression splines, or thin-plate regression splines. Computationally efficient fitting procedures have been developed and implemented in the open source R language packages gam [29] and mcgv [28], which automatically select the smoothing parameters. The reader can find more details about alternative fitting algorithms in [15,18,[30][31][32].
In this paper, a GAM model was fitted using the daily erosivity as an output response, and the smooths of the logarithm of daily precipitation and the smooth of the month as numerical values (1,2..,12) as input variables. The basis function was penalized thin-plate regression splines. A log link function was used and daily erosivity was assumed to be distributed according to the Gamma distribution as positive and highly skewed [12,33,34].

Validation and Model Performance Criteria
The daily aggregated data were divided using 70% of them as a training set and 30% as testing. The split was based on the years that constitute the dataset. In this way, it was made possible to estimate the out-of-sample error of the models on new data using both the initial daily step of the dataset and a further aggregated annual, based on daily estimates, in order to examine the presence of systematic errors, as previous studies reported systematic underestimation [12,34]. As measures of the out-ofsample error on the test set, the coefficient of determination , the root-mean-squared error (RMSE), and the mean absolute error (MAE) were used: where is the number of the samples in the test set (daily or annual), is the predicted value of that belongs in the test set, and is the mean of . describes the ratio of variance that is explained by a model and may be negative, among other reasons, if an inappropriate model is used [35].

Results and Discussion
The daily dataset consisted of 10,424 records for all the stations. Using the training set: (a) the empirical equations, Equations (4) and (5), were fitted for each station; (b) a single GAM was fitted using all stations' data, representative of the area. Figure 2 illustrates the partial effect plots of the smoothing functions used in the GAM model. Precipitation had a monotonic nonlinear effect and the month a smooth seasonal pattern, with its maximum during summer. Both smoothing functions add to the overall prediction of erosivity.
The reported estimated degrees of freedom (>>1) indicated nonlinear smoothing functions. The computed approximate significance p-values of both smooth terms was less than 10 −16 and the deviance explained by the model was 86.5% (in the training set). Given the above values, there was strong evidence that both terms explained erosivity adequately.
The comparison of the models was based on their error metrics on the test set using daily and annual time steps. These results are given in Tables 2 and 3, where the single GAM performed better than those of the fitted parametric equations (R 2 was closer to one, and RMSE and MAE had lower values). Figure 3 depicts the annual erosivity values as calculated in the test set versus the values estimated from the three models. In these two diagrams, the GAM's estimations present less scatter than those coming from the empirical equations. In addition, all methods did not systematically underestimate R values.

Conclusions
In summary, this paper presented an indicative sample of the use of the GAM in the problems associated with the estimation of erosivity using daily precipitation values. These coarser precipitation data are in abundance worldwide, in contrast to pluviograph values that are required for the calculation of erosivity. Through the use of GAM, it became possible to model the nonlinear relationships in an additive way between rainfall erosivity, the daily rainfall, and seasonal periodicity. Using 30% of the data as an independent test set, just one model gave erosivity estimates with a significantly smaller error on a daily and on a yearly basis, utilizing the sums of the daily values, in comparison to the estimates from the application of two well-known nonlinear empirical equations, which were applied for every station, using data from the Greek Water District of Thrace.
Author Contributions: Conceptualization, methodology and software, K.V.; writing-original draft preparation, E.S.; writing-review and editing, C.E. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.