A Multilayer Perceptron Model for Stochastic Synthesis

Time series analysis is a major mathematical tool in hydrology, with the moving average being the most popular model type for this purpose due to its simplicity. During the last 20 years, various studies have focused on an important statistical characteristic, namely the long-term persistence and the simultaneous statistical consistency at all timescales, when different timescales are involved in the simulation. Though these issues have been successfully addressed by various researchers, the solutions that have been suggested are mathematically advanced, which poses a challenge regarding their adoption by practitioners. In this study, a multilayer perceptron network is used to obtain synthetic daily values of rainfall. In order to develop this model, first, an appropriate set of features was selected, and then, a custom cost function was crafted to preserve the important statistical properties in the synthetic time series. This approach was applied to two locations of different climatic conditions that have a long record of daily measurements (more than 100 years for the first and more than 40 years for the second). The results indicate that the suggested methodology is capable of preserving all important statistical characteristics. The advantage of this model is that, once it has been trained, it is straightforward to apply and can be modified easily to analyze other types of hydrologic time series.


Introduction
Stochastic models first appeared in hydrology in the early 1950s in the 1954 work of Barnes [1], who generated a 1000-year sequence of mutually independent synthetic annual inflows to design a reservoir on the Upper Yarra river in Australia. Later, Thomas and Fiering [2] introduced the first stochastic model that was capable of reproducing some characteristics of the statistical properties of the natural process. Over the years, various stochastic techniques appeared, with the most popular being the autoregressive models (AR), the moving average models (MA), and their combination (ARMA), which are also known as Box-Jenkins models [3].
Since the introduction of the stochastic models, and after extensive research in this scientific area, various challenges have been highlighted. For example, the magnitude of the autocovariance of the generated time series decays exponentially unless specific techniques are employed [4]. Another challenge is when the time window of the studied hydrologic process extends over different timescales, e.g., generating daily rainfall time series with a length of hundreds of years. In this case, a single model cannot simultaneously focus on the stochastic properties at multiple scales [5].
To cope with these issues, various researchers have suggested approaches that preserve the Hurst coefficient [6] and the autocovariance structure of all time scales with a minimal number of parameters [4]. This statistical property is very important in water management applications because it is related to "the tendency of wet years to cluster into multi-year wet periods or of dry years to cluster into multi-year drought periods" [7]. Furthermore, the stochastic modeling of long time series extending over multiple scales, has been dealt with by utilizing coupling of stochastic models of different time scales [5]. A series of iterations is performed in order to "synchronize" the lower-level and the higher-level models (i.e., achieve a statistical consistency between these two timescales). This approach requires advanced mathematical frameworks and, hence, specialized tools [8,9].
Other models go one step further by incorporating spatially distributed data obtained by either remote sensing devices (e.g., weather radar and satellites) [10] or from general circulation models [11] or earth system models [12]. These two-dimensional weather generators follow a multi-stage approach to blend the higher-scale information into the lower scale. The stages include the spatial downscaling of the input data and then the temporal downscaling with a stochastic model (by adjusting the model parameters according to the predicted changes from large-scale climate model), and finally, the restoration of the statistical dependencies including the inter-annual variability regarding the long-term trends.
Hydrologic approaches based on rigorous mathematical frameworks are important because they help to obtain insight into the complicated properties of the hydrologic processes. On the other hand, various studies have suggested that black-box approaches, like machine learning, can be used in hydrological applications as more straightforward methods [13]. For example, Shuang and Zhao have used various machine learning approaches (MLP network, AdaBoost, Gradient Boosting Decision Tree) to obtain a prediction of the water demand [14]; Rozos has employed an MLP network to optimally manage a complex water supply system [15]; Shin et al. [16] used a Long Short-Term Memory (LSTM) network to evaluate the impact of the groundwater withdrawal on the groundwater level; Niaghi et al. [17] tested various machine learning approaches regarding their efficiency to estimate the reference evapotranspiration; and Minns and Hall [18] used a feedforward network in rainfall-runoff modeling.
The reason for the popularity of the machine learning methods is their conceptual simplicity and their broad scope of application, along with standardized methodology. Though there are plenty of machine learning applications in time-series analysis, most of them are employed at short-term forecasting or surrogate modeling. There are barely any applications in generating synthetic time series. For example, Campos et al. [19] used a neural network to generate synthetic time series of monthly reservoir inflows that is equivalent to an AR(1) model.
In this study, we are proposing the use of a multilayer perceptron (MLP) network [20] for stochastic synthesis of daily rainfall time series. This MLP-based approach is novel because, in contrast to the existing similar approaches, it reproduces the statistical properties of the corresponding historical time series at multiple scales (Hurst effect). We call this model MLPS. We chose MLP, a half-century-old approach, instead of a more recent type of deep learning network for two reasons. First, we found MLP to be sufficiently powerful despite being parsimonious (only a few dozen parameters). In contrast, the deep learning networks tend to employ a much larger set of parameters, from hundreds up to millions (e.g., AlexNet). Second, our motivation is to provide a stochastic model that is fairly simple to implement, even in a spreadsheet [21], a tool with which practitioners, and stakeholders in general, are familiar so that the proposed method has good chances of being adopted by this important community.

Input Features
The generation of features (i.e., the inputs to the network) is based on formulas (Equation (1) below, and a random number generator) with cyclostationary parameters (e.g., a different parameter value for each month of the year). Therefore, the dates of the synthetic time series should be defined before anything else. The MS Excel date format was used [22] to represent the dates (see Appendix A). For handling the dates, an appropriate function [23] was used to decode a serial number to the corresponding month (used for generating cyclostationary signals) and year (used for aggregating up to annual scale).
The input features were selected to mimic the practices of the previous relevant studies. For example, a symmetric moving average for the annual independent and identically distributed innovations (IIDIs) was used, following Koutsoyiannis' study, [4], whereas, inspired by the Chen et al.'s study [24,25], a first-order Markov chain in the generation of daily IIDI was employed.
The daily IIDIs were generated employing a method similar to that used by Richardson and Wright [26]. Initially, a first-order two-state Markov chain was used to generate the occurrence of wet or dry days with the following equation: where s i is the state of the day i (1 for rainy, 0 for non-rainy); P 11m (estimated from the historical daily time series) is the probability of the day i of month m of the year (m = 1, ..., 12) being wet if the day i − 1 is wet; P 01m is the probability of the day i of month m being wet if the day i − 1 is dry; θ i is a random number following uniform distribution. Then, random numbers v i following a two-parameter Gamma distribution were generated. Different Gamma parameter values were employed for each month of the year (cyclostationarity). To estimate the parameters of the Gamma distribution for the month m, a Nelder-Mead simplex algorithm [27] was employed to minimize the distance between the distributionsF m and Γ(α m , β m ), whereF m is the empirical distribution of the daily non-zero depths of all historical rainfall values of month m and Γ(α m , β m ) is the Gamma distribution with the optimized parameters α m and β m .
Finally, the daily IIDIs were obtained with the z-score normalization [28] v where µ v and σ v are the mean and the standard deviation of the time series s i v i . The annual IIDIs, V i , were generated employing also a two-parameter Gamma distribution. The two parameters (a single set, no cyclostationarity) were obtained, like previously, with a fitting procedure employing a Nelder-Mead simplex algorithm. These annual values were initially disaggregated to the daily scale (the time step of the stochastic model), keeping the same value for each day belonging to the same year; then, a moving average was applied with a window of 365 days to smooth out the time series. Note that this smoothing does not induce a central limit theorem effect, because at each summation of the moving average, the 365 values are repetitions of only 2 unique random values (corresponding to two consecutive years). After applying z-score normalization to the smoothed time series, the annual IIDIs were obtained.
After obtaining v i and V i , the features were assembled into a matrix of 14 columns, of which each row F i is given by the following equation.

Topology
The optimum topology of the network was found to be 2-2-1 (i.e., two hidden layers). This topology, for 14 inputs, introduces 39 parameters, i.e., 34 weights and 5 biases. The activation function for the last layer was ReLU, whereas LReLU was used for the remaining layers [29]. ReLU is ideal for this application, because it not only facilitates the faster training of the network but also ensures the non-negative values for precipitation. LReLU was used for the other two layers to avoid the 'dying ReLU' problem [29]. The topology of the network is displayed in Figure 1.

Cost Function
The performance of a network, i.e., the average difference between the actual y i and the simulated valuesŷ i , is measured with the cost function [30]. The most typical cost function is the Mean Squared Error (MSE), which for the vector x, containing the weights and biases of the network is given by Equation (4).
Besides MSE, there are plenty of alternative cost functions (e.g., MAE, Entropy Loss, Divergence Loss, etc., [31]). All of these have one thing in common: they penalize the deviation of the simulated values from the target values (the historical values in our case). It becomes evident that such a kind of cost function is not suitable for MLPS because the objective is to preserve the statistical structure, not to fit the output to the historical values.
After many experiments, the objective function was fixed on Equation (5), in which 16 metrics (referred to as scalar function D(, )) were combined to obtain a single value.
where x is the vector containing the weights and biases of the MLPS network; [W 1 ...W 16 ] are the weights of the 16 metrics, which, after numerous trials, were fixed to the values [2, 2, 2, 2, 2, 2, 2, 4, 10, 10, 10, 10, 10, 10, 10, 100]; D(, ) is given in Equation (6); n 1Si is the number of the sample values lying within the i-th interval (i.e., the histogram; see [32], Equation (5.1)) of the event duration of the synthetic time series (subscript 'S' is for synthetic) obtained by the MLPS network when run with the values of weights and biases of x; n 1S = ∑ k1 i=1 n 1Si ; k1 is the number of intervals into which the space of the event duration is discretized (the same discretization of space is used for both historical and synthetic time series); the following six metrics, with the indexes 2, 3, 4, refer to the dry spell duration, annual rainfall depth, and daily rainfall depth, respectively; ρ Sl is the l-lag auto-correlation of the aggregated to annual scale synthetic time series; σ 365S is the standard deviation of the annual values of the synthetic time series; σ 12Si is the standard deviation of the synthetic time series values of the ith month of the year; µ 12Si is the mean value of the synthetic time series values of the ith month of year; r S is the lag-1 auto-correlation of the daily synthetic time series;μ 4S is the kurtosis of the daily synthetic time series;μ 3S is the skewness of the daily synthetic time series; σ S and µ S are the standard deviation and the mean value of the daily synthetic time series, respectively. Please note that for the typesetting, the IAHS guidelines [33] are followed to improve the clarity of the employed symbols. That is, textual subscripts are in upright (Roman) font. For example, the textual subscripts S and H correspond to synthetic and historical time series, respectively. Therefore, the definitions of the corresponding variables for historical (subscript 'H') time series can be easily derived from the previous ones.
where a, b are vectors of size n and d(a, b) = |a − b|/ max(min(|b|, |a|), δ). The tolerance coefficient δ (a kind of measure of the significance of the decimal precision) is used to avoid awarding large penalty values to deviations of minor significance. For example, suppose the mean value of the wet month is 5 mm/day and the mean value of the dry is 0.05 mm/day; then, a 0.1 mm/day output from the stochastic model for the dry month is considered satisfactory. Adopting a δ value equal to 0.1 allows to give to this model output a penalty of only 0.5 instead of 1.0 without using δ. Equation (6) is actually the mean absolute error, which gives more weight to the agreement between the average rather than the peaks of the compared values [34]. The value of l max in Equation (5) is selected after a preliminary analysis of the historical time series to include all significant ρ Hl values.
Note that, (i) in the metrics 1, 3, 5, and 6, the relative number of samples of the historical time series that are 0 are not taken into account in D(, ). These 0 values, appearing as gaps in the middle of the histogram of the historical values, are considered artifacts. (ii) The second, forth, and seventh metrics introduce an extra penalty to the failure of preserving the frequency of the most extreme value.

Training
The MLPS network training was accomplished with Genetic Algorithms (GA) [35] instead of the standard backpropagation approach [36]. GA is much slower than any optimization method based on backpropagation. However, the latter is based on the chain rule (e.g., see [30]), which cannot be applied, since there is no closed-form expression of the derivative of Equation (5) with respect to the MLPS network output.
The parameters of the GA were as follows: population size 200, maximum number of generations 800, crossover fraction 0.8, Scattered crossover faction, Gaussian mutation function, elite count 2, scale and shrink both equal to 1. At each generation, 200(1 − 0.8) − 2 individuals were mutated. The initial population followed a uniform distribution with values from −10 to 10 (note that GA is not that sensitive to the values of the initial population like gradient-based methods, which suffer from the problem of exploding/vanishing gradients [37]).
To ensure that the network generalizes well, the training was performed with synthetic time series of significant length. For example, this length (actually the maximum value of index i in Equation (3)) was 360,000 days in both case studies presented below.
During the training, bootstrapping was employed [38]. According to this technique, a different subset of the features was used at each cost function evaluation. This subset should be continuous corresponding to consecutive days. The size of the subset was 20% of the length of the features. This technique not only reduces the computational time (since the metrics of Equation (5) are applied on shorter time series) but also helps to improve the model generalization [39].
MLPS was implemented in MATLAB language. GNU Octave (MATLAB open source equivalent) was used in this study to run MLPS. A GA implementation that supports parallel computing [40] was used to accelerate the model training. MLPS was run on a virtual server with eight cores [41].

WeaGETS
WeaGETS [24] was used as a reference model to evaluate MLPS performance. WeaGETS is a single-site stochastic weather generator (Tmin, Tmax, and rainfall). At the daily time step, first-order Markov chain is employed to switch between dry and wet days. The rainfall depth of the days that are deemed wet is obtained with a random number generator (alternative distributions are available). Then, to account for the low-frequency variability, the daily rainfall depth is corrected using power spectra of the observed time series at monthly and yearly scales. A method similar to coupling of stochastic models suggested by Koutsoyiannis [5] was also employed. The parameters used in WeaGETS were the following: Daily precipitation threshold 700 Number of years to generate No Smooth the parameters of precipitation occurrence and quantity 1 Order of Markov Chain to generate precipitation occurrence Skewed normal Distribution to generate wet day precipitation amount Unconditional Scheme to generate maximum and minimum temperatures No Correct the low-frequency variability of precipitation It should be noted that MLPS is not compared against WeaGETS in terms of performance superiority. MLPS's advantage is its conceptual simplicity and straightforward applicability and not any improvement in performance against established models. In fact, WeaGETS was intentionally handicapped by turning off the correction of the low-frequency variability of precipitation. The motive for this was to demonstrate the significance of the long-term persistence effect in time-series analysis.

Application to Hohenpeissenberg
The historical rainfall time series measured at the Hohenpeissenberg Observatory was obtained from Deutscher Wetterdienst. The climate of this location is oceanic (Köppen: Cfb), affected by altitude and proximity to the Alps. The time series starts on the 1 January 1879 and ends on 31 October 2020.
The comparison of the overall statistics of the synthetic time series and historical time series obtained from the weather station of Hohenpeissenberg is given in Table 1. Both models preserved all statistical characteristics well, with the exception of the autocorrelation with 1-day lag of the time series of WeaGETS. Figures 2 and 3 display the monthly mean and standard deviation of the synthetic and historical values of the Hohenpeissenberg weather station. According to these figures, both models satisfactorily preserved the monthly mean value and the standard deviation.    To properly interpret this plot, it should be kept in mind that the longer the bar above the horizontal axis, the higher the frequency, whereas the longer the bar below the horizontal axis, the lower the frequency. For example, the long bars at depths greater than 100 mm correspond to frequency values very close to 0 (no bar is plotted for the values that are equal to 0). According to this figure, both models performed very well. Figure 5 displays the histogram of the annual (aggregated from daily time step with plain summation) synthetic and historical time series.

Application to Gibraltar
The historical rainfall time series measured at the Gibraltar meteorologic station was obtained from freemteo.org. The climate of this location is Mediterranean (Köppen: Csa). The time series starts on the 1 January 1974 and ends on the 29 December 2015.
The comparison of the overall statistics of the synthetic time series and the historical time series obtained from the weather station of Gibraltar is given in Table 2. As in the previous application, WeaGETS underestimates the 1-day lag auto-correlation.  Figures 9 and 10 display the monthly mean and standard deviation of the synthetic and historical values of the Gibraltar weather station. According to these figures, both models preserved satisfactorily the monthly mean value and the standard deviation. Figure 11 displays the histogram of the daily time series of the synthetic and historical values. To properly interpret this plot, it should be kept in mind that the longer the bar above the horizontal axis, the higher the frequency, whereas the longer the bar below the horizontal axis, the lower the frequency. Both models performed relatively well. Figure 12 displays the histogram of the annual (aggregated from daily time step with plain summation) synthetic and historical time series. Figure 13 displays the histogram of the rainfall events' duration. According to this figure, both models performed relatively well. Figure 14 displays the histogram of the duration of the dry spells. According to this figure, WeaGETS performs very well for the low and medium values of duration. However, it underestimates the frequency of the higher duration values. MLPS appears to slightly overestimate the frequency of the higher duration values.

Discussion
The results of the two case studies suggest that the MLPS performance regarding the statistical properties related to short-term memory was equivalent to the performance of an established stochastic model. WeaGETS did not correctly estimate the frequency of the more extreme events in Figures 5,7,12,and 14. This is because of the configuration used in WeaGETS (see Section 2.2), which was selected intentionally to demonstrate the long-term persistence effect and its importance in time series analysis. This is also the reason for the underestimated annual standard deviation of WeaGETS time series (see first row of Tables 1 and 2). According to climacograms [42] in Figures 8 and 15, this underestimation is more profound at higher scales.
The climacograms in both Figures 8 and 15 demonstrate a double-bend shape, whereas the marks of the historical values oscillate at the higher scales. The reason for the first of the two bends, at the scale of 365 days, is attributed to the annual seasonality [43]. The second bend is typically observed in time series related with atmospheric phenomena and is attributed to turbulence effects in the atmosphere [43]. Finally, the oscillations of the climacogram of the historical time series are because of the limited length of the time series at the higher scales of aggregation. According to Dimitriadis and Koutsoyiannis [43], these oscillations appear at the scale at which the length of the aggregated time series is 10% of the length of the observed time series.
It should be noted that a significant CPU time is required for the training of the MLP network. It took about half an hour for this procedure on an eight-core machine. However, once the weight and biases of the network are available, the time required to apply the network is negligible. Moreover, the implementation of the network is so straightforward (feed-forward network) that it can be accomplished even in a spreadsheet (an example can be found in [21]).
The dates of the synthetic time series of MLPS follow the Gregorian calendar, including leap years. This is important in stochastic forecast [3] applications, since in this case the dates of the forecast synthetic time series (which start after the end of the historical time series) increase following the calendar days. The MPLS scheme described previously does not support stochastic forecast. However, Koutsoyiannis [4] describes a generic methodology that can be employed with any stochastic model. Specifically, the steps are as follows: (i) generate the covariance matrix of the known past values employing the parametric formula of generalized autocovariance structure; (ii) generate the synthetic time series with MLPS and obtain, for the common period, the time series of difference between historical and synthetic; (iii) for each "future" value of the synthetic time series, generate the vector of the covariances between this "future" value and all past values (the number of historical values that should be used depends on the properties of the studied system [44]); (iv) apply a linear formula (Equation (40) in [4]) to properly recondition the MLPS outputs.
Though the MLPS configuration employed in this study is adequate for daily rainfall, the model could be applied with some adjustment to other type of applications. For example, for other types of hydrologic variables, like temperature, another approach to obtain features should be employed (e.g., random numbers following normal distribution, conditioned on wet or dry status [45]). For other timescales, e.g., hourly, both the features (e.g., a different type of random number generator) and the cost function (including statistical metrics for the fine timescale) should change. Finally, for multivariate problems, one would need to include additional appropriate metrics to Equation (5) (e.g., crosscorrelation at daily scale, annual scale, etc.), some extra features (e.g., a different set of IIDI for each variable), and an additional output node for each additional hydrologic variable.
Machine learning approaches are often criticized because of the opacity of their inner workings and the lack of interpretability. Recently, some efforts have been made to blend existing scientific knowledge with learning algorithms. For example, various researchers have employed genetic programming [46,47] as an induction framework that finds optimum configurations of a model based on predefined building blocks. This approach, ideal in cases without sufficient insights regarding the system characteristics, provides some physical meaningfulness to the employed machine learning models. Other researchers have modified standard deep learning networks (for example, LSTM) to allow for a set of static system attributes to be used as inputs. Identification of attribute similarities by the model corresponds well to what would be expected from prior hydrological understanding [48]. In this study, we have tried a similar approach to incorporate scientific knowledge. More specifically, instead of directly statically using the system attributes as model inputs, we have carefully selected and prepared the model inputs according to these attributes/statistical properties (see Section 2.1.1). The employed custom cost function (see Section 2.1.3) also incorporates scientific knowledge specific to the stochastic synthesis problem by defining in an abstract manner the desired statistical properties of the model output.

Conclusions
In this study, a multilayer perceptron network, called MLPS, was employed for producing synthetic daily time series of rainfall. The objective was to develop a tool that • Preserves the stochastic properties in multiple scales (e.g., daily, annual); • Preserves the autocovariance structure including the long-term persistence in multiple lags; • Is straightforward to apply; and • Can handle a variety of stochastic problems despite being based on a simple concept.
This approach was applied to two locations with different climatic conditions and was tested against an established stochastic model. The main disadvantage of the proposed approach is the significant time required for the training of the model. However, the results of these applications suggest that the previous objectives were accomplished.
The use of the MLPS required an appropriate formulation of the input features and the cost function. In the two case studies, MLPS generated synthetic time series of rainfall at a single location. However, the MLPS configuration (the features and the cost function) could be easily modified to be applied to other types of hydrologic variables, or to support multivariate modeling. Similarly, after minor modifications, MLPS could be employed in a stochastic forecast.
The questions that time series analysis is called to answer differ among various applications. For example, in a dry region, the correct estimation of the frequency of dry spells is very important for a water management study. On the other hand, in a flood protection study, the frequency of the extreme values is of utmost importance. The main approach of the suggested model is based on the principle 'select what you want to preserve' (via the metrics and weights of Equation (5)). This gives the flexibility of adjustment without requiring any fundamental modification of the mathematical framework. Thus, the suggested model can be easily tailored for each application to emphasize the most critical statistical property.
Finally, this study demonstrated that scientific knowledge can be infused into machine learning models by properly preprocessing and selecting the model inputs and by inducing technically in the cost function an elaborated and generic description of the desired structure and properties of the model output.