SARIMA Approach to Generating Synthetic Monthly Rainfall in the Sin ú River Watershed in Colombia

: Seasonal Auto Regressive Integrative Moving Average models (SARIMA) were developed for monthly rainfall time series. Normality of the rainfall time series was achieved by using the Box Cox transformation. The best SARIMA models were selected based on their autocorrelation function (ACF), partial autocorrelation function (PACF), and the minimum values of the Akaike Information Criterion (AIC). The result of the Ljung–Box statistical test shows the randomness and homogeneity of each model residuals. The performance and validation of the SARIMA models were evaluated based on various statistical measures, among these, the Student’s t-test. It is possible to obtain synthetic records that preserve the statistical characteristics of the historical record through the SARIMA models. Finally, the results obtained can be applied to various hydrological and water resources management studies. This will certainly assist policy and decision-makers to establish strategies, priorities, and the proper use of water resources in the Sin ú river watershed.


Introduction
Precipitation is one of the variables commonly used to study climate variability. It is the result of the interaction of various physical phenomena and is characterized by its spatial and temporal variation. Thus, the analysis of rainfall data is essential for the prediction of meteorological information and also for the planning and management of water resource systems [1][2][3][4].
Missing data are a very frequent problem in climatology, and they influence the quality of the results, impacting hydrological studies as well as water resource management. These studies require complete and reliable records of rainfall data. Some methods to solve the problem of the reliability of obtained results and findings can be found in the literature [5][6][7][8]. Among these, stochastic models stand out. They are used to generate synthetic time series, which exhibit similar statistical characteristics to the observed data and behave according to probabilistic laws. Consequently, the observed series is only one of the possible achievements of the stochastic process and, thus, the forecast is not an exact but a possible scenario [9]. For instance, Autoregressive Moving Average (ARMA) models are stochastic models based on probability theory, so they can represent the temporal uncertainty of data.  Furthermore, according to information extracted from the climatological stations by the Colombian Institute of Hydrology, Meteorology, and Environmental Studies (IDEAM), the Sinú river watershed is known for having a unimodal rainfall regime, with a dry season and rainy season per year. The rainy season begins in April and ends in November, during which more than 80% of the annual rainfall occurs. Extreme rainfall events appear in the Upper Sinú basin, reaching magnitudes of greater than 4000 mm annually and decreasing from south to north. Thus, in the lower part of the river (delta) the precipitation is about 1300 mm. Similarly, it was evident that the average air temperature is above 27 • C and varies from south to north as the precipitation does. Consequently, in the upper Sinú there are temperatures of up to 17 • C, while in the rest of the basin the temperature exceeds 32 • C.
The data used for the modeling were obtained from 75 climatological stations ( Figure 2) from the IDEAM database, and it was verifyied that they had influence in the study area using Thiessen Polygons [45]. The missing rainfall data were estimated using the ClimGen model [46,47]. To gain information to validate the models obtained, the original data were divided so that a calibration vector was obtained that corresponded to the first 90% of the data and a validation vector was obtained that was equivalent to the remaining 10%. This criterion was chosen considering what was reported by Dabral and Murry [48] and Nury et al. [49]. This allows the user to make a forecast with the selected model and compare the observed and predicted rainfall during the period corresponding to the validation vector.
To guarantee the normality of the calibration vectors, a transformation was performed to make the transformed data approximately Gaussian and stabilize the variance of the time series. Although there are many families of transformations which may be used for such a purpose, the Box-Cox power transformation was found to be useful in various fields including hydrology and meteorology, as reported by Chatfield [50] and Dabral and Murry [48]. Given an observed time series, x t , the transformed series is given by: where y t is the Box-Cox transformed data and λ denotes the power parameter chosen to make the transformed data approximately Gaussian. Before determining the optimal forecast model, a filtering procedure was performed. The transformed calibration vectors were decomposed in three components: trend, seasonal, and remainder. According to Cleveland et al. [51], the trend component is the low frequency variation in the data together with nonstationary, long-term changes in the level; the seasonal component is the variation in the data at or near the seasonal frequency, normally one cycle per year; and the remainder component is the remaining variation in the data beyond that in the seasonal and trend components [1,52]. The equation of the additive decomposition is: where the time series, trend component, seasonal component, and the remainder component are denoted by X t , T t , S t , and R t , respectively.

of 16
Atmosphere 2020, 11, x FOR PEER REVIEW 5 of 19

SARIMA Models
The SARIMA model is based on the application of the ARMA models to a transformed time series where the seasonal and non-stationary behavior has been eliminated.
For a seasonal series with s periods per year, the SARIMA(p,d,q) × (P,D,Q)s models are used. Thus, having a B s operator such that B s Xt = Xt-s and since the seasonal difference can be written as (Xt -Xt-s) = (1 -B s )Xt, a SARIMA model with (p,d,q) non-seasonal order terms and (P,D,Q) seasonal order terms-SARIMA(p,d,q) × (P,D,Q)s-has the following structure [50]:

SARIMA Models
The SARIMA model is based on the application of the ARMA models to a transformed time series where the seasonal and non-stationary behavior has been eliminated.
For a seasonal series with s periods per year, the SARIMA(p,d,q) × (P,D,Q) s models are used. Thus, having a B s operator such that B s X t = X t−s and since the seasonal difference can be written as Atmosphere 2020, 11, 602 6 of 16 (X t − X t−s ) = (1 − B s )X t , a SARIMA model with (p,d,q) non-seasonal order terms and (P,D,Q) seasonal order terms-SARIMA(p,d,q) × (P,D,Q) s -has the following structure [50]: where Φ(B s ) and Θ(B s ) denote polynomials in B s with P and Q order, respectively, while φ(B) and θ(B) are polynomials in B with p and q order, respectively. Four stages were followed to determine the optimal forecast model in each station: identification, estimation, verification, and application or forecast, as suggested by Box et al. [53], Salas and Obeysekera [54], and Burlando et al. [55]. In the first instance, it was required to verify the data stationarity, and the general form or model order was estimated. Then, the model parameters were estimated using the maximum likelihood method, and the model suitability was verified using the Ljung-Box statistical test to prove that the residuals behaved as white noise [56]. The optimal model was chosen from among those that satisfactorily adjusted the data. There are several tools, such as the Bayesian Information Criterion (BIC), Hannan-Quinn criterion (H-Q), and the coefficient of determination (R 2 ), to determine the best model [1,49]. However, in most of the carried-out research, the Autocorrelation Function (ACF) and the Partial Autocorrelation Function (PACF) have been used to select the optimal model. Considering this, the ACF and PACF were used to determine the best type of model. The Akaike Information Criterion (AIC) of each model was estimated and the one with the lowest AIC was chosen [53,57].
Finally, once the optimal model was determined, the forecast was made and the results were compared with the validation vector to evaluate the performance. For all this, an algorithm was designed in the programming language for statistical analysis "R", which also allowed us to identify the most suitable forecast model from the AIC. Thus, different model options, such as AR, MA, ARMA, ARIMA, and SARIMA, were tested, and it was determined that the models that showed the best performance were the SARIMA.

Results and Discussion
To illustrate the methodology used to determine the optimal forecast models, the results from the Momil station will be used as an example. The corresponding calibration vector can be observed in Figure 3. First, to guarantee the normality of the series, a data transformation was performed using Box-Cox ( Figure 4); then, the Augmented Dickey-Fuller test (ADF), a unit root test that allows accepting or rejecting the null hypothesis of stationarity in a time series for validation, was used [48,58]. where Φ(B s ) and Θ(B s ) denote polynomials in B s with P and Q order, respectively, while ∅(B)and θ (B) are polynomials in B with p and q order, respectively. Four stages were followed to determine the optimal forecast model in each station: identification, estimation, verification, and application or forecast, as suggested by Box et al. [53], Salas and Obeysekera [54], and Burlando et al. [55]. In the first instance, it was required to verify the data stationarity, and the general form or model order was estimated. Then, the model parameters were estimated using the maximum likelihood method, and the model suitability was verified using the Ljung-Box statistical test to prove that the residuals behaved as white noise [56]. The optimal model was chosen from among those that satisfactorily adjusted the data. There are several tools, such as the Bayesian Information Criterion (BIC), Hannan-Quinn criterion (H-Q), and the coefficient of determination (R 2 ), to determine the best model [1,49]. However, in most of the carried-out research, the Autocorrelation Function (ACF) and the Partial Autocorrelation Function (PACF) have been used to select the optimal model. Considering this, the ACF and PACF were used to determine the best type of model. The Akaike Information Criterion (AIC) of each model was estimated and the one with the lowest AIC was chosen [53,57].
Finally, once the optimal model was determined, the forecast was made and the results were compared with the validation vector to evaluate the performance. For all this, an algorithm was designed in the programming language for statistical analysis "R", which also allowed us to identify the most suitable forecast model from the AIC. Thus, different model options, such as AR, MA, ARMA, ARIMA, and SARIMA, were tested, and it was determined that the models that showed the best performance were the SARIMA.

Results and Discussion
To illustrate the methodology used to determine the optimal forecast models, the results from the Momil station will be used as an example. The corresponding calibration vector can be observed in Figure 3. First, to guarantee the normality of the series, a data transformation was performed using Box-Cox ( Figure 4); then, the Augmented Dickey-Fuller test (ADF), a unit root test that allows accepting or rejecting the null hypothesis of stationarity in a time series for validation, was used [48,58]. In the Momil station transformed series graph (Figure 4), peaks are observed approximately once a year, which may be indicators that the series exhibits seasonal behavior. Nonetheless, although components) and analyzing their structure. In this way, Figure 5 shows the additive decomposition of the series in question, where it is possible to identify that the series has a marked seasonality that mainly obeys the unimodal precipitation regime typical from the area; it is evident that a trend is not maintained throughout the series period.
.  Bearing this in mind, it is inferred that s = 12, i.e., the number of periods per year is 12 (one period per month). To check the latter, the series correlograms were calculated from the ACF and the PACF, as shown in Figures 6 and 7, where indeed it was possible to see that the highest correlation occurs when the lag k is 12, indicating that values of a specific month have a greater relationship with those shown in the same month from the immediately previous year. In the Momil station transformed series graph (Figure 4), peaks are observed approximately once a year, which may be indicators that the series exhibits seasonal behavior. Nonetheless, although it is not graphically possible to confirm the seasonality presence, it is possible by decomposing the time series since it allows determining the series components (trend, seasonality, random components) and analyzing their structure. In this way, Figure 5 shows the additive decomposition of the series in question, where it is possible to identify that the series has a marked seasonality that mainly obeys the unimodal precipitation regime typical from the area; it is evident that a trend is not maintained throughout the series period.
Atmosphere 2020, 11, x FOR PEER REVIEW 7 of 19 components) and analyzing their structure. In this way, Figure 5 shows the additive decomposition of the series in question, where it is possible to identify that the series has a marked seasonality that mainly obeys the unimodal precipitation regime typical from the area; it is evident that a trend is not maintained throughout the series period.
.  Bearing this in mind, it is inferred that s = 12, i.e., the number of periods per year is 12 (one period per month). To check the latter, the series correlograms were calculated from the ACF and the PACF, as shown in Figures 6 and 7, where indeed it was possible to see that the highest correlation occurs when the lag k is 12, indicating that values of a specific month have a greater relationship with those shown in the same month from the immediately previous year. Bearing this in mind, it is inferred that s = 12, i.e., the number of periods per year is 12 (one period per month). To check the latter, the series correlograms were calculated from the ACF and the PACF, as shown in Figures 6 and 7, where indeed it was possible to see that the highest correlation occurs when the lag k is 12, indicating that values of a specific month have a greater relationship with those shown in the same month from the immediately previous year.
Moreover, the correlograms (Figures 6 and 7) show a sinusoidal shape, which additionally suggests that the SARIMA model is the most appropriate one to represent the precipitation series of the Momil station. Thus, the different models were adjusted until the optimal one was selected using the AIC. The best fit model was SARIMA (2,0,0) × (2,1,0) 12 , which was selected based on the minimum values of AIC = 2433.6 ( Table 1). The simplified general equation is: where the parameters φ p and Φ p are observed on the left, multiplying the values of X t times the different lags, i.e., values that the variable took before the instant of interest time. In this case, since there are two parameters φ p , it implies that the two previous values of X t, will be multiplied, i.e., it goes back two months in the non-seasonal component, or the same as the model will use the precipitation values of the two immediately preceding months to calculate rainfall at a given point in time. On the other hand, the two parameters Φ p of the seasonal component imply that the model considers the values of the previous 2 years for the month analyzed. Thus, the equation of the SARIMA (2,0,0) × (2,1,0) 12 is the resulting one from solving the product of the polynomials shown in the simplified general equation until arriving at the following expression:

Model AIC
SARIMA(2,0,0) × (2,1,0) 12 2433.599 SARIMA(1,0,2) × (2,1,0) 12 2435.017 SARIMA(3,0,0) × (2,1,0) 12 2435.188 SARIMA(2,0,1) × (2,1,0) 12 2435.246 Figure 8 shows the synthetic series generated (predicted data). It shows that the model can represent both the magnitude of the recorded rainfall and its variation. SARIMA (2,0,0) × (2,1,0)12 is the resulting one from solving the product of the polynomials shown in the simplified general equation until arriving at the following expression:   Figure 8 shows the synthetic series generated (predicted data). It shows that the model can represent both the magnitude of the recorded rainfall and its variation. Then, the residuals were analyzed using the Ljung-Box test and correlograms ( Figure 9). The correlograms show significant peaks in lags 24 and 25, indicating that those order models could more represent the series structure. Nonetheless, using the AIC based on the parsimony principle that the greater the number of parameters of a model, the greater its complexity, and a solution must be sought that allows maintaining a balance between the adjustment and complexity, it is concluded that the best model is the one selected, since it has minimum values of Akaike information criterion (AIC) and therefore uses the fewest parameters to generate a synthetic series. The result of the Ljung-Box statistic test shows the randomness and homogeneity of the residuals. Thus, the SARIMA model is appropriate for forecasting. At this point, it is important to highlight that, although the residuals' analysis is considered important for the analysis of the model's performance, there could be a case where the Ljung-Box test indicates that they are not random for a given significance level. Even so, Then, the residuals were analyzed using the Ljung-Box test and correlograms ( Figure 9). The correlograms show significant peaks in lags 24 and 25, indicating that those order models could more represent the series structure. Nonetheless, using the AIC based on the parsimony principle that the greater the number of parameters of a model, the greater its complexity, and a solution must be sought that allows maintaining a balance between the adjustment and complexity, it is concluded that the best model is the one selected, since it has minimum values of Akaike information criterion (AIC) and therefore uses the fewest parameters to generate a synthetic series. The result of the Ljung-Box statistic test shows the randomness and homogeneity of the residuals. Thus, the SARIMA model is appropriate for forecasting. At this point, it is important to highlight that, although the residuals' analysis is considered important for the analysis of the model's performance, there could be a case where the Ljung-Box test indicates that they are not random for a given significance level. Even so, if the model can adequately represent the series' behavior and preserve the mean of the original data, it is concluded that the selected model is the most appropriate among all those tested and can be used to make forecasts ( Figure 10).
Subsequently, a test of hypothesis concerning the difference between two means with student's t-distribution was used to compare the mean of the calibration vector with that of the synthetic series, and in the same way compare the validation vector mean with that of the forecast obtained [59]. The null hypothesis is that the difference in the means of the two datasets is equal to 0, therefore the alternative hypothesis is that difference in means is not equal to 0.
In the Momil station case, Table 2 shows that at a 5% significance level there is no statistically significant difference between the means of the calibration vector and the synthetic series. This is because the p value (0.15) is greater than 0.05 and because the 95% confidence interval (−3.15, 20.4) includes the 0 value. The above implies that by using the SARIMA model (2,0,0) × (2,1,0) 12 , a synthetic series statistically equal to the original can be obtained, and therefore it is the optimal model for the Momil station. if the model can adequately represent the series' behavior and preserve the mean of the original data, it is concluded that the selected model is the most appropriate among all those tested and can be used to make forecasts ( Figure 10).  Subsequently, a test of hypothesis concerning the difference between two means with student's t-distribution was used to compare the mean of the calibration vector with that of the synthetic series, and in the same way compare the validation vector mean with that of the forecast obtained [59]. The null hypothesis is that the difference in the means of the two datasets is equal to 0, therefore the alternative hypothesis is that difference in means is not equal to 0.
In the Momil station case, Table 2 shows that at a 5% significance level there is no statistically significant difference between the means of the calibration vector and the synthetic series. This is because the p value (0.15) is greater than 0.05 and because the 95% confidence interval (−3.15, 20.4) includes the 0 value. The above implies that by using the SARIMA model (2,0,0) × (2,1,0)12, a synthetic series statistically equal to the original can be obtained, and therefore it is the optimal model for the Momil station.
Finally, when comparing the means of the validation vector and the forecast, it was concluded that at a 2% significance level there is no statistically significant difference, since the p-value (0.023) is greater than 0.02 and the 98% confidence interval (−0.82, 79.65) includes the 0 value ( Table 2). The if the model can adequately represent the series' behavior and preserve the mean of the original data, it is concluded that the selected model is the most appropriate among all those tested and can be used to make forecasts ( Figure 10).  Subsequently, a test of hypothesis concerning the difference between two means with student's t-distribution was used to compare the mean of the calibration vector with that of the synthetic series, and in the same way compare the validation vector mean with that of the forecast obtained [59]. The null hypothesis is that the difference in the means of the two datasets is equal to 0, therefore the alternative hypothesis is that difference in means is not equal to 0.
In the Momil station case, Table 2 shows that at a 5% significance level there is no statistically significant difference between the means of the calibration vector and the synthetic series. This is because the p value (0.15) is greater than 0.05 and because the 95% confidence interval (−3.15, 20.4) includes the 0 value. The above implies that by using the SARIMA model (2,0,0) × (2,1,0)12, a synthetic series statistically equal to the original can be obtained, and therefore it is the optimal model for the Momil station.
Finally, when comparing the means of the validation vector and the forecast, it was concluded that at a 2% significance level there is no statistically significant difference, since the p-value (0.023) is greater than 0.02 and the 98% confidence interval (−0.82, 79.65) includes the 0 value ( Table 2). The Finally, when comparing the means of the validation vector and the forecast, it was concluded that at a 2% significance level there is no statistically significant difference, since the p-value (0.023) is greater than 0.02 and the 98% confidence interval (−0.82, 79.65) includes the 0 value ( Table 2). The test was performed for a confidence level of 98% for the comparison between the validation vector and the forecast since, in this way, there was a greater probability that the hypothesis was satisfied. Since the validation series and the forecast are determined to be statistically the same, it is considered that the SARIMA (2,0,0) × (2,1,0) 12 model adequately represents the data structure from the Momil station and is appropriate for forecasting precipitation. Table 3 shows the optimal model for all the climatological stations in the study area. Table 3. SARIMA models for modeling and forecasting the monthly rainfall in the Sinú river watershed in Colombia.  Table 3. Cont.

Conclusions
This study focused on modeling and forecasting a monthly rainfall series using SARIMA modeling. The results obtained can be applied for various hydrological and water resource management studies. This will certainly assist policy and decision-makers to establish strategies, priorities, and the proper use of water resources in the Sinú river watershed-e.g., the proposed models can be used to determine the occurrence of the El Niño-Southern Oscillation phenomena, to identify the potential for water harvesting use, and to operate irrigation systems.
The monthly precipitation series recorded in the stations of the study area show a seasonal behavior, reflecting the typical unimodal rain regime from the basin. Similarly, it was determined that a trend was not maintained throughout the recording period.
Through the Box-Cox transformation, it was possible to guarantee the stationarity assumption for the precipitation time series recorded in the Sinú river basin. This allowed us to establish that, given the seasonal behavior of the series analyzed, the SARIMA models allow the satisfactory representation of the temporal structure of precipitation in the study area. The SARIMA approach showed an improvement over the integration of the removal of the trend, periodicity, and stochastic components approach in time series modelling. The different performance statistics in the model development and validation phase of this study also confirmed higher prediction accuracy using the SARIMA models.
By means of the stochastic SARIMA models, it was possible to generate a synthetic time series of precipitation without missing data, which allows us to preserve the behavior and structure of the original data. It was shown that the synthetic series obtained are statistically the same as the observed precipitation, and likewise that the forecasts made using the SARIMA models selected for each season adequately represent the behavior of the rainfall in the area. Based on the results, stochastic models are one of the most appropriate techniques for the prediction of rainfall.
It is important to highlight that although the forecasts obtained with the models do not allow predicting the exact precipitation amount, they can reveal the probable trend of future rains and provide information that can help decision-makers to establish strategies in areas such as agriculture, where it is of utmost importance to know the beginning and end of the rainy seasons the planning of civil works; and the preparation of mitigation plans for natural dangers, such as floods and droughts.
Finally, it is worth noting that rational planning and comprehensive management of water resources require forecasting events that may occur in the future, bearing in mind at the same time that it is usually based on past events. For this reason, time series analysis is a valuable tool since it allows making inferences about the future.