Short-Term and Medium-Term Drought Forecasting Using Generalized Additive Models

: Forecasting extreme hydrological events is critical for drought risk and e ﬃ cient water resource management in semi-arid environments that are prone to natural hazards. This study aimed at forecasting drought conditions in a semi-arid region in north-eastern South Africa. The Standardized Precipitation Evaporation Index (SPEI) was used as a drought-quantifying parameter. Data for SPEI formulation for eight weather stations were obtained from South Africa Weather Services. Forecasting of the SPEI was achieved by using Generalized Additive Models (GAMs) at 1, 6, and 12 month timescales. Time series decomposition was done to reduce time series complexities, and variable selection was done using Lasso. Mild drought conditions were found to be more prevalent in the study area compared to other drought categories. Four models were developed to forecast drought in the Luvuvhu River Catchment (i.e., GAM, Ensemble Empirical Mode Decomposition (EEMD)-GAM, EEMD-Autoregressive Integrated Moving Average (ARIMA)-GAM, and Forecast Quantile Regression Averaging (fQRA)). At the ﬁrst two timescales, fQRA forecasted the test data better than the other models, while GAMs were best at the 12 month timescale. Root Mean Square Error values of 0.0599, 0.2609, and 0.1809 were shown by fQRA and GAM at the 1, 6, and 12 month timescales, respectively. The study ﬁndings demonstrated the strength of GAMs in short- and medium-term drought forecasting.


Introduction
Rainfall variability is highly significant on several temporal and spatial scales in southern Africa [1][2][3][4], as most rural livelihoods in the region depend on agriculture, which is largely rainfed. Increasing trends of rainfall have been reported for a few locations over South Africa [5][6][7][8]. However, the authors of [9] cautioned that whilst this may suggest an increase in water resource availability, an increasing population and land use changes, coupled with intensification of agricultural activities, exert pressure on them. Although rainfall trends have been predicted to decrease in the Luvuvhu River Catchment (LRC) in the northeast of South Africa, some stations exhibited increasing trends, which were potentially attributed to the 10 year decadal mean daily fluctuations [10]. The chronic nature of drought disasters in the region further affects social, economic, and environmental aspects negatively [11]. The authors of [9] found an increasing trend of annual maximum temperatures in the Limpopo River Basin, which is consistent with several other land areas. Increased temperatures National Park receive the lowest rainfall (400 mm/a) during the wet season. The catchment's mean annual rainfall is 608 mm, and produces a mean annual run-off of 3 6 10 520 m × [10]. The distribution of rainfall through the year exhibits highly seasonal characteristics, with 95% of the rainfall occurring during the summer months (October and March) [31]. The lower-rainfall area in the catchment tends to experience greater variability than the higher-rainfall areas. Temperature generally increases from the mountains in the west to the lower reaches in the east of the catchment, with local towns, such as Thohoyandou, experiencing daily average temperatures of 33 °C in summer and 24 °C in winter [32]. The study area is predominantly rural, with a community that is highly dependent on rain-fed commercial and subsistence agriculture. The regional climate within which the LRC is located ranges from tropical rain in the coastal plains of Mozambique to tropical dry savannah and tropical dry desert further inland, south of Zimbabwe [33]. The mean spatial pattern of summer rainfall over southern Africa depicts a strong gradient that increases from west to east [34], with local maxima due to orographic effects. The LRC is a sub-basin of the Limpopo River Basin, in which annual precipitation varies between 250 to 1050 mm in the hot, dry western and central areas in the high-rainfall eastern escarpment areas, respectively [33]. The region experiences a high variability between extreme wet and dry seasons, which makes it vulnerable to frequent droughts and floods [35].
The nature and pattern of inter-annual variability of precipitation is crucial, as it exerts long-term control on water resources and affects plant growth and the bio-geochemical cycle while moderating extreme events such as droughts and floods [36]. The Limpopo valley (20-25 °S) is characterized by the highest variability in southern Africa [34], in agreement with [37] and [38]. Figures 2 and 3 show the inter-annual variability of rainfall and streamflow in the LRC for over 57 and 53 years, respectively. The inter-annual variability plots depict a strong seasonal variation in the study area. The rainy season of the region is characterized by alternating wet and dry spells [39], with wet spells recently becoming shorter. Both rainfall and streamflow exhibit positive trends over The regional climate within which the LRC is located ranges from tropical rain in the coastal plains of Mozambique to tropical dry savannah and tropical dry desert further inland, south of Zimbabwe [33]. The mean spatial pattern of summer rainfall over southern Africa depicts a strong gradient that increases from west to east [34], with local maxima due to orographic effects. The LRC is a sub-basin of the Limpopo River Basin, in which annual precipitation varies between 250 to 1050 mm in the hot, dry western and central areas in the high-rainfall eastern escarpment areas, respectively [33]. The region experiences a high variability between extreme wet and dry seasons, which makes it vulnerable to frequent droughts and floods [35].
The nature and pattern of inter-annual variability of precipitation is crucial, as it exerts long-term control on water resources and affects plant growth and the bio-geochemical cycle while moderating extreme events such as droughts and floods [36]. The Limpopo valley (20-25 • S) is characterized by the highest variability in southern Africa [34], in agreement with [37,38]. Figures 2 and 3 show the inter-annual variability of rainfall and streamflow in the LRC for over 57 and 53 years, respectively. The inter-annual variability plots depict a strong seasonal variation in the study area. The rainy season of the region is characterized by alternating wet and dry spells [39], with wet spells recently becoming shorter. Both rainfall and streamflow exhibit positive trends over the sampling periods considered Sustainability 2020, 12, 4006 4 of 20 although neither was statistically significant; rainfall showed an R 2 of 0.0011, while streamflow reported 0.0043. These trend results agree with those obtained by [34] over southern Africa. Statistical trends in this region are affected by extremes, such as the significant flood due to tropical Eline in February 2000 (Figures 2 and 3). From both Figures 2 and 3, the trend equations are y = 0.0005x + 86.734 and y = 0.0005x + 53.446, respectively. This means that both rainfall and streamflow increase at the rate of 0.0005 mm and 0.0005 cumesc per year. This intercept is made using the trendline gradient of the two variables. x y , respectively. This means that both rainfall and streamflow increase at the rate of 0.0005 mm and 0.0005 cumesc per year. This intercept is made using the trendline gradient of the two variables.  Rainfall and temperature data from eight weather stations were obtained from the South African Weather Service, while evaporation data for one station were obtained from the Department of Water and Sanitation within the Luvuvhu River Catchment. The location of all the stations is shown in Figure 1 and additional details are shown in Table 1. These datasets were obtained from  x y , respectively. This means that both rainfall and streamflow increase at the rate of 0.0005 mm and 0.0005 cumesc per year. This intercept is made using the trendline gradient of the two variables.  Rainfall and temperature data from eight weather stations were obtained from the South African Weather Service, while evaporation data for one station were obtained from the Department of Water and Sanitation within the Luvuvhu River Catchment. The location of all the stations is shown in Figure 1 and additional details are shown in Table 1. These datasets were obtained from Rainfall and temperature data from eight weather stations were obtained from the South African Weather Service, while evaporation data for one station were obtained from the Department of Water and Sanitation within the Luvuvhu River Catchment. The location of all the stations is shown in Figure 1 and additional details are shown in Table 1. These datasets were obtained from 1986 to 2016, covering 31 years. The missing data in this study were imputed using the Self-Organizing Maps (SOM) based on the Kohonen Neural Networks [40].

Formulation of the SPEI for the Study Area
The SPEI is based on the computation procedure of the original SPI (Standardized Precipitation Index). The index makes use of either monthly or weekly differences between precipitation and Potential Evapotranspiration (PET) [41]. Due to the complex computation of PET, which involves several variables, including surface temperature, air humidity, soil, incoming radiation, water vapor pressure, and ground-atmosphere latent and sensible heat fluxes [42], this study made use of Hagreaves' and Samani's temperature-based method for PET estimation. This approach has the advantage of only requiring data on monthly mean temperature. The SPI methodology was modified by replacing the two-parameter distribution with a three-parameter distribution (i.e., SPEI requirement) [41]. The latter suggested getting the best fit three-parameter distribution from the L-moment, and the detailed methodology for achieving this can be obtained in [43]. Following the classical approximation of [44], the SPEI for this study was formulated at 1, 6, and 12 month timescales using Equation (1).
where W = −2 ln(P) for P ≤ 0.5, and P is the probability of exceeding a threshold value denoted by D value, P = 1 − F(x). If P > 0.5, then P is replaced by 1 − P and the sign of the resultant SPEI is reversed. The constants C 0 , C 1 , C 2 , d 1 , d 2 , and d 3 , are 2.515517, 0.802853, 0.010328, 1.432788, 0.189269, and 0.001308 as defined by [45], respectively. For this study, the SPEI was computed using the R Package "spei" developed by Begueria [46].

Drought Trends over North-Eastern South Africa
The Breaks for Additive Seasonal and Trend (BFAST) method in Equation (2) is applied to decompose the drought index time series to obtain the trend variations in the study area.
where m is the mean, T t is the trend component value, S t is the seasonal component, and R t is the random component at time t. The monotonic trends in the SPEI time series were obtained through the Mann-Kendall (MK) non-parametric trend test. Based on studies by [47][48][49][50][51], among others, the MK test statistic is calculated from the following formula: Sustainability 2020, 12, 4006 6 of 20 The average value of S is E[S] = 0 and the variance σ 2 is given by the following equation: where t j is the number of data points in the jth tied group, and p is the number of the tied group in the time series. It is important to mention that the summation operator in Equation (5) is applied only in the case of tied groups in the time series to reduce the influence of individual values in tied groups in the ranked statistics. On the assumption of random and independent time series, the statistic S is approximately normally distributed if the following z-transformation equation is used. The value of the S statistic is associated with the Kendall tau, as shown in Equation (7). where With regards to the z-transformation equation in Equation (6), this study considered a 5% confidence level, where the null hypothesis of no trend was rejected if |z| > 1.96. The Mann-Kendall statistic is the Kendall τ term, which is a measure of correlation that indicates the strength of the relationship between any two independent variables, and was also considered important in this study. The MK test system summarized above was applied to the SPEI time series data by writing a code in R and following the instructions given by [52].

SPEI Time Series Forecasting
The forecasting procedure followed in this study started with selecting important variables, formulation of training and testing sets, and determination of model performance. The training data consisted of 70%, while the testing sets were 30% of the total data. The developed models' test sets were based on the correlation coefficient (R), Root Mean Square Error (RMSE), Mean Error (ME), Mean Absolute Error (MAE), Mean Percentage Error (MPE), and the Mean Absolute Percentage Error (MAPE). The forecasting of SPEI at 1, 6, and 12 month timescales using GAM is discussed below.

The Generalized Additive Model without Auto-Correlated Errors
Let y t be the SPEI on month t, where t = 1, . . . , n with the corresponding covariates x t1 , x t2 , . . . , x tp , where p represents the number of variables. The generalized additive model is then written as follows: where y t is the response variable, X t is a matric of covariates, β 0 is the intercept, β j are parameters, s j is a smoothing parameter, and ε t is the error term. It should be noted that, although y t denotes the SPEI at month t, it was used to for all the timescales considered in this study. Equation (9) is estimated using penalized cubic splines [53,54], which is expressed in terms of Equation (10). This should be: The degree of smoothness is controlled by the penalty parameter Λ = (λ j , j = 1, . . . , p), which determines the roughness of the function estimate to the data. It is optimized using the generalized cross-validation criterion (GCV) and is easily implemented in the package 'mgcv' [53,55]. For small values of λ j , the smoothness is rough. The smooth function, s j , is given by Equation (11), which can be explained as the sum of the basis functions b i (x) and their regression coefficients βi.
where q denotes the basis dimension.

The Generalized Additive Model with Auto-Correlated Errors
Let y t be the SPEI as defined in Section 2.4.1, which gives the generalized additive model in Equations (12), where the error terms ε t are assumed to be autocorrelated.
where variables and parameters are as defined above. Time series observations are normally autocorrelated. To correct for autocorrelation, it is normally advised to use time series regression models. This study, therefore, assumes that the error terms εt are auto-correlated and follow the SARIMA model given in Equation (13).
where θ(B) is the non-seasonal moving average operator, and the corresponding seasonal autoregressive and seasonal moving operators are Φ(B) and Θ(B), respectively; v t denotes a white noise series. By expressing Equation (12) in terms of ε t and substituting in Equation (13), we get Equation (14). Table 2 shows the different drought categories in terms of percentages of occurrence of historical drought for all the stations at different timescales. Mild, moderate, severe, and extreme droughts conditions ranged between 61.15-71.88%, 12.65-27.59%, 2.24-21.69%, and 0-6.65%, respectively, across all stations and considering the respective timescales considered in the study. This shows that mild drought is more prevalent in the LRC compared to other drought categories. Stations Muk and Mat showed the highest percentages of extreme droughts (i.e., 6.14%) at one-and six-month timescales, while the Vondo Bos (VB) station showed the same with 6.65% at the 12 month timescale. However, this is still lower than the percentage of extreme events shown by the SPI at the same timescale. The spatial variability of the SPEI at 1, 6, and 12 month timescales is presented in Figure 4. The variability shows Sustainability 2020, 12, 4006 8 of 20 that SPEI 12 was found to be of greater severity compared to the one-and six-month timescales in the middle reaches, while, in the upper reaches, the 12 month timescale showed the least drought severity compared to the SPEI for one and six months.   Figure 5 shows the SPEI time series plot of density, normal quantile to quantile (QQ), and the box plots at the 1, 6, and 12 month timescales before decomposition. To determine the normality of the SPEI data, the Anderson-Darling test was carried out. The initial visual interpretation of the QQ plot suggested departure from the normality of SPEI data, while the detailed Anderson-Darling test Probability-Probability Plot (PP) showed that the data are approximately normally distributed at all the timescales. The authors of [56,57] reported that although visual inspection of normality is used, it is often unreliable, with no guarantees of the results. Johnson SB, Error, and Dagun (4P) distributions were the best fit for the SPEI 1, 6, and 12 month timescales, respectively. From this, the study concluded that the distributions of the SPEI data at 6 and 12 month timescales are bimodal, while the one-month timescale exhibited a unimodal distribution.

Exploratory Data Analysis
(a)  Figure 5 shows the SPEI time series plot of density, normal quantile to quantile (QQ), and the box plots at the 1, 6, and 12 month timescales before decomposition. To determine the normality of the SPEI data, the Anderson-Darling test was carried out. The initial visual interpretation of the QQ plot suggested departure from the normality of SPEI data, while the detailed Anderson-Darling test Probability-Probability Plot (PP) showed that the data are approximately normally distributed at all the timescales. The authors of [56,57] reported that although visual inspection of normality is used, it is often unreliable, with no guarantees of the results. Johnson SB, Error, and Dagun (4P) distributions were the best fit for the SPEI 1, 6, and 12 month timescales, respectively. From this, the study concluded that the distributions of the SPEI data at 6 and 12 month timescales are bimodal, while the one-month timescale exhibited a unimodal distribution.   Figure 5 shows the SPEI time series plot of density, normal quantile to quantile (QQ), and the box plots at the 1, 6, and 12 month timescales before decomposition. To determine the normality of the SPEI data, the Anderson-Darling test was carried out. The initial visual interpretation of the QQ plot suggested departure from the normality of SPEI data, while the detailed Anderson-Darling test Probability-Probability Plot (PP) showed that the data are approximately normally distributed at all the timescales. The authors of [56,57] reported that although visual inspection of normality is used, it is often unreliable, with no guarantees of the results. Johnson SB, Error, and Dagun (4P) distributions were the best fit for the SPEI 1, 6, and 12 month timescales, respectively. From this, the study concluded that the distributions of the SPEI data at 6 and 12 month timescales are bimodal, while the one-month timescale exhibited a unimodal distribution.

Exploratory Data Analysis
(a)

Variable Selection
Variable selection was achieved using gradient boosting. The main objective of a variable selection procedure is to identify the correct predictor variables, which have important influence on the response variable and could provide robust model prediction [58]. Variable selection was conducted for each SPEI time series. The relative importance values are the means of 50 model runs, each based on a randomly selected subset of 90% of the data [59]. Rain showed to be the most important variable for predicting the SPEI at the one-month timescale, while the non-linear trend is the most important predictor for predicting the SPEI at the six-month timescale. Time series components (i.e., trend, seasonality, remainder, etc.) have been successfully used as model inputs in forecasting exercises. For example, the authors of [60] used time series components in forecasting airline passengers using ANN. The authors of [61] reported three significant consecutive lags (i.e., Lags 1, 2, and 3) as input while predicting daily PM10 data. The lagged variable of importance was successfully determined by Principal Component Analysis (PCA). The findings of this study regarding the importance of Lags 1 and 2 are therefore comparable with those reported by [61]. Temperature, which plays a significant role in the development of drought through influence on evaporation, was found to be important in the SPEI at the one-and six-month timescales. The authors of [62] reported that sea surface temperature variability contributed to increased land temperature variability and autocorrelation, which ultimately contributed to persistent droughts in

Variable Selection
Variable selection was achieved using gradient boosting. The main objective of a variable selection procedure is to identify the correct predictor variables, which have important influence on the response variable and could provide robust model prediction [58]. Variable selection was conducted for each SPEI time series. The relative importance values are the means of 50 model runs, each based on a randomly selected subset of 90% of the data [59]. Rain showed to be the most important variable for predicting the SPEI at the one-month timescale, while the non-linear trend is the most important predictor for predicting the SPEI at the six-month timescale. Time series components (i.e., trend, seasonality, remainder, etc.) have been successfully used as model inputs in forecasting exercises. For example, the authors of [60] used time series components in forecasting airline passengers using ANN. The authors of [61] reported three significant consecutive lags (i.e., Lags 1, 2, and 3) as input while predicting daily PM 10 data. The lagged variable of importance was successfully determined by Principal Component Analysis (PCA). The findings of this study regarding the importance of Lags 1 and 2 are therefore comparable with those reported by [61]. Temperature, which plays a significant role in the development of drought through influence on evaporation, was found to be important in the SPEI at the one-and six-month timescales. The authors of [62] reported that sea surface temperature variability contributed to increased land temperature variability and autocorrelation, which ultimately contributed to persistent droughts in North America and the Mediterranean. Mean temperatures were noted to be more dominant compared to the minimum and maximum temperatures in forecasting the SPEI. All predictor variables are significantly different from one another in their relative importance [59]; therefore, all features that appeared to have some relative importance (i.e., ranging from 0 to 100) were selected as input variables in forecasting drought over the LRC and are presented in Table 3.

Short-and Medium-Term Forecasting
To understand the forecasting performance of statistical models, a comparative study was conducted between the four developed models. Figure 6 shows the models' drought forecasting results at all timescales considered in this study. The smoothing effect of the GAM models is evident in the forecasting results. The GAM provides a flexible specification of response by defining the model in terms of smooth functions as a replacement for the detailed parametric relationships on the covariates [63]. The decomposition of environmental time series is expected to improve the forecasting accuracy of models, and it is evident at all timescales that the decomposed GAM performed better than an undecomposed GAM. Although an Ensemble Empirical Mode Decomposition (EEMD)-GAM forecasted the SPEI better than the GAM, from Figure 6, the EEMD-GAM is seen to overestimate drought conditions, especially between 2011 and 2016, with the GAM greatly underestimating the target values, while the EEMD-ARIMA-GAM (i.e., the GAM after correcting residual autocorrelation) improved the forecast. This was noted clearly in the one-month timescale, while for the 6 and 12 month timescales, all models showed some level of improvements. The results of this study agree with those of [64], which found that incorporating corrected residual autocorrelation increased model performance and improved the out-of-sample forecasts. Since forecasting models are imperfect abstracts of reality [65], such behaviors in model outputs are therefore expected, as the forecasts are often not perfect.  Figure 7 is the scatterplot of the different models' outputs at all the timescales considered in this study. All the models' results show that a positive correlation exists between the modeled output and the actual data. The GAM, EEMD-GAM, and EEMD-ARIMA-GAM forecast results showed the lowest correlation at the 1, 6, and 12 month timescales, respectively. The fQRA showed the highest correlations at the one-and six-month timescales, while, for the 12 month timescale, the GAM showed the highest correlation. The high correlation showed by fQRA may be because fQRA is made up of a weighted combination of all the models; therefore, all the models' strength results in fQRA are superior compared to the other models developed in this study. In addition to the correlation between the forecasted and target values, this study employed further statistical measures to determine the model performance, and these are shown in Table 4. The best RMSE was shown by fQRA at the one-and six-month timescales, with the 12 month timescale showing that the GAM performed better compared to the other models. The incorporation of autocorrelation errors as reported by [64] is noted in the model performance at the one-and six-month timescales. These models were seen to have performed better than the GAM and decomposed GAM at the one-and six-month timescales, therefore showing that decomposition coupled with autocorrelation errors improves forecasting accuracy.  Figure 7 is the scatterplot of the different models' outputs at all the timescales considered in this study. All the models' results show that a positive correlation exists between the modeled output and the actual data. The GAM, EEMD-GAM, and EEMD-ARIMA-GAM forecast results showed the lowest correlation at the 1, 6, and 12 month timescales, respectively. The fQRA showed the highest correlations at the one-and six-month timescales, while, for the 12 month timescale, the GAM showed the highest correlation. The high correlation showed by fQRA may be because fQRA is made up of a weighted combination of all the models; therefore, all the models' strength results in fQRA are superior compared to the other models developed in this study. In addition to the correlation between the forecasted and target values, this study employed further statistical measures to determine the model performance, and these are shown in Table 4. The best RMSE was shown by fQRA at the one-and six-month timescales, with the 12 month timescale showing that the GAM performed better compared to the other models. The incorporation of autocorrelation errors as reported by [64] is noted in the model performance at the one-and six-month timescales. These models were seen to have performed better than the GAM and decomposed GAM at the one-and six-month timescales, therefore showing that decomposition coupled with autocorrelation errors improves forecasting accuracy.  The density plots from all the developed models (i.e., GAM, EEMD-GAM, EEMD-ARIMA-GAM, and fQRA) at the three timescales superimposed with actual SPEI time series The density plots from all the developed models (i.e., GAM, EEMD-GAM, EEMD-ARIMA-GAM, and fQRA) at the three timescales superimposed with actual SPEI time series are given in Figure 8a-c. Similar to the results shown by the different model performance measures, at the one- (Figure 8a) and six-month (Figure 8b) timescales, the fQRA models showed fairly good density fits compared to the other models. At the 12 month timescale, both the GAM and fQRA showed good density fits, and the same was noted by the different performance measures. are given in Figure 8a-c. Similarl to the results shown by the different model performance measures, at the one-( Figure 8a) and six-month (Figure 8b) timescales, the fQRA models showed fairly good density fits compared to the other models. At the 12 month timescale, both the GAM and fQRA showed good density fits, and the same was noted by the different performance measures. (a)

Evaluation of Model Uncertainity
Uncertainty analysis in this study was carried out only for the best performing statistical GAM-based models. This was achieved by constructing empirical prediction intervals (PIs) at all timescales. Figure 9 shows the 95% prediction limits of the GAM-based models. The skewness at the one-and six-month timescales was positive, while the 12 month timescale showed a negative skewness. At the 12 month timescale, the model showed the smallest standard deviation, which is indicative of a narrower prediction interval.
El Niño conditions alter moisture flux circulation over southern Africa, thereby influencing the spatial pattern and intensity of drought over the region [66]. The authors of [67] found that it was

Evaluation of Model Uncertainity
Uncertainty analysis in this study was carried out only for the best performing statistical GAM-based models. This was achieved by constructing empirical prediction intervals (PIs) at all timescales. Figure 9 shows the 95% prediction limits of the GAM-based models. The skewness at the one-and six-month timescales was positive, while the 12 month timescale showed a negative skewness. At the 12 month timescale, the model showed the smallest standard deviation, which is indicative of a narrower prediction interval. possible to forecast El Niño events approximately one year ahead, as well as the highest peak based on the characteristics of a previous event. Uncertainty of climatic variables in the future due to climate change is a reality, which ultimately influences drought patterns. Although the models in this study were developed based on historical data, with the availability of downscaled future rainfall and temperature, the developed models can be tested for the uncertainty of future drought patterns.
(a) (b) (c) Figure 9. 95% prediction limits-(a) SPEI 1, (b) SPEI 6 and (c) SPEI 12 month at all timescales considered in the study; LL and UL denote lower limit and upper limit, respectively. El Niño conditions alter moisture flux circulation over southern Africa, thereby influencing the spatial pattern and intensity of drought over the region [66]. The authors of [67] found that it was possible to forecast El Niño events approximately one year ahead, as well as the highest peak based on the characteristics of a previous event. Uncertainty of climatic variables in the future due to climate change is a reality, which ultimately influences drought patterns. Although the models in this study were developed based on historical data, with the availability of downscaled future rainfall and temperature, the developed models can be tested for the uncertainty of future drought patterns.

Discussion
With the increasing frequency and magnitude of extreme hydrological events over the Limpopo River Basin, studies that seek to evaluate the likelihood of these events occurring in the future are important. Since water resources are the core of human existence and environmental health, reliable early warning systems have become important. This study used SPEI time series as a drought-quantifying parameter to describe drought conditions at short-and medium-term timescales (one-and six-month timescales are short-term, while a 12 month timescale is medium-term in this study). To reduce the inherent complexities of hydrological data, time series decomposition was conducted and correction of autocorrelated errors was done as well as forecast combination to improve the forecasting accuracy of the developed models. Mild droughts were found to be the most dominant in the study area compared to severe and extreme drought conditions.
The variability shows that SPEI 12 has greater severity compared to the one-and six-month timescales in the middle reaches, while, in the upper reaches, the 12 month timescale showed the least drought severity compared to SPEI one-and six-months. Four models (GAM, EEMD-GAM, EEMD-ARIMA-GAM, and fQRA) were developed to forecast drought in the LRC at short-and medium-term timescales. The study found that correction of autocorrelated errors and decomposition improved model performance. At the one-and six-month timescales, the fQRA performed better, followed by EEMD-ARIMA-GAM, while the 12 month timescale behaved differently. This timescale showed that the GAM was better than the decomposition with and without corrected autocorrelated errors. An RMSE difference of 0.002% was noted between the GAM and fQRA at this timescale, making these two models the best at this timescale.

Conclusions
This study aimed to characterise drought and evaluate the performance of GAM-based models in drought forecasting in a semi-arid catchment at short and medium terms. Although earth systems models are available for such exercises, these models often provide forecasts gridded over large areas without providing details for small catchments, such as the LRC. This makes studies of this nature important, as they focus on smaller scales where impacts on communities are measurable. The study showed the forecasting strength of GAM-based models for drought at short-term scales and further proved the hypothesis that decomposition and correction of autocorrelation errors together with time series decomposition improved model performance. For medium-term forecasting, this study found that the treatment of a time series did slightly improve the forecast, but an undecomposed GAM showed better performance at this timescale. These models showed their strength for short-and medium-term forecasting and can, therefore, be used for timely water resource decision-making in semi-arid regions. Thus, the forecast would serve as a decision support tool that would ensure advance knowledge of water resource availability and facilitate realistic planning and allocation to meet the minimum water requirements during drought periods, thereby reducing communities' vulnerability to drought impacts. These models can, therefore, be incorporated into early warning systems in these regions to aid in better planning and management of water resources and drought risk reduction in the short and medium terms.