Modelling of Extremely High Rainfall in Limpopo Province of South Africa

: Extreme value theory is a powerful method that is known to provide statistical models for events rarely observed. This paper presents a modelling framework for the maximum rainfall data recorded in Limpopo province, South Africa, from 1960 to 2020. Daily and monthly rainfall data were obtained from the South Africa Weather Service. In this work, the r -largest order statistics modelling approach is used. Yearly blocks were used in ﬁtting a 61 years’ data set. The parameters of the developed models were estimated using the maximum likelihood method. After the suitable model for data was chosen, i.e., GEVD r = 8 , the 50-year return level was estimated as 368 mm, which means a probability of 0.02 exceeding 368 mm in ﬁfty years in the Thabazimbi area. This study helps decision-makers in government and non-proﬁt organisations improve preparation strategies and build resilience in reducing disasters resulting from extreme weather events such as excessive rainfall.


Introduction
Climate extremes such as floods, droughts and heatwaves have become topical issues since they have triggered most natural disasters in recent decades that can potentially affect humans and the natural environment [1]. Climate extreme events are regular across the globe and impact society in various ways, leading to loss of lives, shortage of food, failure of crops, famine, mass migration and health issues [2]. The increased number, frequency and intensity of natural hazards such as floods, heatwaves and hurricanes are generally attributed to climate change [3][4][5]. In Africa, impacts of a changing climate vary significantly by region [6,7]. More than 90% of natural disasters in southern Africa are related to weather, climate and water. Understanding extreme climate events will help prepare and formulate mitigation strategies to cope with events associated with climate change. Modelling and predicting future extreme events become more relevant in commercial agriculture, to insurance companies, statisticians and meteorologists.
Extreme climate and weather events such as floods, droughts and heatwaves negatively impact society, environment and resource management in developing countries [6,8,9]. In South Africa, anomalous cut-off lows, tropical cyclones and tropical storms are the major extreme rainfall producing systems affecting the Limpopo province, while the Botswana High becomes dominant during heatwaves and drought. Extreme weather events are common in Limpopo during summertime and often coincide with mature phases of the El Niño Southern Oscillation. In February 2000, about 700 people lost their lives and over a million residents were displaced in Mozambique due to flooding associated with tropical cyclone Eline [10,11]. In recent decades , southern Africa experienced 491 climate disasters (hydrological, climatological and meteorological) which resulted in 110,978 deaths and left 2.49 million people homeless [8,12]. Therefore, climate extreme events cause risks to the lives and livelihoods of South African society [13]. South Africa is highly vulnerable to extreme climate events due to its geographical location and socioeconomic factors. Several tropical cyclones have distressed various countries such as Madagascar, Mozambique and South Africa [14].
Rainfall is highly variable over southern Africa on several space and time scales [15]. Climate change has altered rainfall characteristics, including duration of the rainy season, the length of dry spells, frequency of rainy days and the occurrence of heavy rainfall events [16]. This results in regular and severe water-associated extremes such as floods and drought [17]. In South Africa, the Limpopo province experiences hot to very hot conditions during the austral summer season [18,19]. Extreme drought is a critical problem in the region affecting the agricultural sector due to high temperatures and unreliable rainfall [8,20]. This study is built on this factual background coupled with challenges and impacts of climate and weather extreme events in the Limpopo province.
Long-term data gained from historical extreme climate analysis provides a huge possibility for good management, forecasting and mitigation of climate extremes [7]. Extreme Value Theory is a powerful method to quantify the stochastic behaviour of low or unusual levels. Extreme value theory (EVT) has been widely used in various fields such as atmospheric science (e.g., [21]), hydrology (e.g., [22]), finance industry [23] and many other fields of application. The observational and statistical modelling results of the studies mentioned above have shown remarkable increases in the intensity of precipitation extremes.
This study aimed to employ Extreme Value Theory to model climate extreme events in the future using generalised extreme value distribution (GEVD) by using the maximum likelihood estimation method. Generalised extreme value distribution (GEVD) is the family of asymptotic distribution that describes the behaviour of extreme conditions. The GEVD consists of three extreme value distributions, namely: Gumbel, Fréchet and Weibull families, which are also referred to as type I, II and III extreme value distributions [24].
Chifurira and Chikobvu [25] fitted a GEVD to average yearly rainfall with an objective of modelling the upper tail of the rainfall distribution. The Gumbel class of distributions was found to fit the data well using the Anderson-Darling goodness of fit test. The GEVD with constant shape and scale parameters but varying location parameters over time were inadequate to model Zimbabwe's extreme maximum rainfall. The study indicated that a high mean annual rainfall of 1193 mm is expected in approximately 300 years ( [25]). A similar analysis to the present study in multivariate extreme value theory (MEVT) is that of [26], who used bivariate threshold excess in modelling temperature extremes in the Limpopo province for three meteorological stations Thohoyandou, Lephalale and Polokwane. Similar to the present study, the approach by [26] also used a penalised cubic smoothing spline to perform a nonlinear detrending of the temperature data before fitting bivariate threshold excess models to positive residuals above the threshold. The present study dealing with rainfall as the main parameter extends the approach of [26] by using a time-varying threshold instead of a constant threshold to capture the climate change effects in the monthly maximum rainfall data series. Recent studies on modelling extreme rainfall using extreme value theory and the r-largest order statistics considering model and return level uncertainty include those of [27][28][29], among others.
This study applies extreme value distribution to model maximum annual rainfall in Limpopo province. Results from this study can contribute vitally to the knowledge of EVT application to long-term rainfall data and recommendations to government agencies private organisations on extreme events and their negative impact on the economy. There are no studies available to the public domain in the science sphere that have modelled long-term yearly maximum rainfall in Limpopo province using EVT approaches applied in this study.
Various studies such as [30] discuss the modelling of the influence of temperature on average daily electricity demand in South Africa using a piecewise linear regression model and the generalised extreme value theory approach from 2000-2010. Severe weather conditions increase electricity demand because air-conditioned appliances are used in summer and heating systems are used in winter [6,30]. South Africa is also concerned about the impacts of extreme heat wave events on the public and how these events may change in the future [31,32]. The most robust approach in Extreme Value Theory is the choice of a threshold when using the POT approach. We also closely follow the work of [33,34].
Southworth et al. [34] provide a detailed computational approach of multivariate extreme value data conditional modelling using an R package called 'texmex'. In another study on threshold choice, Ref. [35] proposed a covariate-dependent threshold based on expectiles. They argued that although no threshold choice method is universally the best, strong arguments against the use of constant threshold is that the observation that may be considered extreme at some covariate level may not necessarily qualify as an extreme observation when considered at another covariate level. The present study use threshold stability plots. This is a graphical method that is widely used to determine the threshold. The idea of this plot is that the exceedances of a high threshold follow a GPD. The study by [36] used a GPD with time-varying covariates and thresholds to model daily peak electricity demand for South Africa. They used an intervals estimator method in declustering observations that exceed the threshold. Furthermore, the findings of [36] showed a better fit for the GPD model to the data compared to the generalised extreme value distribution (GEVD).
The main highlights of this study are as follows: The main contribution of this paper is to employ Extreme Value Theory to model climate extreme events using the r-largest order statistics. The knowledge and understanding of extreme climate events will help prepare and formulate mitigation strategies to cope with events associated with climate change. In this study, the interest was in deriving extreme maximum rainfall return levels from 1960 to 2020. The study combines two main approaches: bivariate condition extremes model [33,34,37] and time-varying threshold [36]. The rest of the paper is organised as follows: Section 2 presents the materials and methods. The empirical results are presented in Section 3. A discussion of the results is given in Section 4, while Section 5 concludes the paper.

Materials and Methods
Extreme Value Theory (EVT) is unique as a statistical discipline in that it develops techniques and models for describing the unusual rather than the usual [38]. EVT was used in the modelling of extreme rainfall. The analysis of extremes for a given data representing extremes was selected. EVT provides a tool for modelling the asymptotic distribution of a sequence of observations [39]. The best way to describe the behaviour of climate extreme events for a particular environment is to identify the distribution(s) suitable to fit the data. In this study, the generalised Pareto distribution (GPD) was used for estimating extreme return levels of historical monthly rainfall data from 1960 to 2020.

The r-Largest Order Statistics
The use of r-largest order statistics is usually used with limited data. This study is motivated by the desire to search for characterisation of extreme value behaviour other than using one observation in a block that would enable modelling observations in the upper tails of distributions. Such an approach is more efficient in its use of data.
Let X 1 , X 2 , ..., X n , be a sequence of independent and identically distributed (i.i.d.) random variables. Define M (k) n = kth largest of {X 1 , ..., X n }. If there exists a sequence of constants {a n > 0} and {b n > 0} such that: P M (r) n −b n a n ≤ z → G(z) as n → ∞ for some non degenerate distribution G, then, for fixed r, the limiting distribution as n → ∞ ofM (r) n −b n a n , ..., M (r) n −b n a n falls within the family having joint probability density function (for ξ = 0) [38]. (1) ; and x (k) : For the case r = 1, we have the GEVD model. When ξ −→ 0 usually written as ξ = 0, the joint density is given as: Equation (2) reduces to the Gumbel class of distributions when r = 1. Selection of the best value of r in this study is done using the automatic selection algorithm discussed in [40].

The Generalised Pareto Distribution
The Generalised Pareto Distribution (GPD) is a Peaks-over-threshold (POT) distribution that can be used to model the observations above a sufficiently high threshold [41]. The GPD has two parameters ξ, the shape parameter, and σ, the scale parameter.
The survival function of the GPD is given in Equation (3).
Equation (3) shows that when ξ < 0 the survival function of the GPD is bounded above by σ ξ . The return levels are estimated using Equation (4)

Threshold Selection and Declustering
In this paper, we use threshold stability plots. The procedure of declustering and then fitting the GPD to cluster maxima gives a valid statistical model whose underlying assumptions are met. However, the cluster maxima may not be of ultimate interest in practice. For example, rainfall information can be helpful if the assessment of flood damage is the ultimate goal. Here it may be more informative to analyse complete clusters and understand the aggregate rainfall over a rainy spell, rather than focus on the largest yearly value over that spell [42]. In this paper, the declustering approach proposed by [43] is used. This problem is inherently more difficult and requires a much more sophisticated solution; this paper does not attempt such.

Parameter Estimation
In this paper, we are going to use the MLE.

The Delta Method
Using the delta method the variance of x p is given as [38,44]: where V is the covariance matrix of μ,σ,ξ and which is evaluated at μ,σ,ξ . The approximate confidence interval of the flood heights x p is then given by

The Profile Likelihood Method
The profile likelihood for some parameter θ i is defined as [38]: where θ −i represents components of θ excluding θ i [38,44]. To obtain the confidence interval for x p a re-parametrisation is required in which x p is one of the parameters in the GEVD r model, given as follows: with y p = −log(1 − p).

Forecast Combination 2.4.1. Combining Estimated Return Levels and Prediction Intervals
The idea of combining forecasts was first developed by [45]. They argued that combined forecasts improve forecast over the single model forecast. Suppose the estimated return levels from GEVD r=1 , GEVD r>1 and GPD models are combined so that we have a vectorŷ In this study the estimated return levels will be combined using the simple average and median methods. The average method is given as: and the median method aŝ Robust prediction intervals (PIs) is known to be produced from combining prediction limits from various models ( [46][47][48]; among others). In this study we shall use the simple average and median methods for combining the prediction limits. The simple average method can be expressed as in Equation (13).
The median method is known to be less sensitive to outliers. This is given in Equation (14) L

Evaluation of Prediction Intervals
The models used in this study are only a simplification and approximation of the actual rainfall behaviour (patterns). The first index for estimating PI is the prediction interval width (PIW). It is estimated using lower and upper prediction limits and calculated as shown in Equation (15).
where U α (y t ) and L α (y t ), denote the upper and lower prediction limits respectively, and α is the nominal confidence. The quality of the PIs is evaluated using various indices such as the prediction interval coverage probability (PICP), prediction interval normalised average width (PINAW), among others. This study uses the PINAW. PINAW indicates the model's ability to capture the uncertainty information on the interval predictions. It evaluates the average width of the PIs and is given as where R is the range of the variable y t . A smaller PINAW means the PIs are more informative. The flow chart of the proposed modelling framework is given in Figure 1. South Africa is a semi-arid country, receiving annual rainfall of about 464 mm on average, compared to the global average of 860 mm. Large parts of South Africa receive rainfall during the austral summer season. However, the southwestern Cape receives predominantly winter rainfall, with all-year rainfall over the Cape south coast. This study focuses on the Limpopo province, located on the northeast of South Africa and neighbouring Zimbabwe, Mozambique and Botswana. The province falls within the summer-rainfall region (October to March) and thunderstorms are common during the day. Very little rainfall is received during the austral winter.
The southern part of Limpopo lies on the African plateau, while the north eastern Lowveld is well below 1000 m in the Limpopo River valley (see Figure 2). The elevated interior, the low-altitude coastal plain and mountain systems are the three basic landforms of southern Africa. The terrain enhances rainfall by causing the orographic uplift of warm moist air.

Rainfall
In this study, daily rain gauge observations from several stations in Limpopo were obtained from the South African Weather Service (SAWS) for the period 1960-2020. The observations are made once every 24 h at 6.00Z (8:00 am South Africa Standard Time). This study selected Thabazimbi station with at least 95% rainfall data from 1960 to 2020 over Limpopo province. Due to lack of long-term data from other meteorological stations, other stations were excluded in this study. The data is recorded uniformly in all stations as per the World Meteorological Organisation (WMO) guidelines. The observations are made simultaneously across the southern African region to allow inter-comparisons. Table 1 shows a summary of the information about the Thabazimbi weather station.  [49]. The Niño 3.4 Index is a measure of sea surface temperature anomalies in the eastern equatorial Pacific. The SOI indices were obtained from the archives of NCEP and were correlated with rainfall over the Limpopo province.

Indian Ocean Dipole
In addition to ENSO, Indian Ocean Dipole (IOD) is another phenomenon which allows for the interaction between the atmosphere and the sea and is referred to as the IOD [50]. Positive IOD, when the SSTs in the western Indian Ocean are warmer relative to the east, dominates the enhancement of rainfall over eastern Africa [51]. Negative IOD, when the western Indian Ocean is cooler relative to the east, is normally associated with wet conditions over the south-eastern part of southern Africa. IOD is partly responsible for driving climate variability of surrounding landmasses and is related to the El Niño Oscillation system [52]. Mondal and Mujumdar [53] used EVT to analyse characteristic changes in extreme rainfall in India using a high-resolution daily gridded dataset. Nonstationary distributions with varying parameters for physical covariates like ENSO-index, global average temperature and local mean temperatures were used to model intensity, duration and frequency of extreme rainfall over a high threshold. Intensity, duration and frequency were non-stationary and no spatially uniform pattern was found in their changes across India. Period of excessive rain was found to be stationary in most of the locations in India. In contrast, associations between frequency, intensity and local temperature changes were found to be non-stationary [53,54].

Results
In this section, we discuss the empirical results of the study.

Exploratory Data Analysis
For the case r = 1, we used annual blocks to extract the maximum monthly values each year. The data is then called the annual maximum rainfall (AMR). For the case r > 1, we used the automatic selection algorithm discussed in [40]. The monthly rainfall (MR) data is used with the peaks-over-threshold model, the generalised Pareto distribution (GPD). Table 2 shows summary statistics of both MR and AMR data sets. Over the sampling period the monthly maximum is the same as the annual maximum, which is the maximum rainfall over the sampling period.  Figure 3 shows a scatter plot of the monthly rainfall over the sampling period.
Initially, the data were tested for the existence of a monotonic trend using the Cox-Stuart (CS) trend test. Using the CS test, the p-value was 0.2492, implying no monotonic trend at the 5% level of significance. However, on using the Mann-Kendall (MK) trend test and the seasonal MK test, we failed to reject the null hypothesis and concluded that there is both a local trend (p-value = 0.0001784) and a seasonal global trend (p-value = 0.000002689). This was then followed by computing the magnitude of the trend based on Sen's slope test. The sample estimate of the slope was found to be −0.0041841 with a 95% confidence interval of (−0.0163, 0.0000). The seasonal Sen's slope was estimated as zero. The magnitudes of both the local trend slope and the seasonal slope are very small. The correlation between the rainfall data with Soi and IOD data was weak. As a result, these two variables were not included as covariates in the developed models. q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q

GEVD r Results
The tests were carried out on the annual maximum rainfall (AMR )data. Using the Mann-Kendall (MK) trend test, the p-value was 0.6452, meaning that the AMR data do not have a local trend at the 5% level of significance. These results are consistent with those from the CS trend test in which the p-value was found to be 0.4161.
The models considered in this study are: However, attention is limited to r ≤ 8 order statistics due to the reasonable doubt on the model's validity for all values of r ≥ 9. For r ≤ 8, the standard errors of the estimates (μ,σ andξ) decrease as the values of r increase, implying an increase in precision of the model. Figure 4 shows a return level plot for determining best value of r using the profile likelihood and delta methods.

GPD Results
We carried out a trend test on the cluster maxima data using the MK and CS tests. We failed to reject the null hypothesis that there is a local trend (p-value was 0.598) based on the MK test. These results are consistent with the CS trend test in which we got a p-value of 0.5637.
The model considered in this study is: The parameter estimates of the three models, GEVD r=1 , GEVD r=8 and GPD together with their standard errors in parentheses are given in Table 3.  Figure 5 shows threshold stability plots of the extremal index, scale parameter and the shape parameter, respectively. From panel (a), it appears that a threshold of 120 mm would be appropriate as raising the threshold further does not seem to significantly change the estimated extremal index of θ = 0.75. Figure 6 shows a plot of the exceedances above the threshold of 120 mm.   A scatter plot including a histogram and a box plot of the cluster maxima rainfall is given in Figure 7. The distribution of the cluster maxima is skewed to the right, as shown by both the histogram and box plot in the top right and bottom left panels of Figure 7. This suggests that rainfall above 200 mm in the Thabazimbi area is rare.  The original rainfall series has 732 observations. With a threshold of 120, the number of threshold exceedances was 94. Using the intervals estimator method proposed by [43], the extremal index was estimated as 0.7556, resulting in 64 clusters being identified. The average cluster size is 1 0.7556 = 1.323. This suggests that rainfall tends to be heavy on consecutive months, but very rainy spells tend not to last longer than 1 or 2 months.
The diagnostic plots shown in Figure 8 suggest that the GPD is a good fit to the declustered exceedances above the threshold of 120 mm. q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q Model Empirical Probability Plot q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 200 300 400 200 300 Model Empirical Quantile Plot q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq q q

Model Comparisons
Summary statistics for the prediction interval widths at the 95% confidence level for all the proposed individual models, including the forecast combination models, are given in Table 4. The GEVD r=8 (delta) model has the smallest standard deviation of the PIW. This suggests that this model has the narrowest PIW.  Table 5 shows a summary of the evaluation metrics for the prediction interval widths.  Figure 9 shows the box plots of the prediction interval widths (PIWs) of the individual models, including those combined using the Mean and Median methods of forecast combination. From Figure 9 the distribution of the PIW from the GPD model is skewed and seems to be too narrow. The PIW that appears to be the best with a fairly small PIW and whose distribution appears symmetrical is the one from the model GEVD r=1 whose prediction limits are estimated using the profile likelihood method.  Table 6 shows the estimated return levels together with the 95% confidence intervals using the delta method and the profile likelihood method for the GEVD r=8 model. The confidence intervals from the profile likelihood method are narrower than those from the profile likelihood method. The return level plots for the model GEVD r=8 with 95% prediction intervals estimated using the delta and profile likelihood methods are given in Figure 10.

Discussion
The current study was motivated by the work of [44,55], who used the r-largest order statistics in modelling extreme wind speed and estimation of maximum daily temperature, respectively. The results produced from this study were from the application of GEVD r for r = 1, 8 and the GPD models. The parameters of the models were estimated using the MLE method. Empirical results from the evaluation metrics for prediction intervals suggest that GEVD r=1 , which was based on the profile likelihood, produces prediction intervals with the smallest PINAW. Modelling of extreme maximum rainfall is important in the field of hydrology for decision making. The stakeholders can be informed of return levels and periods by modelling excessive maximum rainfall in the study area, Thabazimbi. This helps in decision making and alarms the community living around Thabazimbi and surrounding areas when they are likely to experience extreme, destructive rainfall.
In this study, the data were tested for the existence of a monotonic trend using the Cox-Stuart (CS) trend test. Using the CS test, the p-value was 0.2492, implying no monotonic trend at the 5% level of significance. However, upon using the Mann-Kendall (MK) trend test and the seasonal MK test, we failed to reject the null hypothesis and concluded that there is both a local trend (p-value = 0.0001784) and a seasonal global trend (p-value = 0.000002689). This was then followed by computing the magnitude of the trend based on Sen's slope test. The Weibull class of distributions is the best fitting model for the data in all the modelling frameworks. This implies that the distributions of extreme maximum rainfall are bounded above.
The study declustered the exceedances above a sufficiently high threshold before fitting the GPD model. It should be noted that, although the procedure of declustering and then fitting the GPD to cluster maxima gives a valid statistical model whose underlying assumptions are met, this may not be of ultimate interest in practice. For example, rainfall information can be helpful if the assessment of flood damage is the ultimate goal. Here it may be more informative to analyse complete clusters and understand the aggregate rainfall over a rainy spell rather than focus on the largest yearly value over that spell. The lack of long-term rainfall data for various stations in Limpopo province limits the other stations to be investigated in this study. The correlation between rainfall data with ocean atmospheric drivers such as SOI and IOD data was weak. As a result, these two variables were not included as covariates in the developed models in this study.
Empirical results from this study show that the prediction interval widths from the profile likelihood method are preferred to those from the delta method, as seen in Table 5. From Table 5 the delta method used on the model GEVD r=8 has the lowest PINAW value of 1.97 from the four models: (GEVD r=1 , delta), (GEVD r=1 , profile), (GEVD r=8 , delta) and (GEVD r=8 , profile). The PINAW values from the GPD, Mean and Median models are too narrow and do not capture the uncertainty in the return levels. Robust narrower prediction intervals are preferred and usable by decision-makers in hydrology at capturing uncertainty than those too wide. Our results are consistent with those of [56] who estimated extreme flood heights using the r-largest order statistics and modelled the uncertainty in the extreme quantiles of flood heights using the delta and the profile likelihood methods, respectively. Similar studies on the use of the r-largest order statistics are given in [28,38,40,57], among others.

Conclusions
The paper presented the r-largest order statistics modelling approach to modelling extremely high rainfall in the Thabazimbi area in the Limpopo province of South Africa. A comparative analysis was done with the generalised Pareto distribution. The study results suggest that the data follow a GEVD and do not deviate from assumptions. Diagnostic plots for the selected station, probability plot, quantile plot, return level plot and density plot provide solid evidence that GEVD is a good fit for the block maxima data. After the suitable model for data was chosen, the 50-year return level was estimated as 368 mm, which means a probability of 0.02 exceeding 368 mm in fifty years in the Thabazimbi area. This study helps decision-makers in government and non-profit organisations improve preparation strategies and build resilience in reducing disasters resulting from extreme weather events such as excessive rainfall. Future studies may consider covariates such as the influence of ocean-atmosphere interactions on the occurrence and intensity of extremes in the Limpopo.

Data Availability Statement:
The data used in this study are from South African Weather Services, website (https://www.weathersa.co.za/, accessed on 11 October 2021).

Acknowledgments:
The authors would like to acknowledge the South African Weather Services (SAWS) for providing the data. We are also thankful to the University of Venda where this study was carried out.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: