1. Introduction
Stochastic weather generators are statistical tools used to simulate synthetic sequences of precipitation data that retain the essential statistical and temporal characteristics of observed rainfall. These models are particularly valuable in hydrology, ecology, climate impact assessment, and water resource management, where long-term precipitation data are required but often unavailable [
1]. For instance, synthetic daily weather sequences have been used as model inputs to investigate the impacts of climate variability on crop yields [
2]. The foundational work on stochastic precipitation generation dates back to the early 1980s, when Richardson (1981) [3] introduced one of the first daily rainfall generators based on Markov chains. This approach models rainfall occurrence as a two-state Markov process (wet or dry), capturing the persistence and transition probabilities of rain events. Richardson’s model paved the way for the further development of stochastic weather generators by coupling rainfall occurrence with amount distributions. Markov chain-based models have remained popular due to their simplicity and ability to represent temporal dependence [
4]. These models typically involve estimating transition probabilities between wet and dry states and fitting probability distributions (such as gamma or exponential) to rainfall amounts. However, single-site Markov models often struggle to capture longer-term climate variability. More sophisticated statistical and machine learning techniques have therefore been incorporated to improve the realism of simulated precipitation. For example, models using hidden Markov models (HMMs) [
5] and generalized linear models (GLMs) [
6] provide improved flexibility in capturing the complex behavior of precipitation occurrence and intensity. Furthermore, nonparametric approaches and weather generators based on neural networks and deep learning have emerged, though their adoption in operational hydrology remains limited due to interpretability and data requirements.
The primary objective of stochastic weather generators is to replicate the specific statistical characteristics of meteorological variables. These generators typically rely on parametric modeling [
3], resampling techniques [
7], or a combination of both methods [
8]. A notable parametric method utilizes generalized linear models (GLMs), facilitating the stochastic modeling of daily weather variables. Furrer and Katz (2007) [
9] applied a GLM-based stochastic weather generator to daily weather data from Pergamino, Argentina, a location characterized by a pronounced wet season. Parametric stochastic weather generators frequently underestimate the observed interannual variability of seasonally aggregated variables, a phenomenon referred to as overdispersion [
10,
11,
12]. To address this limitation, Kim et al. (2012) [
13] introduced smoothed seasonal total precipitation and seasonal mean minimum and maximum temperatures as covariates into a generalized linear model (GLM)-based weather generator. They applied locally weighted scatterplot smoothing (LOESS; [
14,
15]) to effectively mitigate overdispersion. Furthermore, the integration of seasonally aggregated climate statistics into the statistical downscaling of seasonal climate forecasts has yielded robust methodologies for generating weather sequences consistent with seasonal forecasts, significantly benefiting resource planning and management [
16]. Nevertheless, challenges remain regarding the selection of appropriate smoothing parameters and forecasting reliability due to the requirement of future seasonal aggregate covariates. Recently, Kim et al. (2017) [
17] included seasonal dry/wet indicators derived from hidden Markov model-based decodings of seasonal total precipitation as covariates in the GLM weather generator [
18]. Despite these advancements, the model still struggles to produce sufficient precipitation intensity during the wet season (July–September) (
Figure 1).
Stochastic precipitation generators are widely used for climate change impact studies [
1], hydrologic modeling, and risk assessment. However, challenges persist in accurately simulating extreme precipitation events and adapting to non-stationary climate conditions [
19]. Consequently, ongoing research focuses on improving extreme event modeling and integrating climate model outputs into stochastic generators. Gamma [
20] and log-normal distributions are commonly used to model rainfall data, although mixed distributions have gained popularity more recently. The gamma distribution, however, tends to perform poorly for heavy precipitation (high rainfall amounts occurring with low frequency), where high accuracy is essential. The generalized Pareto (GP) distribution is effective for such extreme rainfall events, yet it is less suitable when the frequency of light rain is high.
To overcome these limitations, spliced distributions, which combine two different distributions over separate supports, have become a focus of research. Hanum et al. (2015) [
21] demonstrated that a spliced distribution composed of a gamma distribution for the lower range and a Pareto distribution for the upper range better fits tropical heavy rainfall data from Jakarta, Indonesia, compared to single distributions such as gamma, Pareto, or GP alone (see also [
22]). However, spliced distributions typically lack continuity and differentiability at the threshold separating the two component distributions. To address this, hybrid distributions have been developed that ensure continuity at the threshold, although differentiability is still not guaranteed. Building on this, Kim et al. (2019) [
23] introduced a modified hybrid gamma–generalized Pareto distribution, which is a generalized form of the spliced distribution. This model uses a gamma distribution for the ‘head’ (lower values) and a generalized Pareto distribution for the ‘tail’ (extreme values). They analyzed the threshold conditions for this modified hybrid distribution, derived its negative log-likelihood function, and estimated parameters using the differential evolution algorithm for approximate maximum likelihood estimation. Recent methodological advances in stochastic weather generation highlight the value of non-stationary frameworks that represent extremes accurately. Nguyen et al. (2024) [24] developed a climate-informed non-stationary regional weather generator (nsRWG) that conditions precipitation distributions on circulation and temperature, successfully reproducing spatial and extreme characteristics for future climate scenarios. Guan et al. (2024) [
25] demonstrated the ability of nsRWG to capture heavy precipitation events across scales using extremity indices. Abbas et al. (2025) [
26] proposed a zero-inflated extended GP model that seamlessly models dry days, typical accumulation, and extremes, while Reulen and Mehrkanoon (2024) [
27] leveraged attention-based GANs for enhanced nowcasting of extreme precipitation.
Building on these contemporary directions, our study introduces the MHGGP–GLM framework, which further advances the stochastic modeling of precipitation intensity by ensuring smooth transitions between distributional regimes and addressing overdispersion issues in both frequent and extreme rainfall. The novelty of this study lies in the integration of a modified hybrid gamma–generalized Pareto (MHGGP) distribution into a GLM-based stochastic weather generator, enabling the simultaneous modeling of frequent light rainfall and rare extreme events within a unified framework. This approach effectively mitigates the long-standing problem of overdispersion by capturing interannual variability in wet-season precipitation more realistically. Unlike conventional spliced models, the proposed method preserves continuity and differentiability at the distribution threshold, ensuring stable parameter estimation and smoother simulation outputs. Empirical validation using a 51-year dataset from Seoul, Korea, confirms the model’s superior ability to reproduce rainfall statistics, highlighting its potential utility for hydrology, agriculture, and climate risk assessment.
Section 2 briefly reviews the foundational GLM approach for stochastic weather generation and details the mixture of gamma and generalized Pareto distributions with a threshold, highlighting its efficacy in producing realistic weather sequences.
Section 3 evaluates the model’s performance using daily weather data from Seoul, Korea, with particular attention to reducing overdispersion and accurately modeling precipitation intensity. Finally,
Section 4 provides a discussion of the findings and their implications.
2. GLM Precipitation Generator Using Modified Hybrid Gamma with GP Distribution
2.1. GLM Precipitation Generator
The GLM-based stochastic weather generator employed in this study draws upon the foundational structure originally proposed by [
3], which introduced the concept of using a two-part model to separately simulate precipitation occurrence and intensity. This foundational framework has since been extended and refined in numerous ways, most notably by [
9], who developed a stochastic weather generator grounded in generalized linear modeling (GLM) principles to capture the stochastic nature of daily weather events. In the present study, we adopt the GLM-based approach introduced by [
9] while implementing several modifications to suit the objectives of our analysis. A concise overview of the model structure is provided below for completeness; however, a more comprehensive treatment, including algorithmic implementation and parameter estimation procedures, can be found in the original work by [
9].
To simplify the interpretation of results, particularly those related to overdispersion in precipitation processes, we intentionally exclude large-scale climate drivers, specifically the El Niño–Southern Oscillation (ENSO), as covariates in our modeling framework. This represents a deliberate departure from the methodology adopted by [
9], who incorporated ENSO indices to account for interannual climate variability. By omitting ENSO-related terms, we focus on the intrinsic temporal structure and seasonal variability of local precipitation processes, thereby facilitating a clearer examination of the generator’s behavior in the absence of exogenous climate signals.
The modeling of daily precipitation within this GLM-based framework follows the methodology of [
28], who proposed a chain-dependent process to model precipitation occurrence, with the transition probabilities governed by seasonal functions. Specifically, let $J_t$ denote the binary precipitation indicator on day $t$, where

$$J_t = \begin{cases} 1, & \text{if precipitation occurs on day } t, \\ 0, & \text{otherwise.} \end{cases}$$

The temporal dependence between successive days is captured through a first-order Markov process, whereby the transition probability $p_{ij}(t) = P(J_t = j \mid J_{t-1} = i)$, defined as the probability that precipitation state $j$ occurs on day $t$ given that state $i$ occurred on day $t-1$, is modeled using a logistic regression. This transition probability is specified via a Binomial GLM with a logit link function. To capture the inherent seasonality in precipitation occurrence, the logistic model incorporates sinusoidal terms representing the annual cycle. The seasonal terms are defined as $C_t = \cos(2\pi t/365)$ and $S_t = \sin(2\pi t/365)$, reflecting periodic behavior with a one-year cycle.

The conditional probability of precipitation on day $t$, denoted $p_t = P(J_t = 1 \mid J_{t-1})$, is then modeled as

$$\operatorname{logit}(p_t) = \ln\!\left(\frac{p_t}{1 - p_t}\right) = b_0 + a J_{t-1} + b_1 C_t + b_2 S_t + c_1 J_{t-1} C_t + c_2 J_{t-1} S_t, \tag{1}$$

where $b_0$ represents the baseline log-odds of precipitation occurrence, and the coefficient $a$ quantifies the influence of the preceding day’s precipitation occurrence on the current day’s precipitation probability. The coefficients $b_1$ and $b_2$ jointly determine the amplitude and phase of the seasonal cycle. The interaction terms $J_{t-1} C_t$ and $J_{t-1} S_t$ allow for seasonal modulation of the autocorrelation in precipitation occurrence, thereby enabling the model to reflect differing seasonal dynamics in wet and dry spell persistence.
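The occurrence model described above can be sketched in a few lines of Python. This is a minimal illustration rather than the fitted generator: the function names, the coefficient names (`b0`, `a`, `b1`, `b2`, `c1`, `c2`), the numerical values, and the choice of a dry initial state are all assumptions made here for demonstration.

```python
import math
import random


def wet_probability(t, prev_wet, b0, a, b1, b2, c1, c2):
    """P(wet on day t | previous day's state) from a logit model with annual harmonics."""
    c = math.cos(2.0 * math.pi * t / 365.0)
    s = math.sin(2.0 * math.pi * t / 365.0)
    eta = b0 + a * prev_wet + b1 * c + b2 * s + c1 * prev_wet * c + c2 * prev_wet * s
    return 1.0 / (1.0 + math.exp(-eta))  # inverse logit


def simulate_occurrence(n_days, coef, seed=0):
    """Simulate a first-order Markov chain of wet (1) and dry (0) days."""
    rng = random.Random(seed)
    state = 0  # assumption: the chain starts from a dry day
    series = []
    for day in range(n_days):
        t = day % 365 + 1  # calendar day in a 365-day year
        p = wet_probability(t, state, **coef)
        state = 1 if rng.random() < p else 0
        series.append(state)
    return series


# Hypothetical coefficients, chosen only for illustration (not fitted values).
COEF = dict(b0=-1.2, a=1.0, b1=-0.5, b2=-0.3, c1=0.1, c2=0.1)
occ = simulate_occurrence(365 * 3, COEF, seed=42)
```

With a positive `a`, a wet day raises the probability that the next day is also wet, which is how the chain reproduces persistence in wet spells.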
For the precipitation intensity component, conditional on $J_t = 1$, we follow the approach of [28] in modeling daily precipitation amounts using a gamma distribution, which is commonly used due to its flexibility and positive support. The conditional mean precipitation intensity of the gamma distribution at time $t$, $\mu_t$, is modeled as a function of time through a sinusoidal formulation that captures the inherent annual cycle in precipitation. This is expressed as

$$\ln(\mu_t) = d_0 + d_1 C_t + d_2 S_t, \tag{2}$$

where $C_t = \cos(2\pi t/365)$ and $S_t = \sin(2\pi t/365)$ are the standard cosine and sine terms representing annual periodicity. Note that the scale and shape parameters of the gamma distribution can be reparameterized in terms of its mean and variance. The coefficients $d_1$ and $d_2$ jointly determine the amplitude and phase shift of the seasonal cycle, while $d_0$ represents the baseline log-mean intensity. This formulation allows the mean of the gamma distribution to vary seasonally and thus to reflect realistic seasonal patterns in precipitation intensity.

Despite its empirical utility, this method tends to underestimate precipitation intensity during peak wet periods, which in turn leads to a systematic underestimation of the interannual variability of aggregated precipitation indices, such as seasonal or annual totals. This limitation is well documented in the literature and highlights an ongoing challenge in accurately simulating the full distribution of daily precipitation, particularly in climates characterized by highly variable wet seasons. In summary, the GLM-based precipitation generator adopted here offers a flexible and statistically rigorous framework for simulating daily precipitation processes that captures both temporal dependence and seasonality. However, it involves inherent trade-offs in the representation of extreme events and aggregated variability, so further refinement or post-processing is needed to ensure realistic simulation outcomes across a range of temporal scales.
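The reparameterization mentioned above can be illustrated with a short sketch: for a gamma distribution with shape α and scale β the mean is αβ, so a seasonally varying mean with a fixed shape implies a seasonally varying scale. The function names, the coefficient names (`d0`, `d1`, `d2`), and all numerical values below are hypothetical.

```python
import math
import random


def mean_intensity(t, d0, d1, d2):
    """Seasonal mean intensity from a log-linear model with annual harmonics."""
    angle = 2.0 * math.pi * t / 365.0
    return math.exp(d0 + d1 * math.cos(angle) + d2 * math.sin(angle))


def gamma_scale(t, shape, d0, d1, d2):
    """Scale implied by mean = shape * scale, with the shape held fixed."""
    return mean_intensity(t, d0, d1, d2) / shape


def sample_intensity(t, shape, d0, d1, d2, rng):
    """Draw one wet-day amount; random.gammavariate takes (shape, scale)."""
    return rng.gammavariate(shape, gamma_scale(t, shape, d0, d1, d2))


# Hypothetical values for illustration only (not estimates from the study).
SHAPE, D0, D1, D2 = 0.7, 2.0, -0.4, -0.2
rng = random.Random(1)
draws = [sample_intensity(200, SHAPE, D0, D1, D2, rng) for _ in range(20000)]
```

Averaging `draws` should recover approximately `mean_intensity(200, D0, D1, D2)`, which is a quick way to confirm that the shape and scale convention matches the intended mean.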
2.2. Modified Hybrid Gamma with GP Distribution
In the modeling of daily precipitation intensity, especially within the context of stochastic weather generators and hydrological simulations, the selection of an appropriate probability distribution is critical for accurately capturing the statistical characteristics of observed rainfall. The gamma distribution is widely employed for this purpose, particularly in the simulation of frequent, low to moderate rainfall events. Its mathematical properties, including a relatively thin and exponentially decaying right tail, make it well-suited for representing precipitation regimes in which light rainfall is dominant. This suitability arises from the gamma distribution’s flexibility in accommodating positively skewed data, while ensuring a non-negative support consistent with the physical nature of precipitation. However, despite these advantages, the gamma distribution often exhibits limitations when applied to datasets that contain a substantial number of heavy rainfall events. In such cases, the empirical distribution of precipitation intensities frequently displays a long right tail—a characteristic that the gamma distribution fails to accommodate adequately. This mismatch can result in poor goodness-of-fit metrics, particularly in the upper quantiles of the distribution, thereby impairing the model’s ability to accurately represent extreme precipitation events that are of significant interest in risk assessment, climate change impact studies, and water resource planning.
To better capture the statistical properties of heavy rainfall events, the generalized Pareto (GP) distribution is often employed. The GP distribution is well-known for its heavy-tailed nature, making it particularly effective for modeling exceedances over high thresholds as commonly encountered in the analysis of extremes. Its theoretical basis lies in extreme value theory, where it arises as the limiting distribution of scaled excesses above a specified threshold. This makes it a natural candidate for modeling the tail behavior of precipitation data. Nonetheless, while the GP distribution can provide a superior fit for high-intensity precipitation events, it often performs poorly for low and moderate intensities, where the tail is relatively thin. Furthermore, the application of the GP distribution typically necessitates the truncation of data below a predefined threshold, which can result in the loss of valuable information contained in the bulk of the distribution. This trade-off between tail accuracy and data completeness introduces practical challenges, particularly when attempting to construct a unified model capable of capturing the full spectrum of precipitation intensities.
Given these limitations, several studies have explored the use of hybrid or mixture distributions to leverage the respective strengths of the gamma and GP distributions. In particular, approaches that combine a gamma distribution for the lower and moderate ranges of precipitation intensity with a GP distribution for the upper tail have shown promise in producing synthetic precipitation sequences that are both realistic and statistically consistent with observed data. Such composite models provide enhanced flexibility and accuracy across a broader range of precipitation values, allowing for improved representation of both common and extreme events. Nevertheless, a commonly cited drawback of conventional mixture models lies in the potential discontinuity or non-differentiability at the threshold separating the two component distributions. This discontinuity can introduce artifacts in simulation output and complicate both parameter estimation and model interpretation. In response to these concerns, recent methodological advancements have proposed modified hybrid distributions that ensure smooth transitions at the threshold point. Specifically, Kim et al. (2019) [
23] introduced a refined hybrid model in which the gamma distribution governs the lower segment of the intensity spectrum, while the GP distribution governs the upper segment, with careful parameterization to maintain continuity and differentiability at the junction. This construction not only preserves the statistical integrity of the model across the full range of precipitation intensities but also facilitates more robust simulation of both frequent and extreme rainfall events. By adopting this modified hybrid approach, the present study aims to overcome the shortcomings of traditional single-distribution and discontinuous mixture models. In doing so, we enhance the realism and reliability of synthetic daily precipitation generation, thereby improving the utility of stochastic weather generators for applications in climate modeling, hydrological impact analysis, and infrastructure design under conditions of climatic variability and extremes.
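A spliced distribution built from probability density functions $f_1, \ldots, f_n$ is a probability density function $f$ constructed by joining these components over disjoint intervals [29, 30]. As a special case, the spliced distribution built from the probability density functions $f_1$ and $f_2$ combines the head of $f_1$ with the tail of $f_2$ and can be written as follows:

$$f(x) = \begin{cases} w_1 f_1^*(x), & x \le \theta, \\ w_2 f_2^*(x), & x > \theta, \end{cases}$$

where $w_1$ and $w_2$ are mixing weights greater than 0 that satisfy $w_1 + w_2 = 1$. The components $f_1^*$ and $f_2^*$ are formed as follows:

$$f_1^*(x) = \frac{f_1(x)}{F_1(\theta)}, \qquad f_2^*(x) = \frac{f_2(x)}{1 - F_2(\theta)},$$

where $F_1$ and $F_2$ are the cumulative distribution functions of $f_1$ and $f_2$, respectively, and $\theta$ is the range limit of the domain, which is also regarded as one of the model parameters.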
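As a quick check on this construction, the spliced density integrates to one whenever the two weights sum to one. The notation below is assumed here for illustration: $f_1$ and $f_2$ are the head and tail densities, $F_1$ and $F_2$ their cumulative distribution functions, $\theta$ the range limit, and $w_1$, $w_2$ the mixing weights, with the head truncated above at $\theta$ and the tail truncated below at $\theta$.

```latex
\begin{align*}
\int f(x)\,dx
  &= w_1 \int_{-\infty}^{\theta} \frac{f_1(x)}{F_1(\theta)}\,dx
   + w_2 \int_{\theta}^{\infty} \frac{f_2(x)}{1 - F_2(\theta)}\,dx \\
  &= w_1\,\frac{F_1(\theta)}{F_1(\theta)}
   + w_2\,\frac{1 - F_2(\theta)}{1 - F_2(\theta)}
   = w_1 + w_2 = 1 .
\end{align*}
```

The weights therefore control only how much probability mass lies on each side of the range limit, not the shape of either component.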
Suppose that the head, $f_1$, is the probability density function of a gamma distribution with shape parameter $\alpha$ and scale parameter $\beta$, and the tail, $f_2$, is the probability density function of a GP distribution with location $\theta$, scale parameter $\sigma$, and shape parameter $\xi$. However, the probability density function of the spliced distribution suggested by [29] is generally not continuous. Additional conditions are therefore required for it to be continuous and differentiable [
31]. Under both the continuity and differentiability of probability density function
, the threshold
satisfying
is indicated by
where
. That is, threshold
of the modified hybrid gamma and generalized Pareto distribution only depends on the parameters of the gamma distribution under condition
.
The probability density function (PDF) of this modified hybrid gamma–generalized Pareto (MHGGP) distribution is expressed as
and is denoted as MHGGP
. Under the continuous condition, the positive mixing weights of MHGGP distribution are
and
, respectively. In addition, the
value for mixing weights can be expressed as follows:
where
is the lower incomplete gamma function. Note that
is determined by parameters of both generalized Pareto distribution and gamma distribution.
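To make the role of the continuity condition concrete, the following sketch builds a gamma head and GP tail and chooses the weights so that the two pieces meet at the threshold. It is a simplified illustration under stated assumptions, not the MHGGP density itself: the GP location is taken to equal the threshold `theta`, only continuity is imposed (the differentiability condition discussed above is not), `xi > 0` and `x > 0` are assumed, and all names and parameter values are hypothetical.

```python
import math


def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma P(a, x) via its power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, 500):
        term *= x / (a + n)
        total += term
        if term < 1e-16 * total:
            break
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def gamma_pdf(x, alpha, beta):
    """Gamma density with shape alpha and scale beta, for x > 0."""
    return math.exp((alpha - 1.0) * math.log(x) - x / beta
                    - math.lgamma(alpha) - alpha * math.log(beta))


def gp_pdf(x, theta, sigma, xi):
    """Generalized Pareto density with location theta (assumes xi > 0, x >= theta)."""
    z = (x - theta) / sigma
    return (1.0 / sigma) * (1.0 + xi * z) ** (-1.0 / xi - 1.0)


def continuity_weights(alpha, beta, theta, sigma):
    """Weights (w1, w2) with w1 + w2 = 1 that make the spliced density continuous."""
    # Continuity at theta: w1 * f1(theta) / F1(theta) = w2 / sigma.
    ratio = sigma * gamma_pdf(theta, alpha, beta) / reg_lower_gamma(alpha, theta / beta)
    return 1.0 / (1.0 + ratio), ratio / (1.0 + ratio)


def hybrid_pdf(x, alpha, beta, theta, sigma, xi):
    """Continuous gamma-head / GP-tail spliced density (illustrative only)."""
    w1, w2 = continuity_weights(alpha, beta, theta, sigma)
    if x <= theta:
        return w1 * gamma_pdf(x, alpha, beta) / reg_lower_gamma(alpha, theta / beta)
    return w2 * gp_pdf(x, theta, sigma, xi)


# Hypothetical parameter values for illustration only.
ALPHA, BETA, THETA, SIGMA, XI = 0.8, 12.0, 30.0, 20.0, 0.2
```

In this sketch the weight ratio depends on the GP scale and on the gamma parameters through the gamma density and its lower incomplete gamma function at the threshold, which mirrors the dependence noted in the text.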
2.3. Statistical Analysis
Formally, the model employed for precipitation occurrence in this study is identical in structure to the basic model previously described, retaining the same first-order Markov chain and generalized linear modeling (GLM) approach to simulate the binary occurrence of daily precipitation. However, in modeling precipitation intensity, we depart from the conventional framework by incorporating a more flexible distributional form—specifically, a mixture of the gamma distribution and the generalized Pareto (GP) distribution—instead of relying solely on the gamma distribution as traditionally done. This hybrid model, denoted as MHGGP (modified hybrid gamma–generalized Pareto), is designed to better capture the dual behavior of precipitation intensities, which often exhibit light to moderate values frequently but also include sporadic, high-intensity events that lie in the tail of the distribution. We estimate the parameters of the modified hybrid gamma–generalized Pareto (GP) distribution using maximum likelihood estimation (MLE). Based on the general form of the likelihood function for the spliced distribution,
the log-likelihood function for parameters (
) is expressed as follows:
where
,
,
, and
is the cumulative distribution function of the gamma distribution. Since obtaining explicit maximum likelihood estimators (MLEs) for the parameters (
) by directly maximizing the log-likelihood function
is infeasible, we employ the differential evolution (DE) algorithm for global optimization. The DE algorithm is particularly advantageous because it does not require the objective function to be differentiable, a condition typically needed by classical gradient-based optimization methods. This makes DE suitable for handling non-differentiable optimization problems, multiple local minima, and complex nonlinearities.
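The idea behind differential evolution can be sketched with a small, self-contained implementation. The code below is a generic DE/rand/1/bin variant written for illustration, applied to a plain gamma likelihood as a stand-in objective. It is not the MHGGP log-likelihood and not the settings used in this study, and the population size, mutation factor `f`, crossover rate `cr`, bounds, and synthetic data are all assumptions.

```python
import math
import random


def gamma_nll(params, n, sum_x, sum_log_x):
    """Negative log-likelihood of a gamma(shape, scale) sample via sufficient statistics."""
    alpha, beta = params
    return (n * (math.lgamma(alpha) + alpha * math.log(beta))
            - (alpha - 1.0) * sum_log_x + sum_x / beta)


def differential_evolution(objective, bounds, pop_size=20, generations=150,
                           f=0.7, cr=0.9, seed=0):
    """Minimal DE/rand/1/bin minimizer; needs only objective values, no gradients."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    cost = [objective(p) for p in pop]
    for _ in range(generations):
        for i in range(pop_size):
            others = [j for j in range(pop_size) if j != i]
            r1, r2, r3 = rng.sample(others, 3)
            forced = rng.randrange(dim)  # at least one coordinate is mutated
            trial = []
            for k, (lo, hi) in enumerate(bounds):
                if rng.random() < cr or k == forced:
                    v = pop[r1][k] + f * (pop[r2][k] - pop[r3][k])
                    v = min(max(v, lo), hi)  # clip to the search box
                else:
                    v = pop[i][k]
                trial.append(v)
            c = objective(trial)
            if c <= cost[i]:  # greedy selection
                pop[i], cost[i] = trial, c
    best = min(range(pop_size), key=cost.__getitem__)
    return pop[best], cost[best]


# Synthetic data from a gamma distribution with shape 0.8 and scale 10 (illustrative).
data_rng = random.Random(1)
data = [data_rng.gammavariate(0.8, 10.0) for _ in range(2000)]
N, SUM_X = len(data), sum(data)
SUM_LOG_X = sum(math.log(x) for x in data)


def objective(p):
    return gamma_nll(p, N, SUM_X, SUM_LOG_X)


best, best_cost = differential_evolution(objective, [(0.1, 5.0), (0.5, 50.0)], seed=2)
```

Because the search uses only comparisons of objective values, the same loop can in principle be pointed at a piecewise likelihood whose derivative is unavailable or discontinuous.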
A synthetic sequence of daily precipitation time series is generated as follows. First, models (1) and (2) are calibrated using data from the entire study period. To ensure parameter stability and interpretability, the shape parameter of the gamma distribution is estimated globally and treated as time-invariant. In contrast, the scale parameter is allowed to vary with time to account for seasonal variations in precipitation intensity. Specifically, the conditional mean precipitation intensity at time $t$, $\mu_t$, is reparameterized based on the conditional mean of non-zero precipitation intensities for each calendar day $t$. Conditional on precipitation occurring ($J_t = 1$), precipitation during the non-wet season is generated from the fitted gamma distribution, whereas precipitation during the wet season is simulated using the modified hybrid gamma–generalized Pareto distribution. This hybrid framework is intended to overcome the limitations of single-distribution approaches by simultaneously capturing both the bulk and tail behavior of daily precipitation. The threshold $\theta$, which defines the point of transition from the gamma to the generalized Pareto (GP) distribution, is specified as a function of the gamma distribution parameters. Observations exceeding this threshold ($x > \theta$) are classified as exceedances, and only these data are used to estimate the GP parameters $(\sigma, \xi)$, ensuring an accurate representation of tail behavior.
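The generation step for a single day can be sketched as follows. This is a hedged illustration: the head weight `w_head` is passed in rather than derived, the GP location is assumed to equal the threshold `theta`, `xi > 0` is assumed, the wet-season flag is supplied by the caller, and all names and values are hypothetical.

```python
import random


def sample_hybrid(rng, alpha, beta, theta, sigma, xi, w_head):
    """Draw from a gamma head (truncated at theta) or a GP tail above theta."""
    if rng.random() < w_head:
        while True:  # rejection sampling from the gamma restricted to (0, theta]
            x = rng.gammavariate(alpha, beta)
            if x <= theta:
                return x
    u = rng.random()  # inverse-CDF draw from the generalized Pareto tail
    return theta + sigma / xi * ((1.0 - u) ** (-xi) - 1.0)


def simulate_day(rng, wet, in_wet_season, alpha, beta_t, theta, sigma, xi, w_head):
    """Amount for one day: 0 if dry, hybrid draw in the wet season, gamma otherwise."""
    if not wet:
        return 0.0
    if in_wet_season:
        return sample_hybrid(rng, alpha, beta_t, theta, sigma, xi, w_head)
    return rng.gammavariate(alpha, beta_t)


# Hypothetical parameter values for illustration only.
rng = random.Random(3)
amount = simulate_day(rng, True, True, 0.8, 12.0, 30.0, 20.0, 0.2, 0.9)
```

Combining this function with a simulated wet/dry sequence, one day at a time, yields a full synthetic daily series.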
3. Real Data Analysis
Our statistical approach is based on linking long-term (interannual) predictor variables with short-term (daily) predictands. However, the use of observed seasonal climate statistics derived from the gamma distribution may introduce substantial noise into the daily weather data, potentially leading to overdispersion in the aggregated climate statistics. This study analyzes daily precipitation data for Seoul obtained from the Korea Meteorological Administration (KMA). The dataset covers a 51-year period, from 1961 to 2011, with data from February 29 of leap years excluded to maintain consistency. Analysis reveals a pronounced annual cycle in Seoul’s precipitation, with a notable peak occurring during late spring and summer, and a clear minimum observed throughout the winter months. Descriptive statistics summarizing Seoul’s rainfall data from 1961 to 2011 are presented in
Table 1.
The 51-year rainfall dataset is positively skewed, as reflected in the large differences between the maximum value and both the first and third quartiles (see
Figure 2). The data exhibit numerous instances of low rainfall, with relatively few occurrences of large rainfall events grouped together. In this study, we estimate the parameters of the modified hybrid gamma–generalized Pareto (GP) distribution using maximum likelihood estimation (MLE).
Table 2 summarizes the approximated MLEs for the modified hybrid gamma–generalized Pareto distribution based on Seoul’s rainfall data from 1961 to 2011. In particular, the modified hybrid gamma–generalized Pareto distribution provides a better fit to the observed rainfall data in Seoul during the wet season.
It should be noted that the shape parameter of the gamma distribution, $\alpha$, is estimated using the entire 51-year dataset and is subsequently applied uniformly across the entire study period. In contrast, the scale parameter, $\beta_t$, of the gamma distribution is derived from the conditional mean precipitation intensity, allowing it to vary temporally. Since the threshold $\theta$ depends solely on the gamma distribution parameters, the parameters of the generalized Pareto distribution are estimated exclusively from observations exceeding this threshold.
Table 3 provides the estimated coefficients and their corresponding standard errors for each component of the proposed stochastic precipitation generator for Seoul. It is noteworthy that the precipitation models employ the daily mean precipitation rate as a covariate rather than total precipitation. The results indicate that all covariate categories are statistically significant except for the interaction terms in the precipitation occurrence model. The synthetic sequence of daily precipitation time series is generated by the estimated models over the same 51-year period for which observational data are available, and aggregated statistics are computed. This simulation process is repeated 500 times using the proposed model to evaluate its performance. From these 500 realizations, key statistical features are derived, and the ability of the precipitation generator to reproduce selected daily statistics is assessed.
Figure 3 demonstrates the effectiveness of the proposed model in reproducing the variance of annual and summer total precipitation, depicted through boxplots of standard deviations (SD) computed from the aggregated statistics of the 500 simulation runs. Specifically,
Figure 3 presents the minimum, lower quartile, median, upper quartile, and maximum SD of these aggregated statistics alongside their observed counterparts derived from the historical 51-year dataset. Boxplots effectively illustrate the variability range of simulated statistics and allow for direct comparison with the observed historical data. The proposed model significantly reduces overdispersion, particularly for annual and summer total precipitation. In winter, precipitation variability is naturally lower, allowing even the original model to reproduce winter variability effectively.
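The comparison underlying this kind of boxplot can be sketched as follows: for each simulated run, compute the standard deviation of the seasonal totals across years and compare it with the observed value. The function names and the data layout (one list of daily values per year, one list of years per run) are assumptions made for illustration.

```python
import statistics


def seasonal_totals(daily_by_year, first_day, last_day):
    """Total precipitation per year over calendar days first_day..last_day (1-based)."""
    return [sum(year[first_day - 1:last_day]) for year in daily_by_year]


def sd_of_totals(daily_by_year, first_day, last_day):
    """Interannual standard deviation of the seasonal totals."""
    return statistics.stdev(seasonal_totals(daily_by_year, first_day, last_day))


def fraction_below_observed(simulated_runs, observed, first_day, last_day):
    """Share of simulated runs whose SD falls below the observed SD."""
    obs_sd = sd_of_totals(observed, first_day, last_day)
    sim_sds = [sd_of_totals(run, first_day, last_day) for run in simulated_runs]
    return sum(sd < obs_sd for sd in sim_sds) / len(sim_sds)


# Tiny toy example: three "years" of three "days" each.
toy = [[1, 2, 3], [2, 2, 2], [0, 0, 9]]
toy_sd = sd_of_totals(toy, 1, 3)
```

A fraction close to one indicates that nearly all simulations are less variable than the observations, which is the signature of overdispersion. For a 365-day year, days 182 to 273 would correspond to July through September.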
To further evaluate the performance of the proposed GLM weather generator, various useful daily statistics are considered.
Figure 4 compares the observed and simulated distributions of dry spells during summer in Seoul across the 51-year study period. Additionally,
Figure 5, Figure 6, and Figure 7 illustrate temporal variations in transition probabilities ($p_{01}$ and $p_{11}$), unconditional probability of rain, first-order autocorrelation coefficients, and the mean and SD of daily precipitation intensity, respectively. The mean curves generated by the proposed stochastic precipitation model closely match the observed daily statistics. Results reveal prominent mid-summer maxima in transition probability (
), unconditional rain probability, and the mean and SD of precipitation intensity, which the proposed model effectively captures. Importantly, the proposed method adequately generates precipitation intensity during wet seasons, thereby addressing the tendency of previous models to underestimate the observed interannual variance of seasonally aggregated variables during summer.
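Two of the daily statistics discussed above, dry-spell lengths and empirical transition probabilities, can be computed from a wet/dry sequence with a short sketch. The function names are hypothetical, and the labels `p01` (dry to wet) and `p11` (wet to wet) follow a common convention that is assumed here.

```python
def dry_spell_lengths(occurrence):
    """Lengths of consecutive runs of dry days (0) in a wet/dry sequence."""
    spells, run = [], 0
    for wet in occurrence:
        if wet:
            if run:
                spells.append(run)
            run = 0
        else:
            run += 1
    if run:  # close a dry spell that reaches the end of the record
        spells.append(run)
    return spells


def transition_probabilities(occurrence):
    """Empirical (p01, p11): probability of a wet day after a dry / wet day."""
    from_dry = [b for a, b in zip(occurrence, occurrence[1:]) if a == 0]
    from_wet = [b for a, b in zip(occurrence, occurrence[1:]) if a == 1]
    p01 = sum(from_dry) / len(from_dry) if from_dry else float("nan")
    p11 = sum(from_wet) / len(from_wet) if from_wet else float("nan")
    return p01, p11


example = [0, 0, 1, 0, 1, 1, 0, 0, 0]
spells = dry_spell_lengths(example)
p01, p11 = transition_probabilities(example)
```

Applying the same functions to observed and simulated sequences, restricted to a given season or calendar window, gives directly comparable summaries.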
4. Conclusions
This study proposed and evaluated a GLM-based stochastic weather generator enhanced with a modified hybrid gamma–generalized Pareto (MHGGP) distribution to overcome limitations in simulating precipitation intensity, especially during the wet season. The results demonstrated that the conventional gamma-only framework systematically underestimates precipitation variability, particularly during wet seasons, leading to pronounced overdispersion. By introducing the MHGGP distribution, the generator effectively reproduced both the bulk distribution of light-to-moderate rainfall and the heavy-tailed nature of extreme precipitation events, thereby providing a more robust probabilistic representation of daily rainfall sequences. The ability of the enhanced generator to reproduce realistic seasonal statistics (such as variance in annual and summer rainfall, distribution of dry spells, and intensity of extreme wet events) positions it as a valuable tool for diverse applications in climate-sensitive sectors. In particular, the model’s demonstrated capacity to mitigate overdispersion makes it suitable for integration into broader modeling frameworks, including crop yield forecasting, flood risk assessment, and infrastructure design under climate uncertainty.
Nonetheless, some challenges remain. The current study relied on long-term precipitation data from a single urban site (Seoul), which, while illustrative, limits immediate generalization to other climatic regimes. Further work is warranted to test the robustness of the approach across different geographic contexts and to explore its adaptability to regions with distinct precipitation dynamics. In addition, incorporating larger-scale climatic drivers (e.g., ENSO or monsoon indices) or coupling with regional climate model outputs could enhance predictive power, particularly for seasonal forecasting applications. Future research will also involve comparative analyses of the proposed MHGGP–GLM framework against other statistical distributions that have been widely applied in hydrological problems and extend the model to multisite or spatially explicit settings. In particular, distributions such as the TCEV distribution (designed for hydrological extremes), the five-parameter Lambda distribution, and the Wakeby distribution (constructed from two Pareto components) have demonstrated strong performance in modeling precipitation extremes and flood frequency analysis [
32,
33]. Conducting systematic comparisons with these established models will provide a more objective evaluation of the strengths and limitations of the MHGGP–GLM framework and further enhance its applicability to hydrological risk assessment.
In conclusion, the modified hybrid gamma–generalized Pareto distribution offers a flexible and powerful extension to GLM-based weather generators. By better capturing the dual behavior of precipitation intensities, this approach reduces long-standing issues of overdispersion while ensuring realistic simulation of extremes. As climate variability and extremes become increasingly central to planning and risk management, the proposed framework represents a meaningful step toward more reliable stochastic weather generation. Continued development and rigorous validation of this framework will further solidify its practical utility as a dependable tool for generating realistic daily precipitation scenarios aligned with evolving climatic conditions.