3.1. Discrete Approximation of the Environmental Transmission Rate
In this section, we focus on stable estimation of the human-to-human transmission rate, the environmental transmission rate, and the incidence reporting rate. The goal of our study is to investigate the main drivers behind the aggressive spread of cholera during the 1991–1997 cholera epidemic in Peru, and to assess the efficiency of the government response and the behavioral changes in the population. To that end, we investigated and compared three discretization strategies for the environmental transmission rate, which is responsible for the direct transmission of Vibrio cholerae: (1) pre-parameterized periodic transmission, (2) temperature-dependent transmission, and (3) data-driven time-dependent transmission. We concurrently estimated the two other important disease parameters, the human-to-human transmission rate and the reporting rate.
Peru is known for its diverse climate due to a combination of the cold current in the Pacific Ocean, the mountain climate of the Andean highlands, and the tropical temperatures in the Amazon jungle; the difference between summer (December to April) and winter (June to October) temperatures in each region may be significant. It has been observed that the emergence of cholera is strongly correlated with temperature fluctuations, with warmer temperatures leading to a higher number of cholera cases [27,28,31]. Hence, modeling the environmental transmission rate as a function of temperature, as in (7), can potentially help to reconstruct a more nuanced environmental transmission rate, which is crucial to a better understanding of cholera dynamics. In (7), the transmission rate depends on the standardized temperature, which is obtained from the original temperature measurement by subtracting the mean and dividing by the standard deviation of all temperature values in the data set, with any missing values omitted from these calculations. This transformation centers the temperature data around zero and scales it according to its variability.
Note that seasonality in cholera transmission is also driven by factors other than temperature. For example, cholera outbreaks often peak during (or after) heavy rainfalls due to increased contamination of water sources with cholera bacteria and possible damage to sanitation systems. Therefore, in some sense, using a general pre-parameterized periodic transmission rate (8) accounts for seasonality patterns more broadly. Model (8) is a sine function with a period of 52 weeks (the number of weeks in a year) and three parameters: one determines the amplitude of oscillations, and the other two control horizontal and vertical shifts, respectively. The addition of 1 to the sine function ensures that the transmission rate remains positive within the study window.
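As a rough illustration of such a pre-parameterized periodic rate, the sketch below uses one plausible functional form, amplitude * (1 + sin(2*pi*(t - h_shift)/52)) + v_shift. The exact expression, the parameter names, and the numerical values are assumptions and may differ from the formula used in the study.

```python
import math

WEEKS_PER_YEAR = 52  # period of the seasonal cycle, in weeks

def periodic_rate(t, amplitude, h_shift, v_shift):
    """Assumed form: amplitude * (1 + sin(2*pi*(t - h_shift)/52)) + v_shift."""
    phase = 2.0 * math.pi * (t - h_shift) / WEEKS_PER_YEAR
    # 1 + sin(...) lies in [0, 2], so the rate cannot drop below v_shift
    # when amplitude is non-negative.
    return amplitude * (1.0 + math.sin(phase)) + v_shift

# Hypothetical parameter values, evaluated weekly over two years.
rates = [periodic_rate(week, 0.02, 0.0, 0.001) for week in range(0, 105)]
```

With only three parameters, this form can represent a single smooth seasonal cycle, which is why it tends to average out sharper features in the data.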
Finally, approximating the environmental transmission rate as a linear or nonlinear combination of basis functions without imposing any pre-set behavior helps to account for triggers beyond climate, such as sanitation infrastructure, travel patterns, natural disasters, food preparation, and others. In particular, writing the transmission rate as a linear combination of Fourier or Legendre polynomials gives rise to the data-driven time-dependent model (9).
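A linear combination of Legendre polynomials can be evaluated as in the following sketch. The coefficient values, the window length, and the function names are hypothetical; in practice the coefficients would be estimated from data, and the scaling of time to [-1, 1] is an assumed convention.

```python
def legendre_values(x, n_terms):
    """Values P_0(x), ..., P_{n_terms-1}(x) via the three-term recurrence."""
    values = [1.0, x][:n_terms]
    for k in range(1, n_terms - 1):
        # (k + 1) P_{k+1}(x) = (2k + 1) x P_k(x) - k P_{k-1}(x)
        values.append(((2 * k + 1) * x * values[k] - k * values[k - 1]) / (k + 1))
    return values

def transmission_rate(t, coeffs, t_start, t_end):
    """Linear combination of Legendre polynomials on [t_start, t_end]."""
    x = 2.0 * (t - t_start) / (t_end - t_start) - 1.0  # map time to [-1, 1]
    basis = legendre_values(x, len(coeffs))
    return sum(c * p for c, p in zip(coeffs, basis))

coeffs = [0.05, 0.01, -0.02]  # hypothetical coefficients, not fitted values
beta_mid = transmission_rate(21.0, coeffs, 0.0, 42.0)  # midpoint of a 42-week window
```

Because the rate is linear in the coefficients, increasing the number of basis functions adds flexibility without changing the structure of the estimation problem.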
In 1989, shortly before the start of the 1991–1997 cholera epidemic, the Peruvian Field Epidemiology Training Program (FETP) was established in collaboration with the US Centers for Disease Control (CDC). Over the course of the 1991–1997 outbreak, the system covered nearly 6000 health centers in 25 different departments [20,21,22] (see Figure 1). Epidemiological surveillance included both laboratory-confirmed and suspected cases (that is, cases of acute watery diarrhea in patients older than five). The data are publicly available in the Figshare repository (https://doi.org/10.6084/m9.figshare.10005170.v1, accessed on 15 April 2025) [31,40,41]. For the study of temperature-dependent transmission, weekly temperature time series can be retrieved from the European Centre for Medium-Range Weather Forecasts’ ERA-Interim atmospheric reanalysis archive (https://www.ecmwf.int/en/forecasts/dataset/ecmwf-reanalysis-interim, accessed on 15 April 2025), covering the period from 1991 to 1997 [31]. This archive provides daily minimum, mean, and maximum temperatures for all 25 Peruvian departments, which we used to explore the relationship between case incidence and temperature.
3.2. Parameter Estimation for the Ayacucho Region
To examine the advantages and limitations of our proposed parameter estimation models for the environmental transmission rate, i.e., pre-parameterized periodic transmission (8), temperature-dependent transmission (7), and data-driven time-dependent transmission (9), we focused on the cholera epidemic in the inland region of Ayacucho during two periods: from 26 February 1991 to 17 December 1991, and from 4 June 1991 to 19 May 1992. The outbreak in Ayacucho was influenced by a combination of environmental and socioeconomic factors, with increased rainfall, flooding, and the lack of basic services all contributing to the severity of the cholera spread [42,43,44].
To solve the nonlinear least squares problem (8), we employed the built-in function ‘lsqcurvefit’ from the MATLAB Optimization Toolbox, which executes the Trust-Region-Reflective algorithm. At every step of the iterative process, we used ‘ode23s’ for the numerical approximation of state variables in the ODE system (1) since, for some epidemic scenarios, in the presence of two different transmission pathways, environmental and human-to-human, model (1) may easily be stiff. To quantify the uncertainty in the three estimated disease parameters (the human-to-human transmission rate, the environmental transmission rate, and the reporting rate), we refit the model to M additional data sets for incidence and cumulative cases, assuming a Poisson error structure. The resulting M best-fit parameter sets were used to estimate the mean values and the 95% confidence intervals for each of the three parameters and for the effective reproduction number (6).
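The uncertainty quantification step can be sketched as follows, using only the Python standard library rather than MATLAB. The best-fit curve, the placeholder refit function, and the Poisson sampler (exact for small means, a normal approximation for large ones) are all illustrative assumptions; in the study, each refit would solve the full least squares problem for the ODE model.

```python
import math
import random
import statistics

def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate using the standard library only."""
    if lam <= 0.0:
        return 0
    if lam < 30.0:
        # Knuth's multiplication method, adequate for small means.
        threshold, count, product = math.exp(-lam), 0, rng.random()
        while product > threshold:
            count += 1
            product *= rng.random()
        return count
    # Normal approximation for large means (approximate, not exact).
    return max(0, round(rng.gauss(lam, math.sqrt(lam))))

def bootstrap_ci(best_fit_curve, refit, n_sets, seed=0):
    """Refit to n_sets Poisson-perturbed curves; return mean and 95% CI."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_sets):
        synthetic = [poisson_draw(value, rng) for value in best_fit_curve]
        estimates.append(refit(synthetic))
    # With n=40, the first and last cut points are the 2.5% and 97.5% quantiles.
    cuts = statistics.quantiles(estimates, n=40)
    return statistics.mean(estimates), (cuts[0], cuts[-1])

def toy_refit(data):
    """Placeholder estimator standing in for the real model refit."""
    return sum(data) / len(data)

curve = [5.0, 12.0, 40.0, 25.0, 8.0]  # hypothetical best-fit weekly incidence
mean_est, (ci_low, ci_high) = bootstrap_ci(curve, toy_refit, n_sets=200)
```

Passing the refit step in as a function keeps the resampling logic independent of the particular transmission rate model being fitted.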
Figure 2 compares the fit to incidence and cumulative cases for the three transmission rate modeling approaches, (7)–(9), with parameters reconstructed from epidemic data for the Ayacucho region, February–December 1991. The incidence and cumulative curves generated from the pre-parameterized periodic (8) and temperature-dependent (7) transmission rates (left and middle columns, respectively) follow the reported data quite well, considering the limitations of the a priori assumptions enforced by these models. Both approximations, (7) and (8), merge the first two peaks into a single peak located between them. However, overall, they represent the general data trend correctly. As expected, the pre-parameterized periodic model (8) over-smoothed both the incidence and cumulative curves, yet the confidence intervals cover most of the data points in the reported sets. In contrast to the periodic transmission rate (8), the temperature-dependent transmission model (7) (middle column) shows greater sensitivity to data variations. This increased sensitivity to environmental factors produces trajectories that mimic data fluctuations more closely, but the model struggles with overall accuracy.
As Figure 2 illustrates, among all three discretization methods, the reconstructed data-driven time-dependent (9) transmission rate (right column) achieves the highest accuracy for both incidence and cumulative cases. By projecting the environmental transmission rate onto a finite-dimensional space with a sufficiently large number of basis functions (we used 30 basis functions in our simulations), one obtains a near-perfect fit with narrow confidence intervals. This demonstrates the effectiveness of the proposed methodology in reconstructing the complex dynamics of cholera transmission.
The 95% confidence intervals (CI) provide insights into the estimation uncertainty, with the time-dependent model (9) generating the narrowest intervals with the best data coverage, reinforcing its superior performance. For (9), the median curves follow the incidence and cumulative data closely, whereas visible misfits appear when models (7) and (8) are used to approximate the environmental transmission rate.
Figure 3 presents a comprehensive comparison of parameter estimation results across the three approximation models of environmental transmission in the Ayacucho region, February–December 1991. Each column corresponds to a different discretization method, and the rows represent different epidemiological parameters.
In the top row, the environmental transmission rates for models (7) and (8) exhibit similar behavioral patterns, with the temperature-dependent (7) transmission rate (middle column) showing low confidence and greater sensitivity to the observed data, for which temperature distribution was expected to serve as a proxy. The periodic (8) discretization (left column) gives rise to a smooth, low-amplitude curve with minimal fluctuations (ranging from approximately 0 to 0.04), demonstrating a simplified version of the environmental transmission rate. The temperature-dependent (7) approximation of the transmission rate (middle column) assumes a higher average value (approximately 0.05) with a wide CI (ranging from approximately 0 to 0.14), but it still changes rather slowly. In contrast, the data-driven time-dependent (9) transmission rate (right column) displays complex, oscillating behavior, with a significantly higher amplitude (up to 0.13). This complex pattern closely follows the reported incidence data presented in Figure 2, displaying this method’s ability to capture the intricate transmission dynamics driving the outbreak.
The second row illustrates the effective reproduction number, which provides crucial insights into disease transmissibility over time. The periodic method (8) leads to a relatively stable reproduction number, close to 1, with minor changes (between approximately 0.8 and 1.4) that appear over-smoothed (similar to its underlying transmission rate). The temperature-dependent approach (7) shows higher initial values of the reproduction number that rapidly decrease and then stabilize around 1 for most of the study period; the CI becomes narrower towards the right end of the interval. The reproduction number for the reconstructed time-dependent method (9) reveals the most complex pattern, with multiple peaks exceeding 1.5 that align with the incidence series in periodicity but not necessarily in height. The reproduction number based on (9) drops below 0.5 in July 1991, only to go back up and form another wave towards the end of the study period.
The third row presents the estimated human-to-human transmission rates,
, with their respective confidence intervals. When the periodic method (
8) is used for
approximation,
(95% CI:
). When using the temperature-dependent model (
7) for
, we obtain
(95% CI:
), and the data-driven time-dependent discretization (
9) for
gives rise to
(95% CI:
). While these estimates differ, all three methods suggest relatively low human-to-human transmission rates of order
, indicating that (indirect) human-to-human transmission plays a minor role compared to the transmission from the aquatic environment. The periodic method shows the narrowest histogram distribution with clear central values, and the temperature-dependent approach exhibits the widest uncertainty range (consistent with the uncertainty in
reconstruction).
The bottom row displays the reporting rate estimates,
, representing the proportion of reported cases. The periodic transmission rate method (
8) estimates
(95% CI:
), suggesting that only approximately 2.8 percent of all cases were reported. The temperature-dependent approach (7) leads to
(95% CI:
), and the data-driven time-dependent discretization (
9) corresponds to
(95% CI:
). These consistently low reporting rates across all methods indicate significant under-reporting, with the data-driven time-dependent method (9) suggesting the highest reporting rate at approximately 4.2%. This underscores the fact that most infected people have mild or no symptoms, and these cases remain unaccounted for.
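To make the scale of under-reporting concrete, the short sketch below converts a reporting rate into the number of infections implied per reported case. It assumes that the reporting rate is simply the fraction of all infections that appear in the surveillance data; the function name is hypothetical.

```python
def implied_total_cases(reported_cases, reporting_rate):
    """Total infections implied if only a fraction reporting_rate is reported."""
    if not 0.0 < reporting_rate <= 1.0:
        raise ValueError("reporting_rate must lie in (0, 1]")
    return reported_cases / reporting_rate

# Rates quoted in the text: about 2.8% (periodic) and 4.2% (data-driven).
per_report_periodic = implied_total_cases(1, 0.028)     # roughly 36 infections per report
per_report_data_driven = implied_total_cases(1, 0.042)  # roughly 24 infections per report
```

Under this reading, even the highest of the estimated reporting rates implies more than twenty infections for every case captured by surveillance.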
Figure 3 reveals that while the three methods for the approximation of
differ in how accurately they characterize the transmission dynamics from the environment, they reconstruct similar magnitudes for human-to-human transmission,
, (
) and the reporting rate,
, (
). Model (9) provides the most reliable methodology for the analysis of cholera transmission and of the effectiveness of the control and prevention measures put forward by the authorities.
Figure 4 displays the fit to incidence and cumulative cases in Ayacucho, June 1991–May 1992, associated with the three discretization strategies for the environmental transmission rate, which is responsible for the direct transmission of Vibrio cholerae from the environment: pre-parameterized periodic transmission (8), temperature-dependent transmission (7), and data-driven time-dependent transmission (9). The two other important disease parameters, the human-to-human transmission rate and the reporting rate, have also been estimated along with the environmental transmission rate.
The periodic transmission rate (8) method (left column) captures the overall disease dynamics rather accurately while skipping over the details. It averages the peaks in the second half of the window and underestimates the initial peak, yet it reproduces the general trend very well. For the June 1991–May 1992 time frame, as presented in Figure 4, the temperature-dependent (7) approximation method for the environmental transmission rate (middle column) demonstrates a less impressive data fit than for the February–December 1991 window shown in Figure 2. The reconstructed incidence curve struggles to capture any peaks beyond the initial one, predicting a relatively stable epidemic trend instead. A possible explanation is that temperature fluctuations were not the dominating factor in environmental transmission during this period, with more significant events, such as increased rainfall and flooding, contributing to cholera spread.
The data-driven time-dependent transmission rate (9) method (right column) continues to demonstrate excellent performance, achieving a near-perfect fit to both incidence and cumulative data across all waves. This consistency across different time windows indicates that, considering the numerous factors impacting cholera, the best transmission approximation is the one that is learnt from data, and it is superior to any pre-set behavior. The numerical study across the two time periods shows that while the pre-parameterized periodic approach (8) generally outperforms the temperature-dependent method (7), neither can match the accuracy of the data-driven time-dependent algorithm (9).
Figure 5 illustrates the comparison of parameter estimation results among the three modeling strategies for the June 1991–May 1992 period. Each column corresponds to a different method, while the rows represent different epidemiological parameters. In the upper row, the environmental transmission rate displays dissimilar patterns across the three methods. The periodic (8) transmission rate (left column) is a full sine wave with a narrow confidence interval, ranging from approximately 0.05 to 0.35, and it clearly demonstrates the limitations of its built-in behavior. The mean curve for the temperature-dependent (7) rate (middle column) is near-horizontal (ranging from approximately 0.02 to 0.04), with slight variations and large uncertainty (similar to the temperature-dependent rate in Figure 3 for the previous time interval).
The reconstructed data-driven time-dependent (9) transmission rate (right column) shows complex oscillatory behavior, with multiple peaks (ranging from approximately 0.01 to 0.15). As before, the trajectory for (9) correlates with the waves observed in the incidence data (see Figure 4), though the magnitude of the peaks is not consistent. The second row illustrates the time-dependent effective reproduction number associated with the estimated environmental and human-to-human transmission rates. The pre-parameterized method (8) produces a smooth trajectory, suggesting two consecutive epidemic waves, with values starting above 2.5, dropping below 1 around August 1991, rising above 1 again around October 1991 (reaching approximately 2), and finally declining toward the end of the period. The temperature-dependent transmission rate approach (7) gives rise to significantly different dynamics for the reproduction number, with values fluctuating near 1 throughout most of the study window and showing minimal variation. The data-driven time-dependent method (9) yields a fast-changing reproduction number, with values swinging between approximately 2.5 and almost 0.
The third row presents the estimated human-to-human transmission rates,
, with their respective confidence intervals. The periodic method (
8) estimates
(95% CI:
), the temperature-dependent approach (
7) yields
(95% CI:
), and the data-driven time-dependent discretization (
9) reconstructs
(95% CI:
). These estimates show greater variability compared to the first time period. In particular, the confidence interval for the temperature-dependent method (7) includes a few outliers (due to instability), with negative values that lack physical meaning. All three methods indicate relatively low rates of human-to-human transmission, which is in agreement with the findings from the previous data set.
The bottom row displays the reporting rate approximations,
, representing the proportion of reported cases. The periodic method (
8) estimates
(95% CI:
), that is, approximately 0.8 percent of cases are reported. This is substantially lower than the estimate for the first time period. The temperature-dependent approach (
7) gives
(95% CI:
), and data-driven time-dependent discretization (
9) estimates
(95% CI:
). Evidently, algorithm (9) yields the most accurate reporting rate histogram, with methods (7) and (8) showing the lower and upper bounds, respectively.
Overall, the above experiments illustrate that parameter estimation based on the periodic approximation (8) of the environmental transmission rate is slightly more reliable than the temperature-dependent method (7), since temperature fluctuations are an important factor, but not the only one, contributing to the seasonality of cholera spread. Yet, time-dependent discretization (9) offers the highest accuracy of the three methods regarding data fit (see Table 2 and Table 3). It leads to the most informed estimates of the effective reproduction number, which allows for monitoring (and adjusting) the impact of control and prevention measures. Estimates of the human-to-human transmission rate and the case reporting rate are relatively consistent (though, understandably, not identical) across all three discretization methods for the environmental transmission rate. The experiments indicate that the reporting rate is low (due to the prevalence of asymptomatic and mild cases that go largely under-reported) and that (indirect) human-to-human transmission is much less of a factor in cholera spread than (direct) cholera transmission from the environment (since cholera is unlikely to pass from person to person after a casual contact [14]).