Compound Climate Risk: Diagnosing Clustered Regional Flooding at Inter-Annual and Longer Time Scales

: The potential for extreme climate events to cluster in space and time has driven increased interest in understanding and predicting compound climate risks. Through a case study on ﬂoods in the Ohio River Basin, we demonstrated that low-frequency climate variability could drive spatial and temporal clustering of the risk of regional climate extremes. Long records of annual maximum streamﬂow from 24 USGS gauges were used to explore the regional spatiotemporal patterns of ﬂooding and their associated large-scale climate modes. We found that the dominant time scales of ﬂood risk in this basin were in the interannual (6–7 years), decadal (11–13 years), and secular bands and that different sub-regions within the Ohio River Basin responded differently to large-scale forcing. We showed that the leading modes of streamﬂow variability were associated with ENSO and secular trends. The low-frequency climate modes translated into epochs of increased and decreased ﬂood risk with multiple extreme ﬂoods or the absence of extreme ﬂoods, thus informing the nature of compound climate-induced ﬂood risk. A notable ﬁnding is that the secular trend was associated with an east-to-west shift in the ﬂood incidence and the associated storm track. This is consistent with some expectations of climate change projections.


Introduction
Floods are the leading cause of property damage and lead to billions of annual global losses [1].These losses are expected to worsen through increasing exposure to coastal and river flooding [2,3] and global and local environmental hazards [4].Severe floods tend to cluster in space [5,6] and time [7,8], leading to fat tails in aggregated risks [9,10].
One mechanism for the space-time correlation structure of extreme floods is the interaction between low-frequency hemispheric modes of global climate variability, which influence weather patterns.Paleo records of floods show clustering in time [7,11] and between locations [12].A dominant mode of interannual variability is the El Niño-Southern Oscillation (ENSO), which has been linked to changes in flood risk around the world [7,13,14], particularly for more extreme floods [15].However, many other patterns have also been identified for decadal-scale flood variability around the world [16], including the Pacific Decadal Oscillation (PDO) [17] and North Atlantic Oscillation (NAO).For example, Tommey et al. [18] found that the NAO strongly modulated interannual flood frequency in the Susquehanna River in the Eastern United States.Often, these low-frequency modes of climate variability dominate secular changes in the historical streamflow record of the United States and Europe [19].
Most of the past work on flood risk estimation considered the site-level analysis of extreme events, with limited attention to the spatiotemporal climate risks at a regional level.
The assumption of stationarity across space and time could still be utilized as a default during the design and planning for flood management infrastructure [20], even though its applicability had been questioned, noting secular, as well as epochal inhomogeneities [7,21].The recent literature has shown methods for the incorporation and considerations of nonstationarity in statistical flood risk models using additional covariates [22][23][24][25][26][27][28].
The Ohio River Basin (ORB), located in the eastern United States (Figure 1), covers 204,000 square miles (522,000 sq.km.) and has a population of 25 million.The ORB has a history of notable floods in 1845, 1883, 1884, 1907, 1913, 1933, 1937, 1948, 1964, 1997, and 2018 [29-31].While summer floods are often characterized by locally intensive precipitation leading to pluvial floods [32], major floods tend to occur in early spring or late winter and are caused by persistent anomalies that track moisture from the Gulf of Mexico and the Caribbean Sea to this region [33,34].Past work on floods in the Ohio River Basin has identified common mechanisms associated with the most extreme floods [33][34][35].At the synoptic time scale, each of the floods occurring in different parts of the basin resulted from a sequence of waves of incoming moisture and rainfall from the Gulf of Mexico, from every 4 to 7 days in the January to March period, culminating in an extreme rainfall event that corresponded to the peak flood [33,34].Some relationship between this mechanism to ENSO was identified [34], and the critical conclusion was that changes in winds or atmospheric moisture transport rather than increases in atmospheric moisture were key to flood occurrence.This is an important observation since the dominant concern with future floods has been with the increased moisture-holding capacity of the atmosphere under warming, then with circulation-driven mechanisms.We further highlight the role of the latter.
The Ohio River Basin (ORB), located in the eastern United States (Figure 1), covers 204,000 square miles (522,000 sq.km.) and has a population of 25 million.The ORB has a history of notable floods in 1845, 1883, 1884, 1907, 1913, 1933, 1937, 1948, 1964, 1997, and 2018 [29-31].While summer floods are often characterized by locally intensive precipitation leading to pluvial floods [32], major floods tend to occur in early spring or late winter and are caused by persistent anomalies that track moisture from the Gulf of Mexico and the Caribbean Sea to this region [33,34].Past work on floods in the Ohio River Basin has identified common mechanisms associated with the most extreme floods [33][34][35].At the synoptic time scale, each of the floods occurring in different parts of the basin resulted from a sequence of waves of incoming moisture and rainfall from the Gulf of Mexico, from every 4 to 7 days in the January to March period, culminating in an extreme rainfall event that corresponded to the peak flood [33,34].Some relationship between this mechanism to ENSO was identified [34], and the critical conclusion was that changes in winds or atmospheric moisture transport rather than increases in atmospheric moisture were key to flood occurrence.This is an important observation since the dominant concern with future floods has been with the increased moisture-holding capacity of the atmosphere under warming, then with circulation-driven mechanisms.We further highlight the role of the latter.We built on past research to explore whether the annual floods were equal to or greater than the annual maximum across 24 long record locations that exhibited common spatial patterns and temporal clustering.The annual maximum streamflow at these locations was considered to not be significantly modified by local human activities or regulations by the US Geological Survey during the period of study.This criterion results in locations that have relatively small drainage basins.
These data constraints limit the potential for assessing a major flood event where the entire Ohio River network may be flooded simultaneously.However, a season or year in which a large number of extreme floods occurs in the basin will lead to higher flood damages, even if the floods are well separated in time and space, in that year.This is the target of our study from the perspective of compound flood risk in the Ohio River Basin.The time series of annual maximum flows across the basin that exceed the nominal 10-year event at each site is illustrated in Figure 2. Note that there are five years with ten or more sites where the 10-year return period event was exceeded over an 87-year period of record.We build on past research to explore whether the annual floods equal to or greater than the annual maximum across 24 long record locations exhibit common spatial patterns and temporal clustering.The annual maximum streamflow at these locations was considered to not be significantly modified by local human activities or regulations by the US Geological Survey during the period of study.This criterion results in locations that have relatively small drainage basins.
These data constraints limit the potential for assessing a major flood event where the entire Ohio River network may be flooded simultaneously.However, a season or year in which a large number of extreme floods occurs in the basin will lead to higher flood damages, even if the floods are well separated in time and space, in that year.This is the target of our study from the perspective of compound flood risk in the Ohio River Basin.The time series of annual maximum flows across the basin that exceed the nominal 10-year event at each site is illustrated in Figure 2. Note that there are five years with ten or more sites where the 10-year return period event was exceeded over an 87-year period of record.If the data were independently distributed as random variables in space and time, then the probability of these outcomes would essentially be zero (based on the binomial distribution).There are more than twenty years with zero occurrences of the 10-year event across the sites, and under the independence hypothesis, the probability of this outcome is also essentially zero.These observations are consistent with the idea that one should be concerned with the compound risk of floods at an annual scale in the Ohio River Basin.The question this paper then addresses is whether there are spatial patterns associated with floods in the Ohio River Basin and whether there is a corresponding temporal structure to the occurrence of these spatial patterns that result in years of non-occurrence and with high occurrence of these events.If the answer is yes, then the question is whether the large-scale climate patterns that lead to these emergent flood patterns in time and space in the Ohio River Basin can be identified.If the data were independently distributed as random variables in space and time, then the probability of these outcomes would essentially be zero (based on the binomial distribution).There are more than twenty years with zero occurrences of the 10-year event across the sites, and under the independence hypothesis, the probability of this outcome is also essentially zero.These observations are consistent with the idea that one should be concerned with the compound risk of floods at an annual scale in the Ohio River Basin.The question this paper then addresses is whether there are spatial patterns associated with floods in the Ohio River Basin and whether there is a corresponding temporal structure to the occurrence of these spatial patterns that result in years of non-occurrence and with high occurrence of these events.If the answer is yes, then the question is whether the large-scale climate patterns that lead to these emergent flood patterns in time and space in the Ohio River Basin can be identified.Section 2 provides a description of the streamflow data and the climate indices used in this study.Section 3 describes the methods used to study the space-time signatures of annual maxima streamflow events and their relation to the known modes of atmospheric variability.Results are presented in Section 4, followed by the conclusion in Section 5.

Streamflow Data
The daily streamflow data were downloaded from the United States Geological Survey (USGS) water databases using the 'dataRetrieval' package [36].Stream gauges located in the Ohio river basin with a maximum of 0.1% of missing data and a drainage area greater than 3750 sq.miles were selected.Sites with a significant regulation of flow during the period of study affecting the annual maxima were not considered.Using the above criteria, 24 sites (Figure 1) were included in this study, with each location having 87 years of data from 1935-2021.The daily streamflow time series at each site was used to identify the annual maximum streamflow for each water year, which started in October (1/10) of the previous year and ended in September (30/9) of the current year.

Climate Indices
Climate indices are time series of diagnostic quantities that are used to characterize hydro-climatic systems based on data from climate stations, grid points, regional averaged data, or computed from empirical orthogonal functions and usually involve a single field, most commonly sea surface temperature anomalies [37].The most commonly used climate indices are the El Niño Southern Oscillation index, Pacific Decadal Oscillation, North Atlantic Oscillation, and Atlantic Multi-decadal Oscillation.Section 2 provides a description of the streamflow data and the climate indices used in this study.Section 3 describes the methods used to study the space-time signatures of annual maxima streamflow events and their relation to the known modes of atmospheric variability.Results are presented in Section 4, followed by the conclusion in Section 5.

Streamflow Data
The daily streamflow data were downloaded from the United States Geological Survey (USGS) water databases using the 'dataRetrieval' package [36].Stream gauges located in the Ohio river basin with a maximum of 0.1% of missing data and a drainage area greater than 3750 sq.miles were selected.Sites with a significant regulation of flow during the period of study affecting the annual maxima were not considered.Using the above criteria, 24 sites (Figure 1) were included in this study, with each location having 87 years of data from 1935-2021.The daily streamflow time series at each site was used to identify the annual maximum streamflow for each water year, which started in October (1/10) of the previous year and ended in September (30/9) of the current year.

Climate Indices
Climate indices are time series of diagnostic quantities that are used to characterize hydro-climatic systems based on data from climate stations, grid points, regional averaged data, or computed from empirical orthogonal functions and usually involve a single field, most commonly sea surface temperature anomalies [37].The most commonly used climate indices are the El Niño Southern Oscillation index, Pacific Decadal Oscillation, North Atlantic Oscillation, and Atlantic Multi-decadal Oscillation.
The Niño 3.4 index, used as an indicator of the El Niño Southern Oscillation phenomena as the dominant mode of global variability influencing climate globally, is computed from sea surface temperature anomalies in the equatorial Pacific [38].The Pacific Decadal Oscillation (PDO) index is computed as the first principal component of the Northern Pacific sea surface temperature anomalies [17,39].The Atlantic Multidecadal Oscillation (AMO) is computed from the sea surface temperature anomalies in the northern Atlantic basin [40].The North Atlantic Oscillation (NAO) index, unlike the above-mentioned indices, is computed as the surface sea-level pressure difference between the subtropical high (Azores or Gibraltar) and the sub-polar low (Iceland) [41].
The Niño 3.4, NAO, PDO, and AMO indices were converted from monthly resolution to annual time-scale by computing their mean Dec-Jan-Feb values, which corresponded to boreal winter.This seasonal choice was made since most of the extreme floods in the basin occur in the January-April period.They were standardized before use in this study.Interactions between the indices were computed as the product of departures from their respective means.
where x t , y t , and xy t are the values of the climate indices x, y, and their interaction at time t, respectively.x and y are the mean values of climate index x and y, respectively.The KNMI Climate Explorer (https://climexp.knmi.nl/start.cgi(accessed on 10 October 2022)) was used as the primary data extraction source for these indices.Overall, ENSO, NAO, PDO, and AMO, along with their interactions, were used in this study, where their connections to the leading modes of streamflow variability within the Ohio River Basin that are identified were explored.Double labels, for example, ENSO-NAO, denote interactions between ENSO and NAO.

Methods
We considered two complementary strategies for the diagnosis of multi-site lowfrequency variations and their associated space-time signatures in basin-wide annual maximum streamflow.From a regional perspective, we hypothesized that the time series at all sites could be modulated at the same frequencies if they were influenced by larger-scale climate oscillations that have marked quasi-periodic variability.Such common large-scale drivers can also induce a spatial correlation structure in the annual maxima flow series across the sites, even if the flows do not occur simultaneously across all the sites in each year.All the larger extreme floods occur in the same season and often correspond to recurrent synoptic waves of incoming atmospheric moisture every 4-7 days [33,34].Thus, some sites may experience the annual maximum event earlier than others, while all sites have elevated flows in such a season.Two complementary approaches were explored to diagnose the multi-site low-frequency variations and spatiotemporal structure in annual maxima streamflow: (1) PC-Wavelet and (2) Wave-Clust (Figure 3).
The Niño 3.4 index, used as an indicator of the El Niño Southern Oscillation phenomena as the dominant mode of global variability influencing climate globally, is computed from sea surface temperature anomalies in the equatorial Pacific [38].The Pacific Decadal Oscillation (PDO) index is computed as the first principal component of the Northern Pacific sea surface temperature anomalies [17,39].The Atlantic Multidecadal Oscillation (AMO) is computed from the sea surface temperature anomalies in the northern Atlantic basin [40].The North Atlantic Oscillation (NAO) index, unlike the above-mentioned indices, is computed as the surface sea-level pressure difference between the subtropical high (Azores or Gibraltar) and the sub-polar low (Iceland) [41].
The Niño 3.4, NAO, PDO, and AMO indices were converted from monthly resolution to annual time-scale by computing their mean Dec-Jan-Feb values, which corresponded to boreal winter.This seasonal choice was made since most of the extreme floods in the basin occur in the January-April period.They were standardized before use in this study.Interactions between the indices were computed as the product of departures from their respective means.
where   ,   , and   are the values of the climate indices x, y, and their interaction at time t, respectively.̅ and  ̅ are the mean values of climate index x and y, respectively.The KNMI Climate Explorer (https://climexp.knmi.nl/start.cgi(accessed on 10 October 2022)) was used as the primary data extraction source for these indices.Overall, ENSO, NAO, PDO, and AMO, along with their interactions, were used in this study, where their connections to the leading modes of streamflow variability within the Ohio River Basin that are identified were explored.Double labels, for example, ENSO-NAO, denote interactions between ENSO and NAO.

Methods
We considered two complementary strategies for the diagnosis of multi-site low-frequency variations and their associated space-time signatures in basin-wide annual maximum streamflow.From a regional perspective, we hypothesized that the time series at all sites could be modulated at the same frequencies if they were influenced by larger-scale climate oscillations that have marked quasi-periodic variability.Such common large-scale drivers can also induce a spatial correlation structure in the annual maxima flow series across the sites, even if the flows do not occur simultaneously across all the sites in each year.All the larger extreme floods occur in the same season and often correspond to recurrent synoptic waves of incoming atmospheric moisture every 4-7 days [33,34].Thus, some sites may experience the annual maximum event earlier than others, while all sites have elevated flows in such a season.Two complementary approaches were explored to diagnose the multi-site low-frequency variations and spatiotemporal structure in annual maxima streamflow: (1) PC-Wavelet and (2) Wave-Clust (Figure 3).

PC-Wavelet
The PC-Wavelet method (Figure 3 (left)) corresponds to a method that analyzes the spatial structure of the multi-site data followed by a time-frequency analysis of the resulting spatial patterns.A principal component analysis (PCA) [42] was performed on a correlation matrix of the annual maxima streamflow data across the 24 sites to achieve a reduction in the spatial dimension.The eigenvectors of the leading principal components identified the patterns of spatial variability.The number of leading principal components to be analyzed was decided by the variance (eigenvalues), as explained by the PCs.Each leading PC was then subjected to wavelet analysis to identify the common low-frequency variations across time if any.The following sub-section can be referred to for further details on wavelet analysis.

Wave-Clust
The first step of the Wave-Clust method (Figure 3 (right)) entails wavelet analysis on the annual maxima streamflow time series of the 24 sites, followed by the hierarchical clustering of the resulting wavelet transforms.Hierarchical clustering, a form of agglomerative clustering, was used to partition objects in a set based on measures of similarity, with the most common being the distance between the objects [43].In this study, 'Ward Distance', which minimizes the within-cluster variance, was used as the distance measure for cluster separation [44].The hierarchical clustering was applied to the time-varying wavelet power across each of the frequencies/scales at each site.The hierarchical cluster analysis on the time-frequency structure helped identify clusters that participated in similar climate patterns but were not necessarily orthogonal or statistically independent.The selection of the number of clusters to use was performed through a visual inspection of the dendrogram and the dissimilarity measure.Once the spatial clusters were identified, we performed a PCA on only the time series in the cluster and examined the time-frequency structure of the leading PC using wavelet analysis to identify the dominant time-frequency mode for that cluster.The eigenvectors and principal component scores for the first principal component for each cluster gave the spatial dependence structure and temporal variation within that cluster, respectively.
Figure 3 shows a schematic of this method, with an explicit east-west clustering divide and the identification of low-frequency signals from the leading principal component of a cluster by means of wavelet analysis.The space-time patterns identified from PC-Wavelet may or may not be similar to those identified from the Wave-Clust, and consequently, they may exhibit different relations to the larger-scale climate indices.The wavelet analysis, which is a building block of both the PC-Wavelet and Wave-Clust methods, is summarized below.

Wavelet Analysis
Wavelet analysis was used as a tool to analyze the localized power variations in a time series.We applied it here to analyze the time-frequency structure in the annual maximum streamflow series at each site and also for each PC.The presentation below follows [45], where the continuous wavelet transform (CWT) of the discrete-time series x t of length N (t = 0 to N − 1), with discrete time spacing, δt is defined by: where ψ * is the complex conjugate of the wavelet function ψ, and s is the wavelet scale.The Morlet wavelet function was utilized in this study.The wavelet function ψ(t) is complex, i.e., it has a real and an imaginary part.The variations in the wavelet scale s and translations along the localized time index allow for the analysis of both the change in amplitude versus scale and the change in amplitude versus time.A faster method to compute the wavelet transform is via calculations in Fourier space, and to approximate the continuous wavelet transform, a convolution at each scale s was conducted N times.By using the convolution theorem, the continuous wavelet transform is the inverse Fourier transform of the product [45] given by where xk is the discrete Fourier transform of x it , and ω k is the angular frequency.The wavelet transform W t (s) is complex and can be divided into real and imaginary parts.The wavelet power spectrum is defined as the square of its amplitude and is given by The significance level associated with the wavelet spectrum is tied explicitly to the choice of background spectrum, which for hydro-climatic data, is often taken to be a red noise or white noise spectrum.The red noise significance test developed by [45] involves fitting an AR(1) model to the time series and computing its Fourier spectrum where the associated one-sided (1 − α)% confidence limits give the α% significance levels over red noise at the scale.We guide the reader to [45] for details.
For the Wave-Clust, at each site i, the annual maximum flood series x it is transformed into W ti (s) as above, and the hierarchical clustering method is used to divide the sites i = 1 . . .ns into sub-groups based on the similarity of the W ti (s).For the PC-Wavelet analysis, the time series of a principal component is used as x it , and the resulting W t (s) is examined for significant variance at each scale s.

Diagnostic Analysis of the Role of Low Frequency Climate Variation
The climate indices ENSO, NAO, PDO, AMO, and their interactions with the global annual temperatures, are considered potential candidates which influence streamflow variability within the Ohio River Basin.We analyzed the relationship of these climate indices with the leading modes identified from the PC-Wavelet or Wave-Clust.These relationships were analyzed first through correlation analysis, followed by linear and non-linear regression methods.A further description of these methods is provided below.

Correlation Analysis
Wavelet coherence is used to assess the relationship between the dominant modes of variability in the annual maximum streamflow data and the known global or hemispheric modes of atmospheric variability [45,46].The wavelet coherence R 2 t is defined as: where W x t is the continuous wavelet transform of x t and '*' denotes the complex conjugate of W y t , which is the continuous wavelet transform of y t ; S is a smoothing operator, and s is the scale.The wavelet coherence can be thought of as a localized correlation between twotime series in the time-frequency domain, with the above equation resembling a correlation coefficient equation.We refer the reader to [45,46] for further details on this method.The wavelet coherences were computed and generated using the 'biwavelet' package in R [47].Furthermore, we also looked at the direct correlation between the PCs and the climate indices using the Pearson correlation coefficient [48].

Linear Regression with Regularization and Variable Selection
The least absolute shrinkage and selection operator (LASSO) is a linear regression method that performs both regularization and variable selection [49].Lasso is a penalized selection method that minimizes the residual sum of squares, which is subject to the sum of the absolute value of the coefficients and is less than a constant [50].The coefficients β j of the linear relationship between a response variable Y i and p predictors X ij were solved by minimizing the cost function: where, λ ≥ 0 is the shrinkage parameter.This is subject to the constraint: The parameter λ controls the magnitude of shrinkage that is applied to the model and is selected using leave-one-out cross-validation.This formulation leads to a sparse solution where several of the coefficients β j are set to zero if the corresponding predictor does not contribute significantly to the dependent variable.This translates into an automatic variable selection.This analysis was carried out using the glmnet package [51] in R.

Non-Linear Regression
Random forests are a class of ensemble learning methods that use decision trees as the building blocks [52].They can be used for either classification or in a regression setup.In a regression setup, the mean of the predictions across the ensemble of decision trees is used as the estimate.
A random forest consists of an ensemble of decision trees, where each tree is constructed using a bootstrapped (sampling with replacement) sample of the data.The CART algorithm is then used to train the decision trees.In addition to the bootstrap, a subset of the variables is selected at random, and the best split is based on that limited subset only (52).Since the input training data for each tree are drawn at random, and a subset of the variables are used for node splitting, random forests overcome the issue of overfitting and the high variance faced by modeling individual decision trees.The bootstrapping of the training data, leaving out a fraction of the data that is not used for training that decision tree, is then used to compute the out-of-bag estimate, thereby providing validation in spite of using the entire data to train the algorithm.
Random forest models allow for the computation of a variable importance measure, which can be used to perform variable selection or identify relevant variables/features.The variable importance is decided using the Gini impurity criterion [53].The 'Random-Forest' [54] package in R is used to fit these models.

Diagnosis of Low Frequency Variations and Space-Time Signatures for the ORB
The PC-Wavelet and Wave-Clust methods were applied to the annual maximum streamflow data across the Ohio river basin to identify the low-frequency variations and space-time signatures of the data (Figures 4 and 5).The first and second principal components of annual maximum streamflow data explain ~39% and 16% of the total variance, respectively, adding up to 56% of the total variance (Figure A1).Hierarchical clustering, when applied to the wavelet-transformed streamflow data, led to the identification of two primary clusters among the stream gauges (Figure A2).The leading two principal components and the two clusters from the Wave-Clust were subsequently analyzed further.The two methods led to consistent results in the identification of the dominant space-time frequency structure of floods for the leading PC (Figure 4) but diverged for the second dominant mode of variability (Figure 5).frequency structure of floods for the leading PC (Figure 4) but diverged for the second dominant mode of variability (Figure 5).Figures 4 and 5 show the application of the PC-Wavelet and Wave-Clust methods to the leading and second dominant mode of variability affecting the annual maxima streamflow within the Ohio River Basin.For PC1, the associated spatial pattern reflected a positive correlation across all but two of the southeastern gauges.Focusing on the temporal domain, the wavelet spectrum for PC-1 of the entire field (Figure 4 (a-C)) and for the first wavelet cluster (Figure 4 (b-C)) showed a sharp peak around 6-7 years, which was statistically significant at the 90% level relative to a red noise null hypothesis.This finding was particularly evident pre-1960 (sub-panel D).Visually, the time series (sub-panel A) exhibited a much higher variance pre-1970 than in the 1970-1990 period.Episodic variability organized over an approximately three-year period was also seen to be statistically significant at the 90% level relative to the red noise null hypothesis.The second principal component from PC-Wavelet (Figure 5a) and the leading PC of the second largest cluster (Figure 5b) both contained large secular trends.The 1930s-1960s are periods characterized by a large number of exceedances (Figure 2) and where the wavelet power spectrum is significant (Figure 4(a-D,b-D)).It also shows that large regions have similar spectral signatures, possibly pointing to regions that have similar dynamics or are forced by global external climatic variability.
The leading mode captures the dynamics of the aggregate (lower Ohio) basin with a key 6-7-year periodicity and a weaker 12-year cycle (Figure 4).The second mode highlights the eastern-western divide within the basin with few characteristic low-frequency variations but also indicates the presence of a strong trend.This secular trend could be modulated by an extremely low-frequency variation or anthropogenic climate change.It is likely to be related to a systematic displacement west of the meridional moisture flow from the Gulf of Mexico coming into the Ohio River Basin [55,56].Overall the order of processing space-time (PC-Wavelet) or time-space (Wave-Clust) can lead to different insights.For PC1 and cluster 1, the difference is small because the same dominant pattern was identified, and the dominant cluster had 18 (out of 24) stations representing the group's frequency behavior.The PC1 eigenvectors suggest that this is more or less the average spatial behavior across the sites since they are all of the same Figures 4 and 5 show the application of the PC-Wavelet and Wave-Clust methods to the leading and second dominant mode of variability affecting the annual maxima streamflow within the Ohio River Basin.For PC1, the associated spatial pattern reflected a positive correlation across all but two of the southeastern gauges.Focusing on the temporal domain, the wavelet spectrum for PC-1 of the entire field (Figure 4(a-C)) and for the first wavelet cluster (Figure 4(b-C)) showed a sharp peak around 6-7 years, which was statistically significant at the 90% level relative to a red noise null hypothesis.This finding was particularly evident pre-1960 (sub-panel D).Visually, the time series (sub-panel A) exhibited a much higher variance pre-1970 than in the 1970-1990 period.Episodic variability organized over an approximately three-year period was also seen to be statistically significant at the 90% level relative to the red noise null hypothesis.The second principal component from PC-Wavelet (Figure 5a) and the leading PC of the second largest cluster (Figure 5b) both contained large secular trends.The 1930s-1960s are periods characterized by a large number of exceedances (Figure 2) and where the wavelet power spectrum is significant (Figure 4(a-D,b-D)).It also shows that large regions have similar spectral signatures, possibly pointing to regions that have similar dynamics or are forced by global external climatic variability.
The leading mode captures the dynamics of the aggregate (lower Ohio) basin with a key 6-7-year periodicity and a weaker 12-year cycle (Figure 4).The second mode highlights the eastern-western divide within the basin with few characteristic low-frequency variations but also indicates the presence of a strong trend.This secular trend could be modulated by an extremely low-frequency variation or anthropogenic climate change.It is likely to be related to a systematic displacement west of the meridional moisture flow from the Gulf of Mexico coming into the Ohio River Basin [55,56].
Overall the order of processing space-time (PC-Wavelet) or time-space (Wave-Clust) can lead to different insights.For PC1 and cluster 1, the difference is small because the same dominant pattern was identified, and the dominant cluster had 18 (out of 24) stations representing the group's frequency behavior.The PC1 eigenvectors suggest that this is more or less the average spatial behavior across the sites since they are all of the same sign.The differences across sites are in the eigenvector coefficients, representing the varying degree of participation in the regional pattern.In this case, the inter-annual modes are identified as the regional feature by the space-time (PC-Wavelet) analysis.Correspondingly, the stations that participated the most in the same frequency structure are identified as the dominant cluster by the time-space (Wave-Clust) analysis.
However, since the clustering process is disjunctive, in that stations assigned to cluster 1 cannot show up in cluster 2, the second cluster now represents a much smaller set of stations, and the dominant feature in these stations emerges as a secular trend or shift.By this time-space (Wave-Clust) method, this shift is associated with just these four stations.
However, by looking at the space-time (PC-Wavelet) analysis, we can see that the corresponding PC-2 actually exhibits a spatial pattern that has a dipole structure, and in conjunction with the trend in the time series, we understand that this is also a regional pattern with a secular trend that indicates a post-1960 decline in the flood magnitudes in the Eastern part of the basin relative to the Western part of the basin.The time series of the PC-2 and the average of Cluster 2 are similar, with the only difference being that cluster 2 has a minimal inter-annual variability, and the long-term trend dominates.

Diagnosis of Relations with Climate Indices
As hypothesized earlier, large-scale climate drivers may induce a spatiotemporal structure in the annual maxima streamflow series across sites, even if the flows do not occur simultaneously across all the sites in each year.The large-scale climate indices considered were ENSO, PDO, NAO, AMO, and their interactions with each other.We also included the lagged annual global temperature as a proxy of the long-term climate trend due to anthropogenic climate change.Overall, we explored the possibility of the joint influence of global climate variability through these climate indices in addition to the climate change signal, which influenced the climate indices.Since the modes computed from the PC-Wavelet and Wave-Clust are similar and PC-Wavelet represented the entire region, the modes computed from PC-Wavelet were retained for subsequent analysis.

Correlation Analysis
The leading modes of streamflow variability within the Ohio River Basin were first analyzed to explore the influence of hemispheric climate variability through climate indices and the long-term global warming trend.The leading principal component (PC-1), which explains 39% of the total variance, is the dominant mode of the regional flow variability.It has a significant Pearson correlation coefficient with only ENSO (r = −0.27)at the 5% level (Figure 6).PC-1's correlation with other indices, including temperature, which serves as a proxy for the anthropogenic climate trend, is low and non-significant.Figure 6 provides the correlation values between PC-1, PC-2, and other climate indices.
included the lagged annual global temperature as a proxy of the long-term climate trend due to anthropogenic climate change.Overall, we explored the possibility of the joint influence of global climate variability through these climate indices in addition to the climate change signal, which influenced the climate indices.Since the modes computed from the PC-Wavelet and Wave-Clust are similar and PC-Wavelet represented the entire region, the modes computed from PC-Wavelet were retained for subsequent analysis.

Correlation Analysis
The leading modes of streamflow variability within the Ohio River Basin were first analyzed to explore the influence of hemispheric climate variability through climate indices and the long-term global warming trend.The leading principal component (PC-1), which explains 39% of the total variance, is the dominant mode of the regional flow variability.It has a significant Pearson correlation coefficient with only ENSO (r = −0.27)at the 5% level (Figure 6).PC-1's correlation with other indices, including temperature, which serves as a proxy for the anthropogenic climate trend, is low and non-significant.Figure 6 provides the correlation values between PC-1, PC-2, and other climate indices.In the time-frequency domain, PC-1 has a strong peak in variance near the 6-7-year band, which is significant at the 10% level relative to a hypothesis of red or white noise, along with a weaker and lower frequency of a 12-year cycle (Figure 4 (a-C)).The global wavelet spectrum of the Niño 3.4 index (ENSO) averaged over Dec-Jan-Feb shows elevated variance associated with the 5-7 year band, which is significant at the 10% level, a key characteristic of the ENSO phenomenon, along with a weaker 12-14 year periodicity (Figure 7A).The power associated with the 5-7 year interannual band was highest in the broad periods of our data records  (Figure 7B).The wavelet coherence between PC-1 and the ENSO index showed high coherency in the 12-year cycle post-1980s and was significant at the 10% level (Figure 7C), which is suggestive of a connection between the two.Further, the phase angle indicates that ENSO leads PC-1 by a period of 1 year at this frequency.
The second principal component PC-2, which explains 16% of the total variance in annual maxima streamflow variability, was characterized by a long secular trend (Figure In the time-frequency domain, PC-1 has a strong peak in variance near the 6-7-year band, which is significant at the 10% level relative to a hypothesis of red or white noise, along with a weaker and lower frequency of a 12-year cycle (Figure 4(a-C)).The global wavelet spectrum of the Niño 3.4 index (ENSO) averaged over Dec-Jan-Feb shows elevated variance associated with the 5-7 year band, which is significant at the 10% level, a key characteristic of the ENSO phenomenon, along with a weaker 12-14 year periodicity (Figure 7A).The power associated with the 5-7 year interannual band was highest in the broad periods of our data records  (Figure 7B).The wavelet coherence between PC-1 and the ENSO index showed high coherency in the 12-year cycle post-1980s and was significant at the 10% level (Figure 7C), which is suggestive of a connection between the two.Further, the phase angle indicates that ENSO leads PC-1 by a period of 1 year at this frequency.
The second principal component PC-2, which explains 16% of the total variance in annual maxima streamflow variability, was characterized by a long secular trend (Figure 5(a-a-C)).PC-2 has high correlations with NAO (r = −0.26),PDO-AMO (r = 0.28), and temperature (r = −0.43),all of which are significant at the 5% level (Figure 6).Overall, the nature of the trend in PC-2, which filters information on the east-west divide, mirrors to a large degree the long-term trend in temperature that is driven by anthropogenic warming.
5 (a-a-C)).PC-2 has high correlations with NAO (r = −0.26),PDO-AMO (r = 0.28), and temperature (r = −0.43),all of which are significant at the 5% level (Figure 6).Overall, the nature of the trend in PC-2, which filters information on the east-west divide, mirrors to a large degree the long-term trend in temperature that is driven by anthropogenic warming.

Linear Regression
Figure 8 (top) shows the cross-validated results for lasso regression with PC-1 as the dependent variable and the climate indices with their interactions and temperature as the independent variables.ENSO is the only variable selected, whereas the coefficients of all other variables are pushed to zero when the cross-validated mean squared error is the lowest.This corresponds to the dashed line in Figure 8 (top).Furthermore, the removal of the other ten variables leads to a decrease in the cross-validated error, indicating that no linear combination of any subset for the non-ENSO variables is useful for predicting PC-1.
Lasso regression with PC-2 as the dependent variable results in the inclusion of multiple variables for the lowest cross-validated error.Temperature, PDO, NAO, AMO, and PDO-AMO are the selected variables when the shrinkage penalty is log(λ) ≈ −2.5.The other dotted line (log(λ) ≈ −1) to the right in Figure 8 (bottom) corresponds to the largest value of the shrinkage parameter (most parsimonious with the least number of predictors) such that the value of the cross-validated error is within one standard deviation of the minimum error.This scenario includes only temperature as the independent variable, with all others, pushed to zero, highlighting the role of temperature as the most important variable in this subset for predicting PC-2.

Linear Regression
Figure 8 (top) shows the cross-validated results for lasso regression with PC-1 as the dependent variable and the climate indices with their interactions and temperature as the independent variables.ENSO is the only variable selected, whereas the coefficients of all other variables are pushed to zero when the cross-validated mean squared error is the lowest.This corresponds to the dashed line in Figure 8 (top).Furthermore, the removal of the other ten variables leads to a decrease in the cross-validated error, indicating that no linear combination of any subset for the non-ENSO variables is useful for predicting PC-1.

Non-Linear Regression
Random forest models were fit separately with PC-1 and PC-2 as dependent variables.Each model was fit with 10,000 individual trees, and √11 ≈ 3 variables were used at each split.The models, as non-linear regression counterparts of the lasso, were used to identify a subset of pertinent variables to the dependent variables.
Figure 9A denotes variable importance when PC-1 is used as a dependent variable in a random forest regression setting.ENSO came up as the most important variable in this Lasso regression with PC-2 as the dependent variable results in the inclusion of multiple variables for the lowest cross-validated error.Temperature, PDO, NAO, AMO, and PDO-AMO are the selected variables when the shrinkage penalty is log(λ) ≈ −2.5.The other dotted line (log(λ) ≈ −1) to the right in Figure 8 (bottom) corresponds to the largest value of the shrinkage parameter (most parsimonious with the least number of predictors) such that the value of the cross-validated error is within one standard deviation of the minimum error.This scenario includes only temperature as the independent variable, with all others, pushed to zero, highlighting the role of temperature as the most important variable in this subset for predicting PC-2.

Non-Linear Regression
Random forest models were fit separately with PC-1 and PC-2 as dependent variables.Each model was fit with 10,000 individual trees, and √ 11 ≈ 3 variables were used at each split.The models, as non-linear regression counterparts of the lasso, were used to identify a subset of pertinent variables to the dependent variables.
Figure 9A denotes variable importance when PC-1 is used as a dependent variable in a random forest regression setting.ENSO came up as the most important variable in this case, followed by ENSO-AMO and ENSO-PDO.The temperature showed up as the most important variable when PC-2 was the dependent variable (Figure 9B).This was followed by PDO-AMO, NAO, PDO-NAO, and AMO-NAO, which had similar levels of importance.

Non-Linear Regression
Random forest models were fit separately with PC-1 and PC-2 as dependent bles.Each model was fit with 10,000 individual trees, and √11 ≈ 3 variables were at each split.The models, as non-linear regression counterparts of the lasso, were u identify a subset of pertinent variables to the dependent variables.
Figure 9A denotes variable importance when PC-1 is used as a dependent varia a random forest regression setting.ENSO came up as the most important variable case, followed by ENSO-AMO and ENSO-PDO.The temperature showed up as the important variable when PC-2 was the dependent variable (Figure 9B).This was fol by PDO-AMO, NAO, PDO-NAO, and AMO-NAO, which had similar levels o portance.
(A) (B) Overall, given the regression analysis on climate indices in addition to the diagnostic analysis, our interpretation is that the region has two dominant modes of long-term variability.The leading principal component PC-1 is associated with ENSO and marked by inter-annual variability, and it seems to have a basin-wide impact.The second leading mode, PC-2, is characterized by a long secular trend, and its primary climate association appears to be with the global warming trend, with a possible association identified between some of the lower frequency climate indices.PC-2 reflects a west-east shift in the incidence of flooding since the 1960s and correlated best with global annual temperature.If indeed this is due to anthropogenic climate change-induced global warming, then it is an interesting observation indicating a spatial shift in the sub-basins that are likely to be flooded.

Summary
This paper was motivated by the need to better understand compound flood risks at the river basin scale in terms of the potential for spatial and temporal clustering of floods.We noted that the temporal pattern of the number of annual exceedances for the 10-year return period annual maximum flood at the 24 sites was extremely unlikely to have occurred by chance if the data indeed included spatially and temporally independent random variables.This begged the question of whether there were spatial patterns of flooding with distinct quasi-periodic or secular trends that could be related to large-scale climate variability, including anthropogenic global climate change.
Recognizing that the answer to this question may be sensitive to whether we first identified dominant spatial patterns and then looked at the time-frequency structure or if we identified the time-frequency structure at each site and then looked for spatial similarities in those patterns, we used the PC-Wavelet and the Wave-Clust methods to decompose the space-time-frequency structure in the 24-time series.For the leading modes of variability that accounted for about 56% of the spatial correlation in the flood occurrence process, we found that the order of processing space-time or time-space made a difference in the insights that could be drawn.The leading PC represents a common behavior across the Ohio River Basin with quasi-periodic variability at the inter-annual and decadal time scales that appeared to be associated with similar variability in the ENSO index.The second PC represented an east-west dipole or negative correlation that was associated with a pronounced secular trend and appeared to be associated most strongly with the global temperature index reflecting a shift in the likelihood of flooding from the eastern to the western part of the basin.This is an interesting find that reinforces not just increased moisture in the atmosphere due to warming but subtle shifts in circulation modes, which lead to spatial shifts in flood occurrence that may be of considerable interest for researchers to understand as they project future flood risks, especially from a compound risk perspective; this translates into a need to better understand the spatial structure of flood incidence over the entire flood season in a regional context of the larger river basin.The analysis also highlights the need to develop statistical methods for the multi-scale conditioning of compound extremes accounting for the slower climate variations that induce synoptic event structures that favor such clustering.

Figure 1 .
Figure 1.The Ohio River Basin domain (shaded in blue).The red dots denote the stations used in this study.

Figure 1 .
Figure 1.The Ohio River Basin domain (shaded in blue).The red dots denote the stations used in this study.

Figure 2 .
Figure 2. Number of annual maximum flows exceeding the at-site 10-year return period event each year across 24 streamflow gaging locations in the Ohio River Basin from the period 1935 to 2021.

Figure 2 .
Figure 2. Number of annual maximum flows exceeding the at-site 10-year return period event each year across 24 streamflow gaging locations in the Ohio River Basin from the period 1935 to 2021.

Figure 3 .
Figure 3. Schematic of methods for time series analysis of multi-site streamflow data to analyze low frequency variation, non-stationarity, and space-time signatures of streamflow extremes.(left)-PC-Wavelet.(right)-Wave-Clust.

Figure 4 .
Figure 4. PC-Wavelet analysis ((a)) on the first principal component of the annual maximum streamflow data.Wave-Clust analysis ((b)) on the leading principal component of the first largest cluster of the annual maximum streamflow.For each subplot: (A)-The principal component score with the local polynomial regression (7-year span) fit was in red.(B)-The eigenvectors of the principal component for the associated stream gauges.The color grey denotes stream-gauges not in the cluster for Wave-Clust.(C)-Global wavelet spectrum of the principal component score.The red and black lines correspond to the 90% significance level for red and white noise, respectively.(D)-Wavelet power spectrum of the principal component score.The regions bounded in thick contour line are significant.

Figure 4 .Figure 5 .
Figure 4. PC-Wavelet analysis ((a)) on the first principal component of the annual maximum streamflow data.Wave-Clust analysis ((b)) on the leading principal component of the first largest cluster of the annual maximum streamflow.For each subplot: (A)-The principal component score with the local polynomial regression (7-year span) fit was in red.(B)-The eigenvectors of the principal component for the associated stream gauges.The color grey denotes stream-gauges not in the cluster for Wave-Clust.(C)-Global wavelet spectrum of the principal component score.The red and black lines correspond to the 90% significance level for red and white noise, respectively.(D)-Wavelet power spectrum of the principal component score.The regions bounded in thick contour line are significant.Hydrology 2023, 10, x FOR PEER REVIEW 9 of 16

Figure 5 .
Figure 5. PC-Wavelet analysis ((a)) on the second principal component of the annual maximum streamflow data.Wave-Clust analysis ((b)) on the leading principal component of the second largest cluster of the annual maximum streamflow.For each subplot: (A)-The principal component score with the local polynomial regression (7-year span) fit in red.(B)-The eigenvectors of the principal component for the associated stream gauges.The color grey denotes stream-gauges not in the cluster for Wave-Clust.(C)-Global wavelet spectrum of the principal component score.The red and black lines correspond to the 90% significance level for red and white noise, respectively.(D)-Wavelet power spectrum of the principal component score.The regions bounded in thick contour line are significant.

Figure 6 .
Figure 6.Pearson correlation coefficients of the leading (PC1) and second (PC2) principal components of the entire data with climate indices.Values in white are not significant at the 5% level.Double labels, for example, ENSO-NAO, denote interactions between ENSO and NAO.Temp denotes the global annual temperature time series.

Figure 6 .
Figure 6.Pearson correlation coefficients of the leading (PC1) and second (PC2) principal components of the entire data with climate indices.Values in white are not significant at the 5% level.Double labels, for example, ENSO-NAO, denote interactions between ENSO and NAO.Temp denotes the global annual temperature time series.

Figure 7 .
Figure 7. PC-1 and ENSO Index.(A)-Global wavelet spectrum of ENSO index.The red and black line in the global wavelet spectrum denote the 90% level relative to a hypothesis of red or white noise.(B) Wavelet power spectrum of the ENSO index.(C) Wavelet coherence between the 1st principal component of the data and the ENSO index.Warmer colors denote higher power/correlation, while cooler colors denote lower power/correlation.The regions bounded in thick contour lines are significant.

Figure 7 .
Figure 7. PC-1 and ENSO Index.(A)-Global wavelet spectrum of ENSO index.The red and black line in the global wavelet spectrum denote the 90% level relative to a hypothesis of red or white noise.(B) Wavelet power spectrum of the ENSO index.(C) Wavelet coherence between the 1st principal component of the data and the ENSO index.Warmer colors denote higher power/correlation, while cooler colors denote lower power/correlation.The regions bounded in thick contour lines are significant.

Figure 8 .
Figure 8. Cross-validated lasso regression plots for (top) PC-1 and (bottom) PC-2 as the independent variables.The x-axis is the log of the shrinkage penalty, and the y-axis is the cross-validated mean squared error.The numbers at the top of the plot indicate the number of variables with non-zero coefficients at that shrinkage parameter.The dotted line denotes the cross-validated value with the lowest mean-squared error.

Figure 8 .
Figure 8. Cross-validated lasso regression plots for (top) PC-1 and (bottom) PC-2 as the independent variables.The x-axis is the log of the shrinkage penalty, and the y-axis is the cross-validated mean squared error.The numbers at the top of the plot indicate the number of variables with non-zero coefficients at that shrinkage parameter.The dotted line denotes the cross-validated value with the lowest mean-squared error.

Figure 8 .
Figure 8. Cross-validated lasso regression plots for (top) PC-1 and (bottom) PC-2 as the indep variables.The x-axis is the log of the shrinkage penalty, and the y-axis is the cross-validated squared error.The numbers at the top of the plot indicate the number of variables with no coefficients at that shrinkage parameter.The dotted line denotes the cross-validated value w lowest mean-squared error.

Figure 9 .Figure 9 .
Figure 9. Variable importance plots for random forest models with (A) PC-1 and (B) as the dependent variables.The y-axis denotes all the independent variables used i Figure 9.Variable importance plots for random forest models with (A) PC-1 and (B) PC-2 as the dependent variables.The y-axis denotes all the independent variables used in this study, while the x-axis is a measure mean decrease in Gini or how much the model error increases when a particular variable is randomly permuted or shuffled.

Figure A1 .
Figure A1.Cumulative variance explained by the leading two principal components of the annual maximum streamflow data across the Ohio River Basin.The two leading principal components explain ~39% and 16% of the total variance, respectively.