1. Introduction
In recent years, there has been an ever-growing need to study the processes that govern the availability of water resources. Many of these resources depend strongly on hydroclimatic conditions and processes that may be cross–correlated (such as temperature and dew point, or precipitation and wind speed) [1,2]. Several studies have focused on the relationships between these variables and have attempted to simulate them under varying conditions [3,4]. Furthermore, these processes are dominated by high variability over a vast range of scales [5]. It is therefore important to examine possible correlations among hydrological processes such as precipitation, temperature, wind speed and dew point, and the simplest way to evaluate them is to employ the Pearson linear cross–correlation coefficient.
However, hydroclimatic processes may not be independent of each other. In addition, they have been shown to exhibit long-range dependence [6,7], which is indicated by fluctuations on long time scales, enhanced patterns and high unpredictability. This explains the observed variability of these processes, and thus the uncertainty in estimations, while calling into question the use of classical statistical tests that assume serially independent values [8,9].
In this study, the impact of the length of a data series on the distribution of cross–correlation estimates is first examined using synthetic series of independent Gaussian variables, which are then compared to series with long-term persistence that resemble the variability of recorded hydroclimatic timeseries. Studying the cross–correlations of independently generated numbers provides a baseline for understanding how possibly dependent processes, such as hydroclimatic ones, behave under the same statistical analysis. Next, an innovative statistical test is constructed using a stochastic approach, which determines the upper and lower bounds of statistical significance for cross–correlations of series that exhibit long-range dependence. A key benefit of the proposed method is that it yields cross–correlations directly from the input timeseries, along with an estimate of their statistical significance, without the need to pre-whiten the data series, a process which could disrupt the stochastic properties of hydroclimatic processes.
For illustration, the cross–correlations among key hydrological-cycle processes from numerous global-scale observations are estimated. In turn, an exploratory data analysis including all the examined processes is performed to detect any patterns in their cross–correlations. Finally, using the proposed stochastic test, it is possible to determine which of the calculated cross–correlations can be assumed to be statistically significant.
2. Materials and Methods
We start by generating 10,000 series of standardized Gaussian random values (i.e., with zero mean and unit standard deviation), each with a length of 20 values. Then, the zero-lag cross–correlation between each pair of these series is calculated. Although the expected value is zero, since these variables are independent (and therefore uncorrelated), the estimated cross–correlation values are found to follow a bell-shaped distribution [10]. Furthermore, if the number of series is increased to 100,000, this bell-shaped distribution becomes even more evident.
However, upon increasing the length of every series, for example to 100, more of the estimated cross–correlation coefficients lie close to zero, resulting in a narrower distribution and thus lower variability of the estimates, as seen in Figure 1 (see also the results in [11]).
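The narrowing described above can be reproduced with a small simulation. The following is a minimal sketch, not the code used in this study: it draws independent pairs of series rather than forming every pair among 10,000 series, uses 5000 pairs to keep the run short, and the function names are illustrative assumptions.

```python
import math
import random
import statistics


def pearson(x, y):
    """Zero-lag Pearson cross-correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_cross_correlations(length, n_pairs, seed=0):
    """Cross-correlations of n_pairs independent pairs of N(0, 1) series."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_pairs):
        x = [rng.gauss(0.0, 1.0) for _ in range(length)]
        y = [rng.gauss(0.0, 1.0) for _ in range(length)]
        estimates.append(pearson(x, y))
    return estimates


# Compare the spread of the estimates for short and longer series.
r20 = simulate_cross_correlations(length=20, n_pairs=5000)
r100 = simulate_cross_correlations(length=100, n_pairs=5000)
spread20 = statistics.pstdev(r20)
spread100 = statistics.pstdev(r100)
print(f"n = 20:  mean = {statistics.fmean(r20):+.3f}, std = {spread20:.3f}")
print(f"n = 100: mean = {statistics.fmean(r100):+.3f}, std = {spread100:.3f}")
```

For independent normal samples the standard deviation of the estimator is 1/sqrt(n - 1), so the spread for n = 100 should be less than half of that for n = 20.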
Many studies in the literature have shown that most hydroclimatic processes are characterized by the so-called Hurst phenomenon, otherwise known as scaling, long-range dependence (LRD) or long-term persistence [6,12]. In this work, we focus on the effect of LRD on cross–correlations between hydroclimatic processes, quantified through the Hurst parameter. In simple terms, this parameter indicates the behavior of a process over long time scales. As the Hurst parameter increases and approaches its maximum value of 1, a timeseries of a long-range dependent process exhibits enhanced patterns as well as change, which leads to high uncertainty and unpredictability at large scales. Although these processes may deviate from Gaussianity (even at the annual scale), here we show results based on the hypothesis that this deviation is small or negligible. By performing a Monte-Carlo analysis and applying the symmetric moving average (SMA) generation algorithm [13], 1000 timeseries with a length of 20 years are generated, and their cross–correlations are estimated for various Hurst parameters (specifically, ranging from 0.5 to 0.95 with a step of 0.05), similarly to the methodology described in [14]. Several methods are available for estimating the Hurst parameter of a given timeseries. In this study, the preferred method is the classic rescaled-range analysis introduced in [12,15], but others can be selected, such as wavelet-based estimators [16] or a maximum likelihood estimator, as described in [17].
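As an illustration of the rescaled-range idea, the sketch below estimates the Hurst parameter as the slope of log(R/S) against log(window length). It is a simplified textbook variant and not necessarily the exact procedure of [12,15]: the doubling window sizes, the non-overlapping segments and the names `rescaled_range` and `hurst_rs` are assumptions made here for illustration.

```python
import math
import random


def rescaled_range(segment):
    """R/S statistic: range of cumulative deviations divided by the std."""
    n = len(segment)
    mean = sum(segment) / n
    cumulative, running = [], 0.0
    for value in segment:
        running += value - mean
        cumulative.append(running)
    spread = max(cumulative) - min(cumulative)
    std = math.sqrt(sum((v - mean) ** 2 for v in segment) / n)
    return spread / std if std > 0 else float("nan")


def hurst_rs(series, min_window=8):
    """Least-squares slope of log(mean R/S) against log(window length)."""
    n = len(series)
    log_w, log_rs = [], []
    window = min_window
    while window <= n // 2:
        # Average R/S over non-overlapping segments of this window length.
        values = [rescaled_range(series[i:i + window])
                  for i in range(0, n - window + 1, window)]
        values = [v for v in values if not math.isnan(v)]
        if values:
            log_w.append(math.log(window))
            log_rs.append(math.log(sum(values) / len(values)))
        window *= 2
    mw = sum(log_w) / len(log_w)
    mr = sum(log_rs) / len(log_rs)
    return (sum((a - mw) * (b - mr) for a, b in zip(log_w, log_rs))
            / sum((a - mw) ** 2 for a in log_w))


rng = random.Random(1)
white_noise = [rng.gauss(0.0, 1.0) for _ in range(4096)]
h_white = hurst_rs(white_noise)
print(f"Estimated H for white noise: {h_white:.2f}")
```

The series needs at least four times `min_window` values so that two or more window sizes enter the regression. The R/S estimator is known to be biased for short windows, so estimates from annual series of a few decades should be read with caution.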
A common practice advocated in the literature is to pre-whiten the timeseries before estimating their cross–correlations (e.g., [18]). This process entails a transformation of the two variables using a filter, with the reasoning that it removes any autocorrelation in the two variables, while retaining any linear relationships between them. Then, for two mutually independent series of length n, the empirical cross–correlation coefficient approximately follows a normal distribution with zero mean and standard deviation 1/sqrt(n), i.e., N(0, 1/sqrt(n)), if at least one of the series is a white-noise process [19]. However, the pre-whitening procedure distorts several stochastic properties and there is no apparent reason to apply it. In our case, the SMA algorithm has been used to generate series in such a way that no pre-whitening is required, thus avoiding the introduction of artifacts into the simulation. An alternative method is therefore proposed to determine the empirical distribution of the cross–correlation.
The distribution of the estimator of the cross–correlation between uncorrelated white-noise series is Gaussian, and it remains approximately Gaussian when the Hurst parameter is close to 0.5 (i.e., close to white noise). Therefore, we can determine the probability that a high cross–correlation value is estimated between uncorrelated samples. However, for series exhibiting LRD (i.e., H > 0.5), the resulting distribution of the estimator of the cross–correlation coefficient is shown to deviate strongly from Gaussianity, and becomes flatter than the Gaussian bell, corresponding to a higher kurtosis. In Figure 2, a comparison is made between the empirical distribution of the cross–correlations estimated from an ensemble of 100,000 normally distributed series with H = 0.5 and the one estimated from 5,000,000 timeseries with H = 0.9, with each series having the same length of 60 years.
Among several candidates, the generalized Gaussian distribution is selected to fit the cross–correlation estimates (Figure 2). It is a parametric family with the following probability density function [20]:

$$f(x) = \frac{\beta}{2\alpha\,\Gamma(1/\beta)} \exp\left[-\left(\frac{|x-\mu|}{\alpha}\right)^{\beta}\right] \qquad (1)$$

where μ denotes location, α denotes scale, Γ denotes the gamma function, and β is a shape parameter. When β = 2, this distribution corresponds to the normal one.
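The density of the generalized Gaussian family in its standard parameterization, f(x) = β / (2αΓ(1/β)) · exp(−(|x − μ|/α)^β), can be evaluated with a few lines of code. The sketch below also checks the β = 2 case numerically; the function names and the test values are illustrative assumptions.

```python
import math


def generalized_gaussian_pdf(x, mu=0.0, alpha=1.0, beta=2.0):
    """beta / (2 alpha Gamma(1/beta)) * exp(-(|x - mu| / alpha) ** beta)."""
    coefficient = beta / (2.0 * alpha * math.gamma(1.0 / beta))
    return coefficient * math.exp(-((abs(x - mu) / alpha) ** beta))


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution N(mu, sigma)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# With beta = 2 the family reduces to a normal density with
# standard deviation sigma = alpha / sqrt(2).
alpha = 0.3
for x in (-0.5, 0.0, 0.2):
    gg = generalized_gaussian_pdf(x, alpha=alpha, beta=2.0)
    nn = normal_pdf(x, sigma=alpha / math.sqrt(2.0))
    print(f"x = {x:+.1f}: generalized = {gg:.6f}, normal = {nn:.6f}")
```

In this parameterization the scale α is not the standard deviation: for β = 2 the two are related by σ = α/sqrt(2).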
A limitation of the selected distribution for the estimator of the linear cross–correlation is that it cannot accurately represent heavy power-law tails. A more advanced methodology for the estimation of the cross–correlation is described in [21], where an estimator is introduced based on the scale domain rather than the lag domain through the correlation function (as adopted in the current analysis), or the frequency domain through the power spectrum (see discussion and comparisons in [22]). Nevertheless, it is considered important to perform this analysis using the classical estimator of the linear Pearson cross–correlation, which is the most widely applied in the literature, to highlight and assess its increased variability in the presence of long-range dependence.
Besides the Hurst parameter (H), the length (n) of the series is also expected to have a great impact on the cross–correlation estimates. To determine the influence of both parameters on the statistical significance of the cross–correlation estimates, multiple synthetic series of normally distributed processes are generated using the SMA algorithm by varying both n and H. Specifically, in this test, the series lengths range from 10 to 100 (with a step of 10), and the Hurst parameter ranges from 0.5 to 0.95 (with a step of 0.05). For each combination, 5,000,000 synthetic timeseries are generated, the cross–correlation between them is calculated, and the resulting distributions are compared (for illustration, see Figure 3) to the distribution of cross–correlation estimates generated from a white-noise process (i.e., H = 0.5) with the same length.
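The effect of H on the spread of the estimates can be illustrated with a much smaller experiment. The sketch below does not implement the SMA algorithm; as a plainly different, swapped-in technique it generates Gaussian series with the autocovariance of fractional Gaussian noise through a Cholesky factorization. The sample size of 2000 pairs, the length n = 30 and all function names are assumptions chosen to keep the example short.

```python
import math
import random


def fgn_autocovariance(lag, hurst):
    """Autocovariance of unit-variance fractional Gaussian noise."""
    k = abs(lag)
    return 0.5 * ((k + 1) ** (2 * hurst) - 2 * k ** (2 * hurst)
                  + abs(k - 1) ** (2 * hurst))


def cholesky_factor(n, hurst):
    """Lower-triangular L such that L L^T is the n x n covariance matrix."""
    cov = [[fgn_autocovariance(i - j, hurst) for j in range(n)]
           for i in range(n)]
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(cov[i][i] - s)
            else:
                low[i][j] = (cov[i][j] - s) / low[j][j]
    return low


def sample_series(low, rng):
    """One correlated series: L applied to independent N(0, 1) values."""
    z = [rng.gauss(0.0, 1.0) for _ in range(len(low))]
    return [sum(low[i][k] * z[k] for k in range(i + 1))
            for i in range(len(low))]


def pearson(x, y):
    """Zero-lag Pearson cross-correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def upper_quantile(n, hurst, n_pairs=2000, q=0.975, seed=0):
    """Empirical q-quantile of r between independent series with the same H."""
    rng = random.Random(seed)
    low = cholesky_factor(n, hurst)
    r = sorted(pearson(sample_series(low, rng), sample_series(low, rng))
               for _ in range(n_pairs))
    return r[int(q * (n_pairs - 1))]


quantiles = {h: upper_quantile(30, h) for h in (0.5, 0.7, 0.9)}
for h, value in quantiles.items():
    print(f"H = {h}: 97.5% quantile of r for n = 30 is {value:.3f}")
```

Under these assumptions the upper quantile grows with H, so a cross–correlation that would look significant under white noise may lie within the limits of two independent but persistent series.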
From this comparison, the influence of the Hurst parameter on the distributions becomes even more apparent. Initially, a small Hurst parameter (indicating a weak LRD) corresponds to a cross–correlation distribution that is nearly Gaussian, while as the Hurst parameter increases, the distribution of the cross–correlation estimations becomes flatter (i.e., the kurtosis is increased).
Based on the above analysis, it is possible to determine the upper and lower bounds of statistical significance for any estimated cross–correlation coefficient, using a method similar to the t-test [23], but also taking into account the LRD. This behavior can be introduced in a statistical significance test through the generalized Gaussian distribution. Specifically, under the null hypothesis (H0) that the two long-range dependent processes are uncorrelated, an estimated cross–correlation coefficient that falls within the confidence limits of the generalized Gaussian distribution is not considered statistically significant, whereas one that falls outside these limits is considered statistically significant (alternative hypothesis, H1).
For the determination of the confidence limits, an expression is constructed relating the quantile c of the linear cross–correlation coefficient, the length n of the sample, and the Hurst parameter H of the process (see Figure 4 and Figure 5), i.e.,
where H is the Hurst parameter, q is the level of confidence, n is the length of the sample and p1, p2, p3 are coefficients that can be selected from Table 1. It is noted that R² > 0.99 in all expressions. An important remark is that the above expressions correspond to Hurst values between 0.5 and 0.9, while values outside these limits could lead to erroneous extrapolations.
After a thorough analysis of the behavior of normal random values and the definition of the stochastic test, real-world timeseries of hydroclimatic processes can be studied. From a global-scale database of the National Oceanic and Atmospheric Administration (NOAA) containing more than 15,000 land-based stations [24], the timeseries of temperature, wind speed and dew point are extracted from approximately 7500 stations that were still operational in 2018 (i.e., the access year). Most of these timeseries have a three-hour resolution, whereas some stations have included 30 min resolution observations in recent years. To select high-quality stations, only those with twenty or more years of data are included in the analysis. Subsequently, all the extracted timeseries are aggregated to the annual resolution, while a year that contains fewer than 300 days of values is treated as missing. This choice is made to allow a more realistic comparison and to reduce the uncertainty caused by large gaps in some timeseries [
5]. Finally, the zero-lag cross–correlations are estimated for stations containing all three annual timeseries (i.e., temperature, wind speed and dew point). After all filters are applied, the final number of analyzed stations is 2090. Of these 2090 stations, 1479 contain thirty or more years of data, which is ideal for longer-term data analysis. However, the coverage of these older stations is mainly limited to Europe and North America, leaving out many more recently built stations in regions such as Africa, Australia and the Southern Pacific Ocean. Therefore, we proceed with all 2090 stations, as they are more widely spread across the globe. Furthermore, for wind speed specifically, the input stations report both speed and direction. Wind direction was omitted from our analysis for simplicity.
To estimate cross–correlations between precipitation and the abovementioned processes, we employ NOAA’s database containing approximately 100,000 operational land-based stations with daily precipitation measurements [
24]. From the latter, we utilize 66,000 daily stations that have more than 20 years of data and aggregate them to annual-resolution timeseries, applying the same quality control as previously described. However, the precipitation timeseries are generally recorded at stations other than those used above. Therefore, for the estimation of cross–correlations, we identify pairs of stations in proximity by implementing the following algorithm. A precipitation station and a station measuring temperature, wind speed and dew point are assumed to be within the same region when they are located within a maximum distance of 0.5 geographical degrees of each other, and the relative elevation difference between them is as small as possible. These two criteria are expressed as percentages of their maximum possible values and then combined into an index. Naturally, the lowest score of this index indicates the optimum station pair. For 1032 out of the 2090 stations measuring temperature, wind speed and dew point, there is a corresponding precipitation measurement station at the same location. For the remaining 1058, the above-mentioned algorithm identifies the corresponding precipitation measurement station. For a visualization of the typical distances and elevation differences for these station pairs, see Figure 6.
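The pairing rule can be sketched as follows. The text does not specify how the two normalized criteria are combined or which maximum elevation difference is used, so the equal-weight sum, the value of `MAX_ELEVATION_DIFF`, the use of a plain Euclidean distance in degrees and the station data are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Station:
    name: str
    lat: float        # degrees
    lon: float        # degrees
    elevation: float  # metres


MAX_DISTANCE_DEG = 0.5      # search radius stated in the text
MAX_ELEVATION_DIFF = 500.0  # hypothetical normalizing value, in metres


def pairing_index(a, b):
    """Equal-weight sum of normalized distance and elevation difference.

    Returns None when the two stations are farther apart than the radius.
    """
    distance = math.hypot(a.lat - b.lat, a.lon - b.lon)
    if distance > MAX_DISTANCE_DEG:
        return None
    elevation_diff = abs(a.elevation - b.elevation)
    return distance / MAX_DISTANCE_DEG + elevation_diff / MAX_ELEVATION_DIFF


def best_match(target, candidates):
    """Candidate with the lowest index, or None if none is within range."""
    scored = [(pairing_index(target, c), c) for c in candidates]
    scored = [(s, c) for s, c in scored if s is not None]
    return min(scored, key=lambda item: item[0])[1] if scored else None


# Hypothetical stations: A is closest but 500 m higher, B is slightly
# farther at a similar elevation, C lies outside the search radius.
target = Station("T", 40.0, 22.0, 100.0)
candidates = [
    Station("A", 40.1, 22.0, 600.0),
    Station("B", 40.3, 22.1, 120.0),
    Station("C", 41.0, 22.0, 100.0),
]
print(best_match(target, candidates).name)
```

With these hypothetical values the nearest station is not selected, because its elevation difference outweighs its shorter distance.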
For the cross–correlations between precipitation and temperature, each process has a different H parameter, and thus, it is necessary to adapt the stochastic test for variables with different H parameters. Using the same model, the simulated series with lengths of 100,000 are generated by selecting the H parameters as 0.6 for precipitation and 0.8 for temperature. The confidence limits of these simulations for different lengths are compared with the ones obtained for the series with equal H parameters, 0.6 and 0.8, respectively (see results in Figure 7). Finally, Table 2 contains the a and b parameters corresponding to Equation (2), as calculated from the analysis, for various confidence intervals.
4. Discussion and Conclusions
From a global-scale analysis of cross–correlations of hydroclimatic processes including precipitation, temperature, wind speed and dew point, the only consistent emerging pattern is the strong positive correlation estimated between temperature and dew point. Generally, a moderately positive cross–correlation is observed around arid areas, and a strongly positive one near the seafront. However, there are locations where this cross–correlation is zero or even negative; these occurrences could be related to microclimates or to large variability that increases the uncertainty of the estimations. Case-by-case studies are required wherever statistically significant outliers are noted, and the analysis of similar variables could then help explain the causes of a statistically significant cross–correlation. However, when conducting research on these cases, care must be taken to avoid regional or seasonal biases.
Inaccuracies in measurements from the various meteorological stations are expected, but due to the large amount of data and the applied quality filters, they are considered to have a small impact. The availability of data has room for growth: for stations measuring temperature, wind speed and dew point, there are currently only a few records longer than 30 years without significant gaps, which is lower than desired, especially for processes such as these that exhibit long-range dependence.
Comparisons between the other processes show mild cross–correlations, with the global mean estimated close to zero. This does not mean that the processes are uncorrelated, but only that there is no evidence of a global pattern. On the contrary, significant cross–correlations between any pair of hydroclimatic processes may occur, but the results may be regional and require further research to assess their statistical significance and spatial dependence. Specifically for wind speed, an important factor that could be expanded upon is the change in direction; this was not included in this study for simplicity but could be important. For example, changes in wind direction could be linked to extreme weather events, indicated by abrupt changes in precipitation or temperature. Another factor that could be expanded upon is the timescale; the available timeseries for temperature, wind speed and dew point have resolutions quite different from that of precipitation, at approximately 30 min to three hours for some stations. This study focuses on identifying long-term relationships, but there is merit in analyzing hydroclimatic processes on clusters of shorter timescales to study extreme events such as storms. Of course, this would probably require an in-depth study on a location-by-location basis.
Since these processes are not serially independent, standard tests such as the t-test cannot be applied correctly. Instead, a stochastic approach based on Monte-Carlo analysis, such as the one introduced in this study, is considered more robust for handling hydroclimatic timeseries. This method can also estimate the statistical significance of a correlation between processes even if the available timeseries are relatively short (but no shorter than 20 years). Furthermore, this stochastic test can derive conclusions without requiring the pre-whitening of the timeseries, a procedure which requires careful consideration in order to be applied correctly, and which can exacerbate statistical flaws and cause variance inflation. The proposed stochastic test shows that, for uncorrelated processes, a high estimated cross–correlation has a low probability of occurring, especially for long samples. Thus, any prominent recurrences resulting from a local analysis can be considered statistically significant if the resulting cross–correlations are higher than the values indicated by the stochastic test for a selected confidence level.
Finally, this study indicates that extreme caution must be exercised when attempting to derive robust conclusions from small samples of processes that are known to exhibit long-range dependence, such as the hydroclimatic ones. Ignoring the enhanced natural variability of hydroclimatic processes and conducting classical statistical tests that disregard long-range dependence may lead to flawed results.