Investigating Water Quality Data Using Principal Component Analysis and Granger Causality

: This work investigates the inter-relationships among stream water quality indicators, hydroclimatic variables (e.g., precipitation, river discharge), and land characteristics (e.g., soil type, land use), which is crucial to developing effective methods for water quality protection. The potential of using statistical tools, such as Principal Component (PC) and Granger causality analyses, for this purpose is assessed across 10 watersheds in the Eastern United States. The PC analysis shows consistency across the ten locations, with most of the variation explained by the first two PCs, except for the least developed watershed that presents three PCs. Results show that stronger Granger causality relationships and correlation coefficients are identified when considering a lag of one day, compared to longer lags. This is mainly due to the watersheds’ limited size and, thus, their fast hydrological response. The strongest Granger causalities are observed when water temperature and dissolved oxygen concentration are considered as the effect of the other variables, which corroborates the importance of these two water properties. This work also demonstrates how watershed size and land use can impact causalities between hydrometeorological variables and water quality, thus, highlighting how complex these relationships are even in a region characterized by overall similar climatology. interest because of two main reasons: (i) it is one of the most rapidly growing urban coastal areas in the United States and (ii) its tributaries feed into the Chesapeake Bay, listed as an impaired water body according to the Clean Water Act. Thus, investigating how water quality relates to the local hydrometeorology and urbanization is crucial to developing effective and sustainable methods for water quality protection. An extensive pre-processing is applied to the collected data, as a first step to fill in any missing observations in the time series and to remove any cyclic pattern and ensure stationarity, before applying the PCA and testing for Granger causality. Results from the PCA show that most of the variation in each watershed can be explained by only consid- ering the first two principal components, except for watershed 4, which is the least devel- oped watershed (almost entirely covered with forest) among all watersheds. This may be due to the highly non-linear hydrological processes ongoing in a more natural environ- ment, where the relationship between precipitation, discharge, and water quality is com- plicated by the presence of dense vegetation that intercepts rainfall, slows down infiltra- tion, and withholds a portion of stormwater runoff and the pollutants it carries. The se- lection of important indicators is based on the absolute value of loading. Higher loadings point to a strong relationship between an indicator and specific component. The most im- portant variables identified by the PCA in all watersheds are turbidity, water temperature, and air temperature. Furthermore, while discharge is an important variable in highly ur- banized watersheds, precipitation is more fundamental in less urbanized watersheds with higher soil infiltration rates and therefore lower volumes of runoff. Results from the Granger causality analysis show how different lag times (1, 2, and 3 days) affect the causality relation between hydroclimatic variables and water quality in- dicators. In general, lag 1 shows more and stronger Granger causality relationships in comparison to lags 2 and 3. This is due to the limited size of the basins (characterized by hydrological responses faster than 2 days), but also to the fact that most watersheds are highly developed and therefore characterized by low soil infiltration rates (which corre- sponds to relatively short retention times). The strongest Granger causalities are observed when water temperature and dissolved


Introduction
Water quality information is essential to protect lives and manage water resources effectively [1]. This requires state-of-the-art collection procedures [2] and in-depth analysis skills [3,4]. However, data scarcity in Earth science is a well-known problem because of the high cost of monitoring systems and low reliability of the measurements. Across the United States (U.S.), the U.S. Geological Survey's (USGS) National Water Information System (NWIS) is in charge of acquiring, processing, and storing water quality data [5]. Analyzing and interpreting these data is complicated by several factors. First, water quality data are complex in nature, as multiple water quality indicators are commonly combined to holistically characterize the condition of streams, lakes, and groundwater (including pH, dissolved oxygen level, nutrients' concentration, among others). Second, water quality changes both in time-and is often affected by seasonality-and space, influenced by topography, urban effluents, farm waste, fertilizer runoffs, and industrial waste discharge. Third, different water bodies may be affected by different environmental issues. For instance, the concentration of nutrients such as nitrates and phosphates may be more problematic in lakes during summertime, because of the risk eutrophication. On the other hand, conductivity may be an indicator for control in rivers where fishery is a predominant activity, since conductivity is an indirect measure of the saltiness of the water and freshwater may not tolerate large increases in saltiness.

Study Area and Dataset
This study investigates ten watersheds across the District of Columbia, Maryland, and Virginia (also known as the DMV region), one of the most rapidly growing urban coastal areas in the United States [24]. Its population growth is still on the rise and is predicted by the Northern Virginia Regional Commission to continue to at least 2040 [25]. Its proximity to the coast makes the region vulnerable to hydro-meteorological hazards exacerbated by sea level rise and increased urban development. Population growth and multiple climatic stressors call for comprehensive plans that increase the community's resilience to ensure clean water to the entire population. Moreover, excessive nutrient loads in the Chesapeake Bay area and its tidal tributaries often resulted in eutrophication in the past years [26]. The recovery is slow and the Bay is listed as an impaired water body according to the Clean Water Act [27]. As a result, monitoring water quality in the area has been of interest to many researchers and engineers [28]. Figure 1 displays the location of the 10 watersheds located across the DMV selected for this study. Moreover, a USGS station is located at the discharge point (or outlet) of each watershed. Watersheds and their associated USGS IDs (Hydrological unit code) are listed based on their size from large to small in the legend of Figure 1. In the discussion of results, the first three watersheds are considered large, watersheds 4, 5, and 6 medium, and the last four are classified as small. Watershed areas vary from 7 to 168 km 2 with an average of 50 km 2 and a standard deviation of 56.8 km 2 .
Watershed information, streamflow discharge, and water quality data are collected from USGS [5] during the period January 2010-May 2019. However, 9 years of hydrological readings are not available for every single watershed. All data used in this study along with their sources and description are listed in Table 1. These include watershed characteristics like area, land cover, and soil type. Precipitation data are obtained from the North America Land Data Assimilation Systemb (NLDAS) [28]. The four water quality indicators are chosen based on their availability across the 10 watersheds and include water temperature (WT), dissolved oxygen (DO), turbidity (Tu), and specific conductivity (K). DO concentration is a measure of how much oxygen is dissolved in water, turbidity is a measure of the relative clarity of water, and specific conductivity is the ability of a solution to conduct electricity.  Figure 1. Location, area, and extension of the 10 watersheds selected for the study in the DMV region. Figure 1. Location, area, and extension of the 10 watersheds selected for the study in the DMV region. the small watersheds (7-10) show a prevalence of soil type C with slow infiltration. Information regarding the land use and soil type can be particularly useful when interpreting relationships among water quality indicators and environmental characteristics, as both land use and soil type impact infiltration rates, and, as a consequence, both streamflow and stormwater runoff and the contaminants they carry (thus affecting water quality).

Methodology
The flowchart in Figure 2 maps out the methodological process followed in this work. A set of pre-processing steps is required before applying the PCA and Granger causality analysis to the collected data, as described in the next sub-section.

Data Pre-Processing
Before applying the PCA and Granger causality analysis to the collected data, there are a few pre-processing steps that need to be considered. First off, missing observations in time series are very common and many methods are available to overcome this issue [29]. Many time series analysis techniques assume there is no gap in data frame, and the analysis of incomplete time series may result in biased results. Thus, in this work, the missing values are imputed with the median of the nearest values to obtain continuous time series.
Secondly, time series of environmental variables often show complex cyclic patterns. Some of the data considered here exhibit daily and seasonal patterns. For instance, as shown in Figure 3 for the Difficult Run watershed near Vienna, VA during a 6-year period (June 2011-August 2017), both water temperature and DO present a strong seasonal pattern. Moreover, such variables present a diurnal cycle (temperature is higher during the day and lower at night). On the other hand, other variables, like Tu and K, are more closely related to streamflow (which is driven by precipitation events) and do not present any strong temporal cycle in their timeseries ( Figure 3).

Data Pre-Processing
Before applying the PCA and Granger causality analysis to the collected data, there are a few pre-processing steps that need to be considered. First off, missing observations in time series are very common and many methods are available to overcome this issue [29]. Many time series analysis techniques assume there is no gap in data frame, and the analysis of incomplete time series may result in biased results. Thus, in this work, the missing values are imputed with the median of the nearest values to obtain continuous time series.
Secondly, time series of environmental variables often show complex cyclic patterns. Some of the data considered here exhibit daily and seasonal patterns. For instance, as shown in Figure 3 for the Difficult Run watershed near Vienna, VA during a 6-year period (June 2011-August 2017), both water temperature and DO present a strong seasonal pattern. Moreover, such variables present a diurnal cycle (temperature is higher during the day and lower at night). On the other hand, other variables, like Tu and K, are more closely related to streamflow (which is driven by precipitation events) and do not present any strong temporal cycle in their timeseries ( Figure 3). [29]. Many time series analysis techniques assume there is no gap in data frame, and the analysis of incomplete time series may result in biased results. Thus, in this work, the missing values are imputed with the median of the nearest values to obtain continuous time series.
Secondly, time series of environmental variables often show complex cyclic patterns. Some of the data considered here exhibit daily and seasonal patterns. For instance, as shown in Figure 3 for the Difficult Run watershed near Vienna, VA during a 6-year period (June 2011-August 2017), both water temperature and DO present a strong seasonal pattern. Moreover, such variables present a diurnal cycle (temperature is higher during the day and lower at night). On the other hand, other variables, like Tu and K, are more closely related to streamflow (which is driven by precipitation events) and do not present any strong temporal cycle in their timeseries ( Figure 3).  In order to identify relationships among variables that are independent of any diurnal or seasonal patterns, it is recommended to remove any cyclic behavior from the original data. Thus, we fit temperature and DO timeseries with a Fourier transformation, which is subsequently subtracted by the initial time series. A Fourier series f (x) is defined according to Tolstov [30] as: is a periodic function with period 2p, defined within (−p, p), and n is the number of cycles. Residuals at each time t (R t ) are calculated as follows: where S t is an observation at time t and F t is the fitted Fourier series. As a third pre-processing step, stationarity needs to be tested, as Granger causality adopted in this study assumes stationary time series. A time series is stationary if its statistical properties (e.g., mean, variance, and autocorrelation) do not change over time. One common test to assess stationarity is the Augmented Dickey Fuller (ADF) test, which is the extended version of the Dickey Fuller (DF) [31] test and allows for higher order autoregressive processes: autoregressive process, and ε is the magnitude of the random error, i.e., the white noise [32]. θ is a constant and if it is equal to zero, then the time series is not stationary. If the ADF test is not cleared, methods to remove non-stationary trends need to be considered.

PCA
PCA is a data reduction technique that transforms a dataset into a new set of variables, the principal components (PCOMs), which are a linear combination of the original variables. The main goal of PCA is to maintain the original variation of the data [33], while creating an uncorrelated dataset. It also reveals patterns that might not be apparent using common analysis and graphic techniques. As a first step, the covariance matrix is calculated. If X is the original dataset in a matrix format, with m rows (which account for different measurements of a specific attribute) and n columns (which represent the attributes), then the covariance matrix C x is: Next, eigenvalues and eigenvectors are computed. The eigenvector → v is defined as: where λ is a scalar value, i.e., the eigenvalue. The following equations show steps to solve for eigenvalue and eigenvector: where I is the identity matrix of the same dimension as C x . As a result, each eigenvector is produced by each λ times → v which is called principal component. The number of principal components is equal to the dimension of the dataset, however, PCA loads the maximum possible information in the first component, the maximum remaining information in the second component, and so on. The number of PCs is usually based on the number of eigenvalues greater than 1 [34]. The ratio between the eigenvalue of a component and the sum of the eigenvalues shows the percent of variance of in the original dataset represented by that component.

Granger Causality
The notion of Granger causality was introduced by Granger [35] and soon found application in many fields (e.g., economics) because of its simplicity and robustness [21]. For this study, we adopt the first-order Granger causality test, which investigates the linear causal interaction between time series of data. The causal relation exists if the following two conditions are fulfilled: (i) the cause precedes the effect; and (ii) the cause contains information about the effect that is not available in other variables. As mentioned above, one main assumption to test Granger causality is the stationarity of the time series. The bi-variate Granger causality between two stationary time series (X and Y) is formulated as follows: where a and b are coefficient (b = 0) and ε is white noise. In this case, variable X Granger causes variable Y.
It is important to mention that Granger causality measures precedence and information content, rather than the effect or the result. Granger causality tries to answer the question of how much of the current variable can be explained by the past values of a different values and whether adding lagged values can improve such explanation [36].

Pre-Processing
After imputing all the hydrometeorological variables (i.e., the four water quality indicators, air temperature, precipitation, and discharge), the diurnal and seasonal patterns in water temperature and DO were removed by fitting a Fourier series and calculating the residuals in each watershed. Figure 4 shows time series DO and water temperature in watershed 10 as an example. The DO and water temperature readings started from June of 2011 to August of 2017, which results in six cold weather seasons and almost eight hot weather seasons. Such effects together with any diurnal cycle were removed in the residuals, which are used in any further analysis. Besides DO and water temperature, no other variable presented a strong cyclic pattern and therefore did not undergo the de-cycling process.

Pre-Processing
After imputing all the hydrometeorological variables (i.e., the four water quality indicators, air temperature, precipitation, and discharge), the diurnal and seasonal patterns in water temperature and DO were removed by fitting a Fourier series and calculating the residuals in each watershed. Figure 4 shows time series DO and water temperature in watershed 10 as an example. The DO and water temperature readings started from June of 2011 to August of 2017, which results in six cold weather seasons and almost eight hot weather seasons. Such effects together with any diurnal cycle were removed in the residuals, which are used in any further analysis. Besides DO and water temperature, no other variable presented a strong cyclic pattern and therefore did not undergo the de-cycling process. Next, the ADF test is applied to imputed time series of precipitation, discharge, turbidity, specific conductivity, DO, air temperature, and water temperature to check if such datasets are stationary. Results show that residuals of DO and water temperature are stationary at a 95% confidence level. Precipitation, discharge, turbidity, specific conductivity, and air temperature also pass the ADF test, at an even higher (99%) confidence level.
Before investigating the PCA and Granger causality, some preliminary analyses are Next, the ADF test is applied to imputed time series of precipitation, discharge, turbidity, specific conductivity, DO, air temperature, and water temperature to check if such datasets are stationary. Results show that residuals of DO and water temperature are stationary at a 95% confidence level. Precipitation, discharge, turbidity, specific conductivity, and air temperature also pass the ADF test, at an even higher (99%) confidence level.
Before investigating the PCA and Granger causality, some preliminary analyses are performed. Specifically, residuals of water temperature are plotted against residuals of DO for three different lags (1 day, 2 days, and 3 days). For the sake of brevity, one large (1), one medium (4), and one small watershed (7) are shown in Figure 5. In other words, for lag 1, we assess the relationship between the water temperature of yesterday (or t-1) and the DO of today (at time t). Similarly, for lags 2 and 3, the temperature recorded at t-2 and t-3 is linked to the DO observed at time t. As expected, DO and water temperature are negatively correlated: when water temperature rises, the DO concentration drops. What is interesting though is that such a relationship is even stronger in large and medium watersheds relative to smaller watersheds ( Figure 5). This may be due to the lower number of readings in the small watersheds where data were mainly collected during the cold months. As shown in Figure 5, the correlation decreases when the lag increases. That is, the relationship between today's DO and yesterday's temperature is stronger than the one with the temperature measured 2 or more days ago. for lag 1, we assess the relationship between the water temperature of yesterday (or t-1) and the DO of today (at time t). Similarly, for lags 2 and 3, the temperature recorded at t-2 and t-3 is linked to the DO observed at time t. As expected, DO and water temperature are negatively correlated: when water temperature rises, the DO concentration drops.
What is interesting though is that such a relationship is even stronger in large and medium watersheds relative to smaller watersheds ( Figure 5). This may be due to the lower number of readings in the small watersheds where data were mainly collected during the cold months. As shown in Figure 5, the correlation decreases when the lag increases. That is, the relationship between today's DO and yesterday's temperature is stronger than the one with the temperature measured 2 or more days ago. Correlation coefficients for each case are also shown.

PCA
The PCA analysis is performed on hydrometeorological variables (precipitation, discharge, T, WT, DO, Tu, K) in every watershed. In most watersheds, the first two principal components explain most of the variance, except for watershed #4 that presents three principal components ( Figure 6). Interestingly, this watershed is the least developed among all the ones considered in the study with almost 78% of its entire area covered by forest. Hydrological processes in a more natural environment are highly non-linear, because the relationship between precipitation, discharge, and water quality is complicated by the presence of dense vegetation that intercepts rainfall, slows down infiltration, and withholds a portion of runoff and the pollutants it carries. Furthermore, most watersheds show similar behavior, with between 30% and 50% of the variance explained by the first component and between 20% and 30% explained by the second component. As a result, PCOM 1 and 2 cumulatively explain more than 60% of the variation in every watershed.

PCA
The PCA analysis is performed on hydrometeorological variables (precipitation, discharge, T, WT, DO, Tu, K) in every watershed. In most watersheds, the first two principal components explain most of the variance, except for watershed #4 that presents three principal components ( Figure 6). Interestingly, this watershed is the least developed among all the ones considered in the study with almost 78% of its entire area covered by forest. Hydrological processes in a more natural environment are highly non-linear, because the relationship between precipitation, discharge, and water quality is complicated by the pres-ence of dense vegetation that intercepts rainfall, slows down infiltration, and withholds a portion of runoff and the pollutants it carries. Furthermore, most watersheds show similar behavior, with between 30% and 50% of the variance explained by the first component and between 20% and 30% explained by the second component. As a result, PCOM 1 and 2 cumulatively explain more than 60% of the variation in every watershed. are positively correlated. For instance, the loadings associated with WT and DO always show opposite signs, confirming the well-known negative correlation between the two variables. Stronger loadings correspond to variables that contribute more to the variation in the original dataset. DO and discharge usually are associated with strong loadings (greater than 0.5) in PCOM 1 (except for watersheds 3 and 7), showing that they carry fundamental information that cannot be disregarded in most of the watersheds. In PCOM 2, Tu, K, and precipitation appear to be the most prominent variables (with larger loadings), thus, identifying a set of variables that are also significant sources of variability in the dataset. The PCA results indicate that in small watersheds the most relevant variables (i.e., the ones that explain most of the variance) are Tu, discharge, DO, T, and WT. Small watersheds are also the most urbanized watersheds with a low soil infiltration (i.e., soil type C). As a result, a higher volume of water travels in shorter amount of time in these watersheds. On the other hand, in medium watersheds, most of the variation is explained by Tu, precipitation, DO, T, and WT. Medium watersheds are the least urbanized watersheds in comparison to the others with higher infiltration rates (i.e., soil type B) and the largest forest land cover. This corresponds to lower volumes of runoff and higher chance of infiltration in comparison to the small watersheds. In summary, discharge plays a more important role in small urbanized watersheds, whereas precipitation plays that role in medium watersheds. Similarly, in large watersheds, most of the variability in data is explained by Tu, discharge, T, and WT. In terms of land cover, large watersheds are similar to the small watersheds, i.e., highly urbanized with lower percentage of forest. As a result, discharge plays a more important role than precipitation, which once again is in line with the fact that the more urbanized areas tend to have higher volume of runoff in a shorter period of time.   Table 3 shows how individual hydrometeorological variables contribute to each principal component. If the loading associated with one variable is positive and the loading associated with another variable is negative, then those two variables are negatively correlated. Vice versa, if the loadings associated with two variables have the same sign, they are positively correlated. For instance, the loadings associated with WT and DO always show opposite signs, confirming the well-known negative correlation between the two variables. Stronger loadings correspond to variables that contribute more to the variation in the original dataset. DO and discharge usually are associated with strong loadings (greater than 0.5) in PCOM 1 (except for watersheds 3 and 7), showing that they carry fundamental information that cannot be disregarded in most of the watersheds. In PCOM 2, Tu, K, and precipitation appear to be the most prominent variables (with larger loadings), thus, identifying a set of variables that are also significant sources of variability in the dataset.
The PCA results indicate that in small watersheds the most relevant variables (i.e., the ones that explain most of the variance) are Tu, discharge, DO, T, and WT. Small watersheds are also the most urbanized watersheds with a low soil infiltration (i.e., soil type C). As a result, a higher volume of water travels in shorter amount of time in these watersheds. On the other hand, in medium watersheds, most of the variation is explained by Tu, precipitation, DO, T, and WT. Medium watersheds are the least urbanized watersheds in comparison to the others with higher infiltration rates (i.e., soil type B) and the largest forest land cover. This corresponds to lower volumes of runoff and higher chance of infiltration in comparison to the small watersheds. In summary, discharge plays a more important role in small urbanized watersheds, whereas precipitation plays that role in medium watersheds. Similarly, in large watersheds, most of the variability in data is explained by Tu, discharge, T, and WT. In terms of land cover, large watersheds are similar to the small watersheds, i.e., highly urbanized with lower percentage of forest. As a result, discharge plays a more important role than precipitation, which once again is in line with the fact that the more urbanized areas tend to have higher volume of runoff in a shorter period of time.

Granger Causality
Granger causality assesses whether one variable at time t-lag causes another variable at time t. In our analysis, we considered three different lags (i.e., 1 day, 2 days, and 3 days). This choice was dictated by the fact that most of the watersheds have a fast hydrological response, due to the fact that they are limited in size, highly developed, and characterized by low to very low soil infiltration rates. Thus, going beyond a 3-day lag would not be recommended. The Granger causality test is performed on each of the four water quality indicators (WT, DO, Tu, K), which are considered the effects, and all the hydrometeorological variables (WT, DO, Tu, K, T, precipitation, and discharge) considered as possible causes. The null hypothesis is defined as follows: there is no Granger causality between the cause and effect. Thus, lower p-values correspond to stronger causality and vice versa. Figure 7 shows p-values for the Granger causality test, when WT is considered as the effect. The Granger causality relationship is strong in any lag for all the variables. Water temperature is known to be an important physical property and any change in the other variables can impact it. When discharge is the cause, the variability around median p-value is larger in lag 1 and gets smaller when moving to lag 3, showing that such relationship is not as strong in all the watersheds Nevertheless, this uncertainty is not due to uncertainty in the precipitation relationship, which shows very low p-values. Therefore, this may be due to the fact that the WT is highly dependent on the air temperature and precipitation rather than discharge.
When DO is considered as the effect, as shown in Figure 8, the Granger causality relationship is very similar to the previous case, where WT is Granger caused by the rest of the variables. This means that the streamflow and environmental conditions of the previous days impact the amount of dissolved oxygen in the water today. However, higher variability is observed around K, which can be explained by the fact that the ability of water to conduct an electrical current does not impact the amount of dissolved oxygen directly. time is short in the studied watersheds. In addition, the variability around the median pvalue decreases in WT and DO when the lag increases, but the opposite happens for precipitation and K. This means that past values of WT and DO (values from 2 and 3 days ago) impact the Tu of today, which is nonetheless not impacted by the precipitation and K of 2 or 3 days ago. This may be due to the fact that the variation in WT and DO values is not as significant as the variation in precipitation and K.   time is short in the studied watersheds. In addition, the variability around the median pvalue decreases in WT and DO when the lag increases, but the opposite happens for precipitation and K. This means that past values of WT and DO (values from 2 and 3 days ago) impact the Tu of today, which is nonetheless not impacted by the precipitation and K of 2 or 3 days ago. This may be due to the fact that the variation in WT and DO values is not as significant as the variation in precipitation and K.   When Tu is the effect (Figure 9), the Granger causality relationships are strong (i.e., low p-values) when WT and DO are the causes in all lags. This means that the values of WT and DO measured 1, 2, and 3 days ago influence the number of suspended particles in the water today. Granger causality relationship between T, precipitation, discharge, and K is strong at lag 1, but weaker for longer lags, which is expected since the retention time is short in the studied watersheds. In addition, the variability around the median p-value decreases in WT and DO when the lag increases, but the opposite happens for precipitation and K. This means that past values of WT and DO (values from 2 and 3 days ago) impact the Tu of today, which is nonetheless not impacted by the precipitation and K of 2 or 3 days ago. This may be due to the fact that the variation in WT and DO values is not as significant as the variation in precipitation and K.
Water 2021, 13, x FOR PEER REVIEW 14 of 19 Figure 9. Same as in Figure 7, but for Tu as an effect in the Granger causality relationship.
When K is Granger caused by the rest of the variables, as shown in Figure 10, the relationship is very strong when WT and DO are the cause. This can be easily explained, since WT influences the ability of water to conduct electrical current and DO is highly related to WT, as discussed above. The Granger relationships are more uncertain when the other variables are considered as the causes, as shown by the wider box plots. This demonstrates that the ability of water to conduct electrical current is not as closely linked to variables like precipitation, air temperature (that is more variable than water temperature), discharge, and K at least at lags between 1 and 3 days. Nevertheless, investigating lags shorter than 1 day may identify stronger causality among these variables in the watersheds analyzed here.  When K is Granger caused by the rest of the variables, as shown in Figure 10, the relationship is very strong when WT and DO are the cause. This can be easily explained, since WT influences the ability of water to conduct electrical current and DO is highly related to WT, as discussed above. The Granger relationships are more uncertain when the other variables are considered as the causes, as shown by the wider box plots. This demonstrates that the ability of water to conduct electrical current is not as closely linked to variables like precipitation, air temperature (that is more variable than water temperature), discharge, and K at least at lags between 1 and 3 days. Nevertheless, investigating lags shorter than 1 day may identify stronger causality among these variables in the watersheds analyzed here. When K is Granger caused by the rest of the variables, as shown in Figure 10, the relationship is very strong when WT and DO are the cause. This can be easily explained, since WT influences the ability of water to conduct electrical current and DO is highly related to WT, as discussed above. The Granger relationships are more uncertain when the other variables are considered as the causes, as shown by the wider box plots. This demonstrates that the ability of water to conduct electrical current is not as closely linked to variables like precipitation, air temperature (that is more variable than water temperature), discharge, and K at least at lags between 1 and 3 days. Nevertheless, investigating lags shorter than 1 day may identify stronger causality among these variables in the watersheds analyzed here.  After analyzing all four scenarios, it is concluded that the strongest relationships (i.e., smallest p-values) are observed for lag 1. Based on these results, lag 1 relationships are further investigated as a function of the watershed size and urbanized area. For the sake of brevity, we only show the Granger causality relationship when DO is caused by the other six variables based on different watershed sizes ( Figure 11). Relationships are strong when WT and precipitation are the cause across all watersheds, regardless of their size, which may be in part due to the similar environmental and climatic characteristics of the watersheds. The Granger causality is also strong when T, discharge, and Tu are the cause, but in this case the p-value depends on the watershed size. As a result of lower retention time in smaller watersheds, the causality relationship between Tu and DO weakens as the watershed size decreases. The p-value is higher when DO is Granger caused by K in comparison to when other variables are the cause. After analyzing all four scenarios, it is concluded that the strongest relationships (i.e., smallest p-values) are observed for lag 1. Based on these results, lag 1 relationships are further investigated as a function of the watershed size and urbanized area. For the sake of brevity, we only show the Granger causality relationship when DO is caused by the other six variables based on different watershed sizes ( Figure 11). Relationships are strong when WT and precipitation are the cause across all watersheds, regardless of their size, which may be in part due to the similar environmental and climatic characteristics of the watersheds. The Granger causality is also strong when T, discharge, and Tu are the cause, but in this case the p-value depends on the watershed size. As a result of lower retention time in smaller watersheds, the causality relationship between Tu and DO weakens as the watershed size decreases. The p-value is higher when DO is Granger caused by K in comparison to when other variables are the cause.  Figure 12 displays Granger causality test results when DO is caused by the other six variables based on the level of urbanization in the watersheds. Watersheds 1, 2, 4, 5, and 10 show a level of urbanization lower than 70%, whereas the others are considered highly urbanized. Watersheds characterized by less urban area show stronger Granger causality relationship when WT, discharge, precipitation, and Tu are the cause of a change in DO. When soil infiltration is slow (which is commonly the case in more developed regions), the retention time is possibly shorter than 1 day, which is why such relationships are weaker. Figure 11. Box plots of p-values resulting from the Granger causality test based on watershed size (small, medium, and large) when DO is Granger caused by WT, T, discharge, precipitation, Tu, and K. Dots show the median and the vertical lines indicate +/− standard deviation. Figure 12 displays Granger causality test results when DO is caused by the other six variables based on the level of urbanization in the watersheds. Watersheds 1, 2, 4, 5, and 10 show a level of urbanization lower than 70%, whereas the others are considered highly urbanized. Watersheds characterized by less urban area show stronger Granger causality relationship when WT, discharge, precipitation, and Tu are the cause of a change in DO. When soil infiltration is slow (which is commonly the case in more developed regions), the retention time is possibly shorter than 1 day, which is why such relationships are weaker.

Conclusions
This study proposes a set of analytical tools to better understand the relationship between a suite of water quality indicators (WT, DO, Tu, and K) and select hydroclimatic variables (precipitation, discharge, and air temperature) in a series of watersheds in the Figure 12. Same as in Figure 11, but for different urbanization percentages (<70% and >70%).

Conclusions
This study proposes a set of analytical tools to better understand the relationship between a suite of water quality indicators (WT, DO, Tu, and K) and select hydroclimatic variables (precipitation, discharge, and air temperature) in a series of watersheds in the DMV region in the Eastern United States. Stream water quality in this region is of particular interest because of two main reasons: (i) it is one of the most rapidly growing urban coastal areas in the United States and (ii) its tributaries feed into the Chesapeake Bay, listed as an impaired water body according to the Clean Water Act. Thus, investigating how water quality relates to the local hydrometeorology and urbanization is crucial to developing effective and sustainable methods for water quality protection.
An extensive pre-processing is applied to the collected data, as a first step to fill in any missing observations in the time series and to remove any cyclic pattern and ensure stationarity, before applying the PCA and testing for Granger causality. Results from the PCA show that most of the variation in each watershed can be explained by only considering the first two principal components, except for watershed 4, which is the least developed watershed (almost entirely covered with forest) among all watersheds. This may be due to the highly non-linear hydrological processes ongoing in a more natural environment, where the relationship between precipitation, discharge, and water quality is complicated by the presence of dense vegetation that intercepts rainfall, slows down infiltration, and withholds a portion of stormwater runoff and the pollutants it carries. The selection of important indicators is based on the absolute value of loading. Higher loadings point to a strong relationship between an indicator and specific component. The most important variables identified by the PCA in all watersheds are turbidity, water temperature, and air temperature. Furthermore, while discharge is an important variable in highly urbanized watersheds, precipitation is more fundamental in less urbanized watersheds with higher soil infiltration rates and therefore lower volumes of runoff.
Results from the Granger causality analysis show how different lag times (1, 2, and 3 days) affect the causality relation between hydroclimatic variables and water quality indicators. In general, lag 1 shows more and stronger Granger causality relationships in comparison to lags 2 and 3. This is due to the limited size of the basins (characterized by hydrological responses faster than 2 days), but also to the fact that most watersheds are highly developed and therefore characterized by low soil infiltration rates (which corresponds to relatively short retention times). The strongest Granger causalities are observed when water temperature and dissolved oxygen concentration (which are highly correlated) are considered as the effect of the hydrometeorological variables and other water quality indicators, which corroborates the importance of these two water properties, since any change in the other variables can impact them. When dissolved oxygen concentration is caused by water temperature and precipitation, the watershed size does not play a role in the Granger causality relationships. In contrast, the watershed size changes the strength of such relationships when air temperature, discharge, and turbidity are the cause when dissolved oxygen is the effect. Moreover, urbanization triggers weaker Granger causality when discharge, conductivity, and turbidity are the cause and dissolved oxygen is the effect.
This work explored how PCA and Granger causality analysis can inform relationships between water quality, hydroclimatic variables, and watershed characteristics. A main conclusion is that even within watersheds characterized by similar climate, land use distribution, and size, such relationships vary largely. Thus, if a predictive model were to be built, such information should be carefully considered and predictions like watershed size and urban area should be included. For instance, machine learning algorithms for estimating and predicting water quality variables, based on the set of hydrometeorological variables and watershed information identified here, could be developed and adopted for the assessment of such relationships during extreme weather events, when collecting in-situ data is the most difficult, but also crucial.
Future studies should extend the proposed framework to watersheds characterized by different climatology and size in other regions of the world, to verify the impact of different climates on the identified relationships between hydrometeorology and water quality. Future work could also expand our analyses to more water quality indicators (e.g., pH, nitrate and phosphorous concentrations). Finer temporal resolutions should also be considered to investigate hydrological responses that are shorter than 1 day. Furthermore, a wider set of water quality indicators (including for instance nitrate concentration) should be investigated to generalize the results of this work and make the proposed analyses more useful for areas affected by different types of pollution.