Identifying the Influence of Land Cover and Human Population on Chlorophyll a Concentrations Using a Pseudo-Watershed Analytical Framework

Abstract: Increasing agricultural development and urbanization exacerbate the degradation of water quality in vulnerable freshwater systems around the world. Advances in remote sensing and greater availability of open-access data provide a valuable resource for monitoring water quality, but harmonizing between databases remains a challenge. Here, we: (i) developed a pseudo-watershed analytical framework to associate freshwater lakes with adjacent land cover and human population data and (ii) applied the framework to quantify the relative influence of land cover and human population on primary production for 9313 lakes from 72 countries. We found that land cover and human population explained 30.2% of the variation in chlorophyll a concentrations worldwide. Chlorophyll a concentrations were highest in regions with higher agricultural activities and human populations. While anthropogenic land cover categories equated to only 4 of the 18 categories, they accounted for 41.5% of the relative explained variation. Applying our pseudo-watershed analytical framework allowed us to quantify the importance of land cover and human population on chlorophyll concentration for over 9000 lakes. However, this framework has broader applicability for any study or monitoring program that requires quantification of lake watersheds.


Introduction
Freshwater is essential for human life and provides habitat to over 10% of all species on earth [1]. While freshwater lakes are distributed around the world, accessible freshwater is finite and comprises less than 1% of all water on the planet and is facing degradation [2]. Chlorophyll is a commonly used proxy for primary production and is used globally as a measure of lake trophic status and water quality [3]. Factors influencing chlorophyll concentrations include nutrient inputs [4,5], climate [6] and human influences [7]. Nitrogen and phosphorus are known drivers of chlorophyll [5,8] and are commonly found in excess in runoff from urban and agricultural landscapes [9,10]. Further, climate-related drivers such as warm air temperatures and low wind speeds can increase the stability of stratification in the water column which can result in increased algal biomasses [11]. Increases in the global human population and land-cover changes have also negatively affected freshwater quality [12]. Increased agricultural demand [13] and urbanization [14,15] both have contributed to increased runoff into lakes. All types of land cover contribute nutrients from the watershed into waterways, although the amounts and types of nutrients that come from nonpoint sources depend on a variety of drivers that include weather, topography, soil properties, land characteristics and management practices [16,17]. Here, we focus on further understanding the role of land cover and human population on chlorophyll concentrations in lakes worldwide.
The UN estimates that 70% of the world population will live in urban areas by 2050 [18]. Urbanized regions have increased impervious surfaces which have been linked to hydrological changes that degrade water quality [19,20]. Runoff and stormwaters from such regions contain higher sediment content, nutrients, heavy metals, oil and grease, pesticides and bacteria [21] which directly end up in aquatic systems [16]. In response to a growing population, a substantial expansion of agricultural production and water use will be needed in the coming decades to provide the necessary food and products to support the population growth and to replace agricultural lands lost to urban expansion [13,22]. Agricultural land can impact water quality, such as through croplands [23] that are frequently sprayed with phosphorus-rich pesticides [24] and administered nutrient-rich fertilizers [25]. The inputs of nitrogen and phosphorus from nonpoint sources like fertilizers and pesticides can degrade water quality and have been associated with higher chlorophyll a concentrations in watersheds with agricultural land cover [26,27]. Pasturelands from raising livestock can also affect water quality through increased soil erosion [28,29]. Quantifying the cover of agriculture and urbanization within lake watersheds can help manage lake water quality.
Chlorophyll a concentrations can be elevated in watersheds with higher population density [30]. Such regions have multiple sources of nutrient inputs into freshwater lakes such as greater discharge of food, sewage, industrial waste and wastewater, which ends up loading freshwater bodies [4,31]. For example, human population density has been found as a driver of chlorophyll in the Laurentian Great Lakes [25]. Thus, understanding anthropogenic effects on lakes requires investigation into human population densities and urban land cover.
Although land cover and human population are known determinants of chlorophyll concentrations in lakes, we are currently limited by our ability to appropriately quantify the proportion of each land cover category and the number of people within a lake's watershed. Advances in remote sensing and data availability provide an opportunity to harmonize land cover and human population data from disparate sources [32] in order to develop a broader understanding of the effects for lakes. A novel analytical framework was necessary for the creation of watersheds for lakes where the delineated watershed is not known, particularly for small and remote lakes. This framework allows lakes with known and unknown watersheds to be equally analyzed. Studies to date on the impacts of land cover and human population on primary production have been largely regional, examining the effects of these drivers within a single or few watersheds that have been previously delineated [27,33-35]. Using the proposed methodology for generating watersheds, we can analyze the drivers of chlorophyll in lakes globally.
Our goal was to quantify the relative influence of land cover and human population on chlorophyll a concentrations for over 9000 freshwater lakes distributed across 72 countries and on every continent. First, we developed a novel analytical framework to quantify the proportion of each land cover category and the number of people living within a watershed. Second, we tested the analytical framework to quantify the influence of land cover and human density on chlorophyll a concentrations. We hypothesized that chlorophyll a concentrations would be higher in areas with larger human populations and a greater human footprint (i.e., more agriculture or urban areas) because of increased nutrient inflow from human activities. We expect that land cover and human population within the developed pseudo-watersheds will explain a significant portion of the variation in chlorophyll a concentrations.

Data Acquisition
We acquired chlorophyll a, water chemistry and lake morphology data from an online database of lakes distributed globally [36]. The database included 228,166 unique observations from 11,941 lakes [36]. Lakes were removed if their watersheds consisted of more than 50% water, which could occur when a lake was coastal or surrounded by other lakes. This reduced the number of lakes in this study to 9313, although 75% of these lakes were found in North America due to the increased availability of data. If data were collected for multiple years for a lake, the data from the most recent year were chosen, such that only spatial patterns were assessed. If a lake had more than one observation in a year, we selected the month with the highest concentration of chlorophyll a. Across the dataset, chlorophyll concentrations were expressed in mg/L.
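The selection rule described above (most recent year per lake, then the highest chlorophyll a value within that year) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function name `select_lake_values`, the tuple layout and the example values are all hypothetical, and the sketch assumes one chlorophyll a value per lake and month.

```python
from collections import defaultdict


def select_lake_values(observations):
    """Keep, for each lake, the highest chlorophyll a value of its most recent year.

    `observations` is an iterable of (lake_id, year, month, chl_a) tuples
    (a hypothetical layout chosen for this sketch).
    """
    by_lake = defaultdict(list)
    for lake_id, year, month, chl_a in observations:
        by_lake[lake_id].append((year, month, chl_a))

    selected = {}
    for lake_id, rows in by_lake.items():
        # Restrict to the most recent sampling year for this lake.
        latest_year = max(year for year, _, _ in rows)
        latest_rows = [row for row in rows if row[0] == latest_year]
        # Within that year, keep the month with the highest concentration.
        year, month, chl_a = max(latest_rows, key=lambda row: row[2])
        selected[lake_id] = {"year": year, "month": month, "chl_a": chl_a}
    return selected


# Made-up observations, for illustration only.
example = [
    ("lake_a", 2012, 7, 0.004),
    ("lake_a", 2014, 6, 0.006),
    ("lake_a", 2014, 8, 0.011),
    ("lake_b", 2010, 5, 0.002),
]
result = select_lake_values(example)
```

Here `lake_a` keeps its August 2014 value and the 2012 record is ignored, so each lake contributes a single observation to the spatial analysis.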
Human population data were acquired from a NASA dataset called "Gridded population of the world" [37]. This dataset estimated the distribution of humans using population prediction in combination with census data and spatial density at a variety of spatial resolutions. We chose the best resolution available at 1 km [37]. Land cover data were acquired from NASA Earth Data and used the Moderate Resolution Imaging Spectroradiometer (MODIS) to categorize land cover into 17 categories. The categories included 500 m × 500 m cells of a variety of land covers including forest, grasslands, wetlands, urban, agriculture, permanent snow, barren land and water [38]. We used the 2015 version of both the land cover and the human population data because 2015 provided the best balance between recency and accuracy with the sampling year of the studied lakes. In the dataset, land cover categories are presented as a proportion of the cells in the pseudo-watershed. Human population was expressed as the total number of people within the pseudo-watershed.

Development of an Analytical Framework
We developed a framework to estimate land cover and human population for the watershed surrounding a lake (Figure 1). This could be applied to the 1.4 million lakes larger than 10 ha for which shapefiles of shoreline polygons are readily accessible [39]. Open-access code to implement this framework is available at: https://doi.org/10.5281/zenodo.4161959.
Step 1: Input Lake Data
We imported basic lake information that included lake name, latitude and longitude, chlorophyll a, and, where available, lake area. Our chlorophyll a database contains over 250,000 occurrences as there are many lakes with multi-site and/or multi-year entries [36]. We selected unique instances of lakes by their coordinates to ensure only one watershed is drawn for each lake. Chlorophyll a concentrations were also reviewed to avoid cases of multiple sampling coordinates within the same lake.
Step 2: Look for HydroLAKES match
We used the lake coordinates (lat,lon) and implemented the function find_hydrolakes_match to identify the nearest HydroLAKES polygon, as well as the distance from the sample coordinate to the edge of the HydroLAKES polygon. If the distance was 0 km (i.e., the sampling point was within the polygon), we called this a match, extracted the HydroLAKES polygon shapefile and recorded the corresponding lake area, watershed area and shoreline length. Finding a HydroLAKES match is the ideal situation as HydroLAKES provides easily accessible and readily available shoreline polygons, shoreline length, surface area, mean depth, volume and residence time for 1.4 million lakes [39]. In situations where there was no HydroLAKES match, we used a surface area value from our original chlorophyll database [36] to complete the creation of a pseudo-watershed. This was necessary for 22.6% of the lakes included in the analysis.

Step 3: Generate Watershed Target Area
We used three methods to define a watershed target area (lake surface area + watershed area) around each lake: (Case 1) for lakes with a HydroLAKES match (68.3% of lakes), we used the sum of the surface area and watershed area values provided by HydroLAKES; (Case 2) for lakes without a HydroLAKES match but with a surface area value (22.6% of lakes), we calculated the target area using the slope of the log-log relationship between surface area and target area (i.e., surface area + watershed area) from the lakes within the HydroLAKES repository [39]; and (Case 3) lakes with neither a watershed area nor a surface area (9.1% of lakes) were assigned the smallest target area, which is a circle of radius 1 km and area 3.14 km². These lakes were assigned the smallest target area because HydroLAKES includes coverage of almost all large lakes and thus lakes without a match are likely too small to be identified by remote sensing or GIS methods.
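The three cases can be written as a single function. The sketch below is an illustration under stated assumptions rather than the released implementation: the function names are hypothetical, the log-log relationship is fit here by ordinary least squares with both a slope and an intercept, and the reference areas are synthetic numbers invented for the example, not HydroLAKES values.

```python
import math

MIN_RADIUS_KM = 1.0
# Case 3: circle of radius 1 km, i.e., about 3.14 km^2.
MIN_TARGET_AREA_KM2 = math.pi * MIN_RADIUS_KM ** 2


def fit_loglog(surface_areas, target_areas):
    """Least-squares fit of log10(target) = intercept + slope * log10(surface)."""
    xs = [math.log10(a) for a in surface_areas]
    ys = [math.log10(t) for t in target_areas]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def target_area_km2(surface_area=None, watershed_area=None, loglog_fit=None):
    """Return the watershed target area (lake + watershed) for one lake."""
    if surface_area is not None and watershed_area is not None:
        # Case 1: both areas are known (HydroLAKES match).
        return surface_area + watershed_area
    if surface_area is not None and loglog_fit is not None:
        # Case 2: predict the target area from surface area alone.
        slope, intercept = loglog_fit
        return 10 ** (intercept + slope * math.log10(surface_area))
    # Case 3: nothing known, fall back to the minimum circle.
    return MIN_TARGET_AREA_KM2


# Synthetic reference lakes following target = 10 * surface^0.9 (made up).
ref_surface = [0.5, 1.0, 5.0, 20.0, 100.0]
ref_target = [10 * a ** 0.9 for a in ref_surface]
fit = fit_loglog(ref_surface, ref_target)

case1 = target_area_km2(surface_area=2.0, watershed_area=30.0)
case2 = target_area_km2(surface_area=4.0, loglog_fit=fit)
case3 = target_area_km2()
```

For circular pseudo-watersheds, the radius follows from the target area as r = sqrt(area / π), which returns 1 km for the Case 3 minimum.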
Step 4: Create a target shape
Each lake has a unique shape that should be represented by an accurate shape or an estimation of its shape based on its surface area. We used two methods for creating a target shape depending on whether a shapefile existed from a HydroLAKES match. First, for lakes with a HydroLAKES match, the lake possessed a shapefile as part of the HydroLAKES repository that indicated the shape of the lake as a polygon [39]. The watershed was drawn as an "expansion" of this lake shape, equally in all directions, along the perimeter of the shapefile of the lake to respect the shape of the lake. If there was more than one shapefile provided by HydroLAKES (likely from islands being present), we selected the primary and largest portion of the lake (i.e., the exterior perimeter). Second, if there was no HydroLAKES match, the final target shape was a circle with the theorized target area ("Case 2" in Step 3) centered on the sampled location coordinates in the lake. If the lake also had no value for surface area, a circle with the minimum target area of 3.14 km² was applied ("Case 3" in Step 3). This polygon was centered on the location of the sampled coordinates using the function Point(lat,lon).

Step 5: Convert geographic coordinates
We converted the geographical units from latitude-longitude to the Universal Transverse Mercator (UTM) in order to standardize the (horizontal and vertical) map units and solve for the size of the buffer of each watershed in meters.
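Choosing the UTM zone for a lake depends only on its coordinates. The following sketch shows the standard zone arithmetic and the corresponding WGS 84 / UTM EPSG codes (326xx for the northern hemisphere, 327xx for the southern). It is an assumption that the framework selects zones this way; the function name `utm_zone` is hypothetical, and the special zone boundaries around Norway and Svalbard are ignored.

```python
import math


def utm_zone(lat, lon):
    """Return (zone number, hemisphere, EPSG code) of the standard UTM zone."""
    if not -80.0 <= lat <= 84.0:
        raise ValueError("UTM is only defined between 80 S and 84 N")
    # Zones are 6 degrees wide and numbered 1..60 eastward from 180 W.
    zone = int(math.floor((lon + 180.0) / 6.0)) % 60 + 1
    northern = lat >= 0
    # WGS 84 / UTM codes: 32601-32660 (north), 32701-32760 (south).
    epsg = (32600 if northern else 32700) + zone
    return zone, "N" if northern else "S", epsg


# Illustrative coordinates only (not taken from the dataset).
north_example = utm_zone(52.0, -97.0)
south_example = utm_zone(-38.0, 176.0)
```

Once the zone is known, a projection library can reproject the lake polygon so that buffer distances are expressed in meters.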
Step 6: Calculate buffer size for HydroLAKES matches
To estimate the length of the watershed buffer value that will be applied equally and perpendicularly from the shoreline of the lake, we treated the lakes and watersheds as rectangles. From the lake's shoreline length (extracted from the HydroLAKES database), the lake's area and the target area, we determined an estimate of the buffer.
This final quadratic equation was used to solve for the length of buffer. If the calculated buffer size was smaller than 1 km, it was adjusted to 1 km.
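The equation itself is not reproduced in this text. One reading consistent with the rectangle description is the following reconstruction, which is an assumption and may differ from the released code. For a rectangular lake with sides l and w, the area is A = lw and the shoreline length is P = 2(l + w). Extending every side outward by a buffer b gives a target area T = (l + 2b)(w + 2b) = A + Pb + 4b², so b is the positive root of 4b² + Pb − (T − A) = 0. The function name `buffer_width_km` is hypothetical.

```python
import math

MIN_BUFFER_KM = 1.0  # buffers below 1 km are raised to 1 km, as in the text


def buffer_width_km(shoreline_km, lake_area_km2, target_area_km2,
                    minimum=MIN_BUFFER_KM):
    """Positive root of 4*b^2 + P*b - (T - A) = 0, floored at `minimum`."""
    # Area that the buffer must add around the lake (never negative).
    extra_area = max(target_area_km2 - lake_area_km2, 0.0)
    discriminant = shoreline_km ** 2 + 16.0 * extra_area
    width = (-shoreline_km + math.sqrt(discriminant)) / 8.0
    return max(width, minimum)


# Check on an exact rectangle: a 2 km x 3 km lake (A = 6, P = 10) with a
# 2 km buffer on every side has a target area of 6 km x 7 km = 42 km^2.
exact = buffer_width_km(shoreline_km=10.0, lake_area_km2=6.0,
                        target_area_km2=42.0)

# A target only slightly larger than the lake gives a buffer under 1 km,
# which is adjusted up to the 1 km minimum.
floored = buffer_width_km(shoreline_km=10.0, lake_area_km2=6.0,
                          target_area_km2=8.0)
```

Because real shorelines are longer than those of a rectangle of equal area, this approximation tends to yield narrower buffers for convoluted lakes, which is one reason a minimum buffer is useful.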
Step 7: Draw watersheds on lakes
After the buffer values were calculated and adjusted for accuracy, they could be applied to the lakes with a predefined shape. The polygons for all lakes were reduced to a maximum of 10 sides in order to reduce processing load, with the caveat that the area of the polygons may be slightly larger or smaller than the original. The calculated buffer was added directly outward, perpendicular to the perimeter of the simplified polygon. This created an expanded polygon that includes the lake area with a distinct watershed area. The geographical reference frame was converted back to latitude-longitude.
Step 8: Adding corresponding land cover and human population data
Using the generated pseudo-watersheds, the land cover and human population data were extracted for each watershed. In this way, we could obtain a more realistic count of people within the watersheds and the proportion of each land cover category within the watershed. In the past, we would be limited to quantifying land cover or population estimates within a pre-defined buffer surrounding a lake without consideration of lake shape or watershed size.
Land cover and human population cells were extracted using the custom-built code provided with the framework. The pseudo-watershed polygon (or circle) was reprojected to the grids of the human population data and land cover data, extracting only pixels that lie within the boundary of the polygon. This yielded the number of pixels of each land cover category within each watershed buffer and the total human population within the watershed buffer.
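Once the pixels inside a pseudo-watershed have been extracted, the summary step reduces to counting. The sketch below covers only that final step and assumes the raster extraction has already produced two plain lists; the function name `summarize_pixels`, the category labels and the pixel values are hypothetical.

```python
from collections import Counter


def summarize_pixels(land_cover_pixels, population_pixels):
    """Return (proportion of each land cover category, total population)."""
    counts = Counter(land_cover_pixels)
    total = sum(counts.values())
    if total == 0:
        proportions = {}
    else:
        proportions = {category: n / total for category, n in counts.items()}
    return proportions, sum(population_pixels)


# Made-up pixels for one pseudo-watershed.
land_cover = ["croplands"] * 6 + ["water"] * 3 + ["urban"] * 1
population = [0.0, 12.5, 40.0]

proportions, total_population = summarize_pixels(land_cover, population)

# The same proportions support the exclusion rule used in Data Acquisition:
# lakes whose watershed is more than 50% water are removed.
mostly_water = proportions.get("water", 0.0) > 0.5
```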

Statistical Analysis
We quantified the relationship between land cover, human population and chlorophyll using Spearman correlations, Principal Component Analysis and Random Forests. We assessed the normality of all variables using an Anderson-Darling normality test which revealed that chlorophyll was not normally distributed (p < 0.05). Log transformed chlorophyll concentrations were also not normally distributed but were used for this analysis. We quantified the correlation between the proportion within each pseudo-watershed for each land cover category and human population with chlorophyll a using Spearman correlations (cor.test) in R [40]. The Spearman correlation is a nonparametric test which measures the strength and direction of association between ranked variables [41].
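The analysis itself was run with cor.test in R. To make explicit what the statistic measures, the following Python sketch computes Spearman's rho as the Pearson correlation of average ranks. It is an illustration only: it returns no p-value, the function names are hypothetical, and the example values are made up.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the ranks (undefined if either input is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Made-up values: a monotonic but strongly nonlinear relationship.
cropland_fraction = [0.1, 0.4, 0.2, 0.8]
chlorophyll = [1.0, 10.0, 3.0, 1000.0]
rho = spearman_rho(cropland_fraction, chlorophyll)
```

Because only ranks enter the calculation, the statistic is unaffected by the skewed, non-normal distribution of chlorophyll and by monotonic transformations such as the logarithm.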
We conducted a Principal Component Analysis (PCA) with the function prcomp to assess the strength of the relationship between chlorophyll and the 17 categories of land cover [42]. The PCA algorithm generates principal components (axes) along which the data show the maximum amount of variation [43]. The PCA outputs a two-dimensional ordination plot which visualizes the relationships between predictor variables [43]. A broken stick analysis (Package PCDimension, [44]) and scree plot test were used to identify the three significant axes in the Principal Component Analysis. The PCA was visualized using the R function fviz_pca_var [42].
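The broken stick criterion compares the proportion of variance explained by each axis with the share expected if the total variance were divided at random among p components, b_k = (1/p) Σ_{i=k}^{p} 1/i. The Python sketch below illustrates that rule; it is not the PCDimension implementation, the function names are hypothetical and the explained proportions are made up.

```python
def broken_stick(n_components):
    """Expected variance proportions under the broken stick model."""
    return [
        sum(1.0 / i for i in range(k, n_components + 1)) / n_components
        for k in range(1, n_components + 1)
    ]


def n_axes_to_keep(explained_proportions):
    """Count leading axes whose explained proportion exceeds expectation."""
    expected = broken_stick(len(explained_proportions))
    kept = 0
    for observed, threshold in zip(explained_proportions, expected):
        if observed <= threshold:
            break
        kept += 1
    return kept


# Made-up proportions of variance for a three-axis PCA.
observed = [0.70, 0.20, 0.10]
expected = broken_stick(3)
kept = n_axes_to_keep(observed)
```

In this toy case only the first axis exceeds its expected share (about 0.61), so a single axis would be retained.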
We fit a random forest model (Package: randomForest) to rank the importance of land cover and human population variables for chlorophyll a concentrations [45]. The random forest used 500 regression trees to identify the level of importance of each of the land cover categories and human population. Random forests are built as ensembles of decision trees and work through bootstrap aggregation [46,47]. Each tree is fit to a randomly sampled subset of the data, and the final prediction is based on the aggregated predictions of all trees [46,48]. In this study, chlorophyll was modeled against land cover and human population (drivers) to identify which predictors individually explain the most variation.

Analytical Framework for the Development of Pseudo-Watersheds
The chlorophyll concentrations were mapped to visualize the geographical distribution of the 9313 sampled lakes in this analysis (Figure 2). The dataset consists of 9313 lakes in 72 countries on all continents including Antarctica. However, the majority of lakes are located in the United States (8266 lakes). Of the 9313 lakes, the analytical framework created 6361 watersheds that were guided by the shoreline polygon from HydroLAKES while the remaining 2952 lakes had a circular watershed. The average pseudo-watershed size was 1894.37 km². The largest watershed was for the Aswan Reservoir in Egypt, which was 2,755,957.97 km² in size because of the Nile River. The largest pseudo-watershed in North America, where the majority of the lakes are located, is the Lake Winnipeg pseudo-watershed which measured 815,024.47 km². The smallest watersheds were 3.14 km², which belonged to lakes that had neither a HydroLAKES match nor a value for surface area (Case 3).
Inside the watersheds, the average pseudo-watershed was represented by all land cover categories. Woody savannas and croplands represented the largest average proportions of land cover, occupying 18.4% and 18.1% of the average watershed, respectively. In contrast, closed shrublands and deciduous needleleaf forests were the least present in the pseudo-watersheds, representing only 0.02% and 0.05% of the average watershed, respectively. There were only 341 pseudo-watersheds that were completely uninhabited, while the largest population was 254,200,944 inhabitants located within the pseudo-watershed of the Three Gorges Reservoir in China. The average pseudo-watershed was made up of 73.9% natural land cover and 26.1% anthropogenic land cover with 148,645 inhabitants.

Quantification of Land Cover and Human Density on Chlorophyll Concentrations
Chlorophyll concentrations were higher in watersheds with more croplands (rho = 0.33, p < 0.0001), cropland mosaics (rho = 0.2, p < 0.0001) and people (rho = 0.2; p < 0.0001; Figure 3). Of the natural land cover categories, chlorophyll concentrations were slightly higher in grasslands (rho = 0.044, p < 0.0001) but lower in mixed forests (rho = −0.29, p < 0.0001), deciduous broadleaf forests (rho = −0.24, p < 0.0001) and woody savannas (rho = −0.22, p < 0.0001; Figure 3). Land cover categories and total human population cumulatively explained 30.2% of the variation (mean of squared residuals of 0.25) of chlorophyll a concentrations across 9313 lakes distributed worldwide. Of the predictors, croplands and total human population size within the watershed had the greatest influence on chlorophyll a, explaining 15.7% and 15.4% of the relative explained variation, respectively (Figure 5). In total, anthropogenic variables (the proportion of urban, cropland and cropland mosaic land cover, in addition to human population) explained 41.5% of the relative variation of chlorophyll a concentrations. Of the natural land cover categories, grasslands and woody savannas were the next most important variables, explaining 9.1% and 8.8% of the variation, respectively (Figure 5).

Analytical Framework
We developed a pseudo-watershed analytical framework to quantify the influence of land cover and human population on chlorophyll a concentrations for 9313 lakes. Capitalizing on advances in remote sensing, GIS and in situ datasets, we provided a tool to harmonize disparate datasets to further understand lakes and their watersheds. Our analytical framework can be readily applied to lakes greater than 10 ha within the HydroLAKES database [39] to more realistically quantify land cover and human population within a lake's watershed. This tool integrated data from a global chlorophyll a dataset, HydroLAKES, MODIS Land Cover and NASA Gridded Population of the World to create a workflow and framework with a variety of applications. Our analytical framework allowed us to capture 30% of the variation in chlorophyll a concentrations based on land cover and human population alone. The analytical framework we developed could address a variety of research questions that connect land cover, human population and other watershed characteristics to a number of freshwater response variables including water quality, contaminant levels, bacterial concentrations, fish populations and aquatic community composition [47]. We applied this analytical framework to showcase the importance of land cover and human population as significant drivers of chlorophyll a concentrations in freshwater lakes and emphasized the importance of mitigating nutrient inputs from agricultural and urban landscapes associated with human settlement.
The shape of a lake's watershed is quite complex as it involves tracing all upstream inputs into a given lake. This detailed information is not readily available for most lakes on a global scale. With this in mind and with the goal of applying a consistent framework to 9313 lakes globally, we developed an approximation of the footprint of a lake's watershed using a uniform (isotropic) enlargement of a polygon representation of the lake using HydroLAKES [39] or simply a circular representation for lakes less than 10 ha. By its nature, this isotropic enlargement results in the inclusion of not only upstream but also downstream areas; some upstream area is excluded at the expense of including downstream flows. When coupled with the MODIS land cover data and population data, this results in a bias for individual lake analyses. With 9313 lakes, however, we anticipate that no systematic bias exists across the dataset; that is, populations are not biased to live either upstream or downstream of lakes globally, nor are forests, crops or urbanized areas (land cover). For a random sampling of ~100 lakes, we visually confirmed the correct location of the pseudo-watershed: MODIS land cover "water" classified pixels overlapped with the HydroLAKES lake polygon and the total population within the lake polygon was near or at zero.

Agricultural Influences on Chlorophyll
As expected, we found that chlorophyll a concentrations were higher in lakes whose watersheds have a higher proportion of agricultural activities. Anthropogenic land covers and human population were responsible for 41.5% of the relative explained variation in chlorophyll while representing only four of eighteen variables. We found that the proportion of croplands within a watershed was the most important driver of chlorophyll concentrations (rho = 0.33). The impact of cropland on chlorophyll a concentrations is mainly derived from the fertilizers used in agricultural practices, which have led to a rise in nitrogen and phosphorus being added to the aquatic ecosystem [49]. The contribution of agriculture to eutrophication is caused by both current farming practices and the legacy of previous fertilizer use over the 20th century [50]. Crop uptake of nitrogen and phosphorus can be inefficient, resulting in overuse of fertilizer to ensure adequate growth rates [51]. Rain events cause the unused fertilizer to be washed away and deliver large quantities of nitrogen and phosphorus to lakes and other freshwater systems [52]. For example, agricultural activities were responsible for 82% of nitrogen loading and 43% of phosphorus loading into freshwater lakes across Denmark [53]. Agriculture thus continues to be a significant driver of eutrophication in freshwater lakes globally.
Systematic land planning and modifications to nutrient application could allow strong agricultural production to coexist with freshwater quality. Although cropland/natural vegetation mosaics were positively correlated with chlorophyll concentrations, the relationship was considerably weaker relative to cropland land cover. This suggests that mosaic landscapes might have less of an impact on chlorophyll because of the inclusion of more natural functions as a buffer to nutrient input [54]. This indicates that strategic planning of agricultural land could mitigate negative effects on water quality. A more conscious approach to fertilization could also reduce the amount of nutrient-loaded runoff that is currently contributing to higher chlorophyll concentrations [51].

Natural and Anthropogenic Land Cover
Watersheds with high chlorophyll a concentrations typically had higher human populations (rho = 0.2) and more urbanized landscapes (rho = 0.17). High population densities around freshwater bodies are connected to modification in land cover and such changes result in wastewaters, stormwaters and runoff having increased nutrients [55,56]. Higher nutrient runoff is a particular concern in highly urbanized locations [57,58]. For example, Dianchi Lake is one of the most eutrophic lakes in China and its watershed is located near an urban area which received 240 million liters of runoff in 2000 [55].
As watersheds become more urbanized and their population grows, the number of human activities that degrade freshwater quality rises (e.g., [8]). Moore et al. (2001) surveyed 30 lakes in Washington state and found that lakes located near urban areas were more likely to receive nutrient loading from wastewater and have higher values for chlorophyll as compared to lakes in undeveloped areas [33].
Increasing the level of vegetation within urban land cover offers the opportunity to improve the absorption of nutrients while reducing the amount of impervious surfaces in the landscape. The result is a decline in the overall volume of runoff that reaches freshwater systems [59]. The implementation of green infrastructure, such as stormwater ponds or green roofs, in urban landscapes can reduce nutrient loading into freshwater systems [60,61] while providing a co-benefit for natural systems [62]. Additionally, a reduction of wastewater discharge and an improvement in the treatment of wastewater offers an opportunity for both cities and citizens to reduce their impact on freshwater. Continued improvement of the process and standard of wastewater treatment is a key step in reducing point source nutrient loading from urban areas.
We found that forested landscapes were negatively associated with chlorophyll a concentrations (Mixed Forest rho = −0.29, Deciduous Broadleaf Forest rho = −0.24) and appeared to be essential in limiting the amount of runoff that is bound for freshwater systems [63]. In forested landscapes, soil is more porous and prone to water infiltration which reduces surface runoff and increases nutrient uptake [64]. Forested landscapes have the opposite effect of agricultural landscapes by reducing the inputs of total nitrogen and total phosphorus into lakes [50]. As such, maintaining undisturbed forests is key for mitigation efforts because activities such as logging and deforestation result in large scale modifications to the surrounding ecosystem and watershed [65].

Sustainable Water Management
The sustainability of water quality continually faces threats. The needs of the future must be considered equally with the demands of today for water resources to be truly sustainable [66]. In order to provide freshwater for all ecological and human requirements, availability, quality and access of freshwater must all be maintained to support its ecosystem services [67], including habitat, drinking water, nutrient cycling and recreation [68]. Activities within a watershed are an important driver of water quality. For example, anthropogenic land cover has had a significant impact on Earth systems including freshwater [69,70]. Based on our results, watersheds with urban and agricultural land cover along with a high human population are linked to elevated chlorophyll concentrations and overall eutrophication of lakes. The sustainability of water resources is threatened by increased water demands and high volumes of wastewater [8,55] from urbanized regions and the addition of nutrient-rich runoff from agriculture [49,53]. A large human population increases pressure on agriculture and other ecosystem services and also creates strain on the freshwater sustainability of that region [69]. Understanding the potential threats of degraded freshwater at the watershed scale is integral to the protection of water and ensuring it remains safe and accessible for today and the future for both ecological and anthropogenic needs.

Conclusions and Future Directions
Agricultural watersheds and human populations are established drivers of chlorophyll a concentrations and can also influence phosphorus and nitrogen inputs [51]. Anthropogenic nutrient loading in combination with warming air and water temperatures, increased solar radiation and a stable water column can cause blooms of harmful cyanobacteria [71]. The effects of eutrophication and harmful algal blooms can go beyond effects on natural systems by impacting drinking water, tourism and real estate [72]. Managing chlorophyll concentrations is essential to maintaining the health of aquatic and terrestrial ecosystems while supporting public safety.
Based on their landscape position, lakes are supplied with nutrient rich runoff, wastewater and pollutants that are delivered by river network or overland drainage from anywhere in the watershed [73]. We identified that the land cover in watersheds has a strong effect on lake water quality. These findings suggest that conservation efforts for freshwater quality must extend beyond the lake to the watershed scale [21,27]. In regions where cropland is established as a significant portion of the watershed, we recommend: (i) limiting the excessive use of fertilizers [74]; (ii) increasing the efficiency of the water usage to reduce the amount of runoff [75]; (iii) creating natural buffers between agricultural fields and waterways to reduce potential nutrient runoff [76]; and (iv) increasing the proportion of natural land covers that reduce chlorophyll concentrations through land conservation or restoration. Mitigating nutrient inputs from agricultural and urbanized landscapes will be essential to preserving water quality and the essential ecosystem services freshwater provides [4]. Lastly, our study developed a novel analytical framework to connect remote sensing with in-situ data to further understand the relationships between freshwaters and their watersheds. We highlighted the importance of land cover and human population as significant drivers of chlorophyll concentrations in lakes across the world.