Local Phenomena Shape Backyard Soil Metabolite Composition

Soil covers most of Earth’s continental surface and is fundamental to life-sustaining processes such as agriculture. Given its rich biodiversity, soil is also a major source for natural product drug discovery from soil microorganisms. However, the study of the soil small molecule profile has been challenging due to the complexity and heterogeneity of this matrix. In this study, we implemented high-resolution liquid chromatography–tandem mass spectrometry and large-scale data analysis tools such as molecular networking to characterize the relative contributions of city, state and regional processes on backyard soil metabolite composition, in 188 soil samples collected from 14 USA States, representing five USA climate regions. We observed that region, state and city of collection all influence the overall soil metabolite profile. However, many metabolites were only detected in unique sites, indicating that uniquely local phenomena also influence the backyard soil environment, with both human-derived and naturally-produced (plant-derived, microbially-derived) metabolites identified. Overall, these findings are helping to define the processes that shape the backyard soil metabolite composition, while also highlighting the need for expanded metabolomic studies of this complex environment.


Introduction
Soil is a highly complex and diverse mixture of minerals and organic material ubiquitous on the Earth's surface [1]. Its composition is influenced by large-scale factors such as climate, temperature and humidity, but also local phenomena such as human activity. Soil composition plays an important role in the regulation of many processes, such as plant growth, water systems and microorganism biology [2,3]. The biodiversity and microbial competition in the soil is also a rich source for natural product drug discovery [1,4,5]. Indeed, small molecules (metabolites) are major effectors of biological function, reflecting the active phenotype resulting from an environment's genetic potential [6]. However, the study of the soil small molecule profile has been limited by the complexity and heterogeneity of this matrix. In recent years, due to its sensitivity and high throughput capabilities, liquid chromatography-tandem mass spectrometry (LC-MS/MS) has become an attractive and powerful analytical tool for targeted and untargeted analysis of metabolite profiles from many sources, including the soil. Targeted LC-MS soil analysis has focused predominantly on the quantification of soil contaminants hazardous to human health, such as pesticides and herbicides, rocket fuel, polycyclic aromatic hydrocarbons and antibiotics [7][8][9][10][11]. Untargeted analyses, in contrast, have focused on environments less impacted by human activity. For example, Ladd et al. developed an untargeted LC-MS/MS method to analyze polar metabolites from an Arctic soil core [12]. Swenson et al. implemented hydrophilic interaction liquid chromatography (HILIC)-MS, across 20 arid biocrust samples from a USA national park, relating metabolite profile following a wetting event to microbial growth [2], while Jenkins et al. studied the metabolite profile of a single soil sample collected from the Oak Ridge Field Research Center by HILIC chromatography [13]. In contrast, Hewavitharana et al. used reversed-phase liquid chromatography to assess the impact of a plant pathogen disinfestation method on an orchard metabolome [14].
Given the importance of soil in agriculture, many of these prior untargeted metabolomic studies have naturally focused on soil samples relevant to this activity. However, the urban ecosystem also has high significance in terms of human health and civil engineering decisions; urban soils may also reflect human activity, although this has yet to be studied by metabolomics [15]. Likewise, given that most untargeted metabolomic studies have focused on a limited number of samples, there is a need for larger-scale metabolomics studies of soils to determine the factors affecting soil composition and assess the metabolic diversity of this environment. To address these gaps, in this study, we leveraged LC-MS/MS and large-scale data analysis tools (principal coordinate analysis, molecular networking [16], MolNetEnhancer [17]). Our goal was to characterize the relative contributions of city, state and regional processes on the backyard soil metabolite composition, and to determine whether this sample type can be used to study human behavior, using samples collected through a crowdsourced citizen science initiative (https://whatsinyourbackyard.org/). Analysis of 188 samples collected from 14 USA States, representing five USA climate regions, led to the detection of 3407 metabolite features, including anthropogenic, plant-derived and microbially-derived metabolites. City, state and regional factors affected the overall metabolite composition, with many metabolite features unique to a given backyard sample. This diversity supports the need for expanded studies of this complex environment using the methodologies implemented here.

Impact of Collection City, State and Climate Region on the Overall Soil Metabolite Composition
Metabolites were analyzed from 188 backyard soil samples collected from 45 cities, across 14 states, and representing five of the United States National Oceanic and Atmospheric Administration (NOAA) climate regions (Figure 1a). To determine the relative impact of these geographic factors on the overall soil metabolite composition, principal coordinate analysis (PCoA) was performed. PCoA data showed statistically significant clustering by collection city, state and NOAA climate region, indicating that all these factors influence backyard soil metabolite composition (PERMANOVA analysis, p < 0.05 for each metadata category). PERMANOVA analysis of this PCoA data further indicated that NOAA region accounts for 7.38% of the chemical variation in the data, collection state accounts for 15% of the chemical variation and collection city accounts for 33.8% of the chemical variation (Figure 1b, PERMANOVA p = 0.001, R 2 = 0.0738 for NOAA region; Figure 1c, PERMANOVA p = 0.001, R 2 = 0.150 for collection state; Figure 1d, PERMANOVA, p = 0.001, R 2 = 0.338 for collection city). These differences were also apparent when the data were restricted to the cities for which we had the most samples available (Figure 1e, PERMANOVA, p = 0.001, R 2 = 0.203 by city). Overall, collection city, therefore, had the highest effect size, with collection state explaining more of the variation in the data than the NOAA region, but less than collection city. These findings indicate that local phenomena have the strongest impact on backyard soil metabolite profile.
had the most samples available (Figure 1e, PERMANOVA, p = 0.001, R 2 = 0.203 by city). Overall, collection city, therefore, had the highest effect size, with collection state explaining more of the variation in the data than the NOAA region, but less than collection city. These findings indicate that local phenomena have the strongest impact on backyard soil metabolite profile. There were 3407 metabolite features retained in our analysis after blank removal. Of those, a core 326 metabolite features were common across all regions (Figure 2a). These include the insect repellent diethyltoluamide (DEET), and plant-derived metabolites (oleanolic acid methyl ester, uvaol and betulinic acid; Table 1, Figure S1). However, when these data were separated by state or city, there was considerable heterogeneity within a region or between cities in a given state (Figure 2b,c). Indeed, 52% of metabolite features were only detected in one soil sample, with 80% in five samples There were 3407 metabolite features retained in our analysis after blank removal. Of those, a core 326 metabolite features were common across all regions (Figure 2a). These include the insect repellent diethyltoluamide (DEET), and plant-derived metabolites (oleanolic acid methyl ester, uvaol and betulinic acid; Table 1, Figure S1). However, when these data were separated by state or city, there was considerable heterogeneity within a region or between cities in a given state (Figure 2b,c). Indeed, 52% of metabolite features were only detected in one soil sample, with 80% in five samples or less ( Figure 2d). This indicates that there is still considerable scope for metabolite discovery in soil samples, and highlights the need for large-scale analyses of soil samples.
Metabolites 2020, 10, 86 4 of 17 or less (Figure 2d). This indicates that there is still considerable scope for metabolite discovery in soil samples, and highlights the need for large-scale analyses of soil samples. To determine the local metabolites driving the differences between sampling sites, we focused on Oklahoma, the state for which we had the most samples, and restricted our analysis to cities in Oklahoma with at least ten distinct soil samples analyzed: Norman, Oklahoma City and Binger. Even between these closely-located cities, only 161 overlapping metabolite features were identified (11.7%, Figure 2c). In contrast, 35.5% of metabolite features detected in Binger were not detected in Norman or Oklahoma City, 60% of metabolite features detected in Oklahoma City were not shared with the other two cities and 43.1% of the metabolite features detected in Norman were not detected in the other two cities. Annotatable metabolite features uniquely detected in only one of these three locations compared to the other two cities include human activity-derived metabolites that could reflect differences in behavior between inhabitants of these cities or season of sample collection, as well as plant-derived metabolites that may represent differences in gardening choices between locations. For example, the sunscreen constituent oxybenzone (m/z 229.086, retention time (RT) 5.92 min, Figure S1) was detected in Oklahoma City but not Norman or Binger (although it was also detected in Ladera Ranch, CA and Blue Springs, MO). Another sunscreen constituent, m/z 179.070, RT 7.89 min, annotated as 2-propenoic acid, 3-(4-methoxyphenyl), was only found in Oklahoma City ( Figure S1). Likewise, the herbicide indaziflam (m/z 302.177, RT 4.94 min, Figure S1) was only detected in Norman and in no other sampled city. The veterinary anthelminthics oxfendazole (m/z 316.075, RT 3.43 min) and fenbendazole (m/z 300.080, RT 4.68 min) were both found in Oklahoma To determine the local metabolites driving the differences between sampling sites, we focused on Oklahoma, the state for which we had the most samples, and restricted our analysis to cities in Oklahoma with at least ten distinct soil samples analyzed: Norman, Oklahoma City and Binger. Even between these closely-located cities, only 161 overlapping metabolite features were identified (11.7%, Figure 2c). In contrast, 35.5% of metabolite features detected in Binger were not detected in Norman or Oklahoma City, 60% of metabolite features detected in Oklahoma City were not shared with the other two cities and 43.1% of the metabolite features detected in Norman were not detected in the other two cities. Annotatable metabolite features uniquely detected in only one of these three locations compared to the other two cities include human activity-derived metabolites that could reflect differences in behavior between inhabitants of these cities or season of sample collection, as well as plant-derived metabolites that may represent differences in gardening choices between locations. For example, the sunscreen constituent oxybenzone (m/z 229.086, retention time (RT) 5.92 min, Figure S1) was detected in Oklahoma City but not Norman or Binger (although it was also detected in Ladera Ranch, CA and Blue Springs, MO). Another sunscreen constituent, m/z 179.070, RT 7.89 min, annotated as 2-propenoic acid, 3-(4-methoxyphenyl), was only found in Oklahoma City ( Figure S1). Likewise, the herbicide indaziflam (m/z 302.177, RT 4.94 min, Figure S1) was only detected in Norman and in no other sampled city. The veterinary anthelminthics oxfendazole (m/z 316.075, RT 3.43 min) and fenbendazole (m/z 300.080, RT 4.68 min) were both found in Oklahoma City and not in Norman or Binger, with oxfendazole not detected in any of the other cities we analyzed ( Figure S1, Table 1). These differences may reflect pet ownership and differential veterinary or seasonal practices. Several plant-derived metabolites were found at different levels between Binger, Norman and Oklahoma City, including isoliquiritin (m/z 257.081, RT 4.65 min) and globulol (m/z 163.148, RT 8.39 min), both detected only in Oklahoma City backyards ( Figure S1, Table 1).
To identify metabolites with differential abundance between these three locations, we also built a random forest classifier on metabolites recovered from these locations, classifying by city. Of the top 30 most differential metabolites between Binger, Oklahoma City and Norman (Table S1), only one had an annotation in Global Natural Products Social Molecular Networking (GNPS) that passed our quality criteria (see Methods), the plant metabolite phytol (m/z 279.304, RT 7.95 min), which was higher in the more rural Binger compared to Norman and Oklahoma City (Kruskal-Wallis p = 2.70e−05). Differences in plant-derived metabolites between sampling sites may be due to the types and amounts of plants selected by each household, or to the season of sampling. It is, however, important to note that most metabolites are unique to a given sample (a given backyard) and not shared between multiple locations, even within the same city ( Figure 2d).

Specific Chemistries Identified in Backyard Soil Samples
To explore the specific metabolites found in backyard soils, we performed feature-based molecular networking [16,18] and chemical ontology analyses of the detected metabolites [17]. Feature-based molecular networking analysis grouped our 3407 detected metabolite features into 171 chemical families (sub-networks) of ≥3 members (Figure 3a), 227 families of two metabolite features (454 network nodes) and 1637 singletons. There was often significant heterogeneity in geographic distribution within a given chemical family. To illustrate this heterogeneity, we focused on terpenes (including triterpenoids and diterpenoids). These are common plant-derived metabolites [19] that were readily annotatable in our dataset. Some triterpenoids were found uniquely in a given state (e.g., m/z 409.346, RT 7.56 min, annotated as echinocystic acid and found only in CA), while other triterpenoids are found in multiple states and regions (e.g., m/z 443.389, RT 8.30 min, annotated as uvaol and found in Central, Northeast, West and South regions (OK, PA, CA, TN)) ( Figure 3c). Chemical ontology analysis further showed that the soil samples are chemically diverse, with detected features grouped into 13 ClassyFire [20] chemical super classes, 75 classes and 118 subclasses (Figure 3b). The most common chemical superclass was lipids and lipid-like molecules (853 metabolite features), which is consistent with the fact that lipids are commonly found in soils [21,22], and organoheterocyclic compounds (487 metabolite features).

Discussion
In this study, we report the metabolomic analysis of 188 soils from across the USA. City, state and NOAA climate region affected the overall metabolite composition, with most metabolite features unique to a given backyard sample (Figure 2d). The impact of climate region on soil metabolites was minor (PERMANOVA R 2 = 0.0738), indicating that the difference between each region was small in comparison with other phenomena (Figure 1b-e). Source state had a larger impact on soil metabolites (PERMANOVA R 2 = 0.150), with samples from OK, MO and CA showing partially distinct clustering from other states (Figure 1c). The largest geographic impact on overall soil metabolite profile was observed at the city level (R 2 = 0.338, Figure 1d,e), indicating that local phenomena explain more of the variation in soil metabolites than broader geography. This was further supported by the considerable heterogeneity in metabolite abundance between locations, even within a given chemical family (Figure 3a,c). Such local factors influencing the soil may include temperature, light radiation, or human factors such as pollution and other human behavior-associated factors. Indeed, several airborne pollutants such as polycyclic aromatic hydrocarbons (PAHs) deposit onto soils [25]. Although our instrumental conditions did not enable us to detect PAHs, we observed many man-made chemicals in our soil samples, including pesticides, insecticides, medication, personal care products and coatings (Table 1).
Roughly 10% of soil metabolite features (326 out of 3407) were, however, observed in at least one sample from each region, indicating a core backyard soil metabolite profile. Indeed, in common with prior soil analyses [21,22,26], we detected high frequencies of lipids (853 metabolite features), as well as amino acids and organic nitrogen compounds. Likewise, similar to studies of root-associated metabolites [27][28][29], we too observed hydroxycinnamic acid derivatives, flavonoids, triterpenoids and other organic acids. Annotatable metabolites detected in our study differed significantly from annotatable metabolites described in prior untargeted analyses of agricultural soils [13,14], likely due to differences in metabolite extraction and instrumental protocols. Overlapping metabolites between the study of agricultural soil by Jenkins et al. [13] and this study include valine and adenosine. A majority of annotatable metabolites in Hewavitharana et al. [14] were lipids, and indeed most annotatable metabolites in our study were lipids and lipid-like molecules (Figure 3b), with pentacyclic triterpenoid derivatives (e.g., ursolic acid) found in both studies. However, these studies of agricultural soils exclusively reported metabolites of natural origin and did not report man-made chemicals. The soil samples we analyzed were collected from backyards and contained many human activity-derived chemicals. These are likely to represent runoff from activities specific to the corresponding household, or to each household's specific gardening practices. As such, they may represent a fingerprint of household behavior, reminiscent of prior metabolomics studies of the built environment [30][31][32].
A major strength of this study is the large number of samples analyzed from across a broad geographic range, in contrast with many prior soil metabolomics studies (e.g., [12,26]). However, due to the location of this citizen science soil collection project, OK was over-represented compared to other states. Likewise, samples are self-submitted by participants, so that limited metadata is available concerning temperature, weather, plants or human intervention on these soils, even though these are known to affect soil metabolites [2,33]. As with any metabolomics studies, metabolite recovery is affected by experimental procedures. As such, our observations are limited to metabolites soluble in methanol/dichloromethane/ethyl acetate/acetonitrile and ionizable in positive mode. Annotation remains a challenge in metabolomics; indeed, in our dataset, out of 3407 metabolite features, only 55.5% of metabolite features could be assigned to a ClassyFire chemical class, and only one of the most differential metabolites observed between Norman, Binger and Oklahoma City, OK, could be annotated. Nevertheless, our study significantly expands our understanding of soil metabolites, both natural and man-made, and serves as the foundation for further large-scale studies of soil metabolomics.

Sample Selection
Soil samples were obtained from the University of Oklahoma Citizen Science Soil Collection Program (https://whatsinyourbackyard.org/). In total, 188 soil samples were analyzed, representing 45 cities, across 14 states in five of the USA NOAA climate regions (Figure 1a, Table 2).

Metabolite Extraction
Metabolite extraction methods were adapted from previous publications [10,12,34]. Briefly, soil samples were lyophilized overnight. Dried samples were weighed and ca. 50 mg retained for analysis. Dried soils were then pulverized with a 5 mm stainless steel bead (Qiagen), in a TissueLyzer II (Qiagen) set to 25 Hz for 5 min. Pulverized samples were homogenized at 25 Hz for 5 min in methanol/dichloromethane/ethyl acetate/acetonitrile (1:1:1:1 v/v) spiked with 2 µM sulfachloropyridazine at a 1 mL extraction solvent per 50 mg of soil ratio. Homogenates were shaken in a rotary shaker (Innova 2000, Eppendorf, 150 rpm, 60 min) and then sonicated for 30 min. Finally, the samples were centrifuged (14,800 rpm, 15 min) at room temperature. A 200 µL volume of final supernatant was transferred into 96-well plates to perform LC-MS/MS analysis.

LC-MS/MS
Samples were analyzed in randomized order, with an injection volume of 5 µL. Before each draw, the auto-injector was washed for 2s (10 µL/s) with 10% methanol. Ultra-high performance liquid chromatography was performed on a Thermo Vanquish instrument, using a 1.7 µm Kinetex C18 50 × 2.1 mm column, 100 Å pore size, protected by a SecurityGuard ULTRA C18 Guard Cartridge (Phenomenex), with water + 0.1% formic acid as mobile phase A and acetonitrile + 0.1% formic acid as mobile phase B, at a 0.5 mL/min flow rate. Autosampler temperature was set to 10 • C and column compartment to 40 • C. LC gradient was set according to Table 3. Table 3. LC gradient.

Data Analysis
Raw MS dataset was converted into mzXML format using MSconvert [35] and imported into MZmine v.2.51 [36]. MS features were identified using parameters described in Table 5. Data were filtered to remove all metabolite features with intensity within 3-fold of blank samples and normalized to total signal (TIC normalization). Principal coordinate analysis (PCoA) was performed using the Bray-Curtis-Faith dissimilarity metric, using QIIME1 version 1.9.1 [38] and visualized in EMPeror [39]. Global Natural Products Social Molecular Networking platform (GNPS) was used to perform feature-based molecular networking [16,18], with the following parameters: precursor ion mass tolerance of 0.02 Da, fragment ion mass tolerance of 0.02 Da, minimum cosine score of 0.7 and 4 or more matched fragment ions. The maximum shift allowed between two MS/MS spectra was 500 Da, 10 maximum neighbor nodes allowed and the maximum difference between precursor ion mass of searched MS/MS spectrum and library spectra was 100 Da. Molecular network visualization was done in Cytoscape 3.4.0 [40]. Chemical structural information within the molecular network was obtained using the GNPS MolNetEnhancer workflow [17] which incorporated in silico structure annotations from GNPS Library Search, Network Annotation Propagation (NAP) [41] and DEREPLICATOR [42]. DEREPLICATOR was run as part of our feature-based molecular networking job (Advanced external tools: Run Dereplicator enabled). Parameters were therefore kept to defaults: precursor and fragment ion mass tolerance, 0.02 Da; search analogs (VarQuest [43]), enabled; PNP database; max charge, 2; accurate P-values, disabled; minimum number of amino acids, 5. NAP (version 1.2.5) parameters were as follows: N first candidates for consensus score, 10; cosine value to sub-select inside a cluster, 0.8; use fusion result for consensus, enabled; accuracy for exact mass candidate search, 15 ppm; acquisition mode, positive; adduct ion type, [M+H]; structure databases: HMDB, GNPS, SUPNAT, CHEBI; maximum number of candidate structures in the graph, 10; workflow type, MZmine. All annotations are at confidence level 2-3 according to the metabolomics standards initiative [24]. Maps were generated in R using packages muRL, zipcode, ggplot2, mapproj, viridis and RColorBrewer [44]. Random forest analysis was implemented as in our prior work [30], in R using the package randomForest, with 1000 trees and correction for unequal group sizes (see Jupyter Notebook on GitHub at https: //github.com/mccall-lab-OU/soils for code details).