An Untargeted Metabolomics Approach to Study the Variation between Wild and Cultivated Soybeans

The differential metabolite profiles of four wild and ten cultivated soybean genotypes were explored using an untargeted metabolomics approach. Ground soybean seed samples were extracted with methanol and water, and metabolic features were obtained using ultra-high-performance liquid chromatography coupled with high-resolution mass spectrometry (UHPLC-HRMS) in both positive and negative ion modes. The UHPLC-HRMS analysis of the two different extracts resulted in the putative identification of 98 metabolites belonging to several classes of phytochemicals, including isoflavones, organic acids, lipids, sugars, amino acids, saponins, and other compounds. The metabolic profile was significantly impacted by the polarity of the extraction solvent. Multivariate analysis showed a clear difference between wild and cultivated soybeans. Unsupervised and supervised learning algorithms were applied to mine the generated data and to pinpoint metabolites differentiating wild and cultivated soybeans. The key identified metabolites differentiating wild and cultivated soybeans were isoflavonoids, free amino acids, and fatty acids. Catechin analogs, cynaroside, hydroxylated unsaturated fatty acid derivatives, amino acids, and uridine diphosphate-N-acetylglucosamine were upregulated in the methanol extract of wild soybeans. In contrast, isoflavonoids and other minor compounds were downregulated in the same soybean extract. This metabolic information will help breeders and biotechnology professionals develop value-added soybeans with improved quality traits.


Introduction
Soybean is an important leguminous crop providing meal and oil for food (~20%), animal feed (~76%), biodiesel, and other industrial uses (~4%) [1]. According to an estimation by the U.S. Department of Agriculture, in the 2019/2020 harvest year, the world soybean production was 337 million tons, and an increase of 8% (362 million tons) was observed in the 2020/2021 harvest year [2]. Three countries, namely Brazil, the United States of America, and Argentina, accounted for approximately 80% of the world's production of soybeans [3]. In the United States, the total soybean production was 4.3 billion bushels in 2022 [4]. Soybeans contain diverse primary and secondary metabolites, including proteins, carbohydrates, lipids, amino acids, and phytochemicals (isoflavones, saponins) that contribute to their nutritional and health-promoting properties [5]. Global breeding programs and biotechnology approaches have been used to develop new varieties of soybeans with improved quality traits. A better understanding of how agronomic practices, environmental conditions, pre-and post-harvest storage, and processing conditions impact metabolite changes in soybeans is essential to further develop soybeans with improved quality traits. cultivars and identify the metabolites responsible for the differentiation. Furthermore, we also wanted to evaluate the role of extraction solvents in untargeted metabolomics.

Results and Discussion
In this study, we selected four wild and ten cultivated soybean genotypes from different regions of the world for untargeted metabolomics investigation. Methanol and water extracts of all samples were analyzed to investigate their metabolic profiles in both positive and negative ion modes.

Identification of Compounds
Thousands of metabolite features were obtained using the Compound Discoverer 3.3 software program. Among them, 98 metabolites belonging to amino acids, organic acids, sugars, isoflavones, and soy saponins in methanol and water extracts were identified by careful analysis of the MS/MS fragments and comparison with the available literature data. A total of 64 compounds were detected with the positive and negative ion modes in the water extract, whereas 35 compounds were identified in the methanol extract. Some compounds were detected in both extracts (methanol and water) and in both ionization modes (positive and negative). The details of the compounds identified are summarized in Table 1 (methanol extract) and Table 2 (water extract).
Table 1. Identification of metabolites using high-resolution mass spectrometry of methanol extract of fourteen soybean cultivars (ten cultivated and four wild) in positive and negative ion modes.
Soybean seeds are one of the most concentrated natural sources of isoflavones in human diets [22]. In soybeans, isoflavones are found in free and conjugated forms. In conjugated forms, they can occur with sugars (glucosides) and/or acids (acetyl/malonyl). The three common free isoflavones in soybean seeds are genistein, daidzein, and glycitein. The [M + H]+ ions of daidzein, genistein, and glycitein were observed at m/z 255.0650, 271.0612, and 285.0758, and the [M − H]− ions at m/z 253.0506, 269.0454, and 283.0611, respectively. In the present study, we identified five analogs of daidzein at tR 0.99, 5.82, 6.26, 6.58, and 8.23 min. The compounds can be conjugated with sugars, acetylated, and/or malonylated. Such analogs have been reported in several targeted analyses in the literature [23][24][25].
Amino acids were detected as the second major group of metabolites in soybean seeds. Soybean seeds provide a rich source of plant-based proteins. The protein content in soybean seeds is around 40%, as documented in the published literature [26,27]. We identified 17 free amino acids and two acetylated analogs of amino acids in the water extracts using positive and negative ion modes. Only six amino acids were detected in the methanol extracts. Similar amino acid profiles have been reported for soybeans using targeted analysis after derivatization with aminopyridyl-N-hydroxysuccinimidyl carbamate (APDS) reagent [28].
In addition to amino acids and isoflavones, over 60 other organic compounds were also detected in water and methanol extracts of soybean seeds. These include organic acids, flavonoids, sugars, saponins, fatty acids, and other phytochemicals. Organic acids are one of the major components affecting soybeans' overall quality and taste. A recent study by Hyeon et al. reported some organic acids in soybean, such as malic acid, citric acid, and succinic acid [29]. In another recent study, Song et al. reported ten organic acids in soybeans using NMR analysis [30]. Similarly, the presence of epicatechin and sugars has also been reported previously by Hyeon et al. and Song et al. [29,30]. In our earlier publications, we also reported the presence of sugars in soybean by targeted ion chromatography and of fatty acids by targeted GC-MS analysis after derivatization [20,21].

Comparison of Metabolites among Cultivars
All metabolites can be broadly classified into five major subgroups: amino acids, phenolics, organic acids, sugars, and miscellaneous. To compare the amounts of metabolites produced in each cultivar, all metabolites were organized based on the area under the curve for the mass ion extracted with two different solvents, water (Figure 1A) and methanol (Figure 1B). The total amount of amino acids, one of the significant classes of metabolites obtained from water extract, varied significantly between 25% and 75% within cultivated and wild soybean cultivars. Based on the areas under the curve for the targeted mass ion, the maximum amount of amino acids was obtained in the Asian cultivar (C8). Zarkadas et al. assessed the protein quality of 14 soybean cultivars using targeted amino acid analysis and two-dimensional electrophoresis. The authors indicated that all fourteen cultivars contained a good balance of essential amino acids [31]. Similar free amino acids were observed in fermented and unfermented soybeans and mung beans using targeted amino acid analysis after derivatization. The content of free amino acids was increased by 13-fold and 32-fold in fermented mung and soybeans, respectively. The authors showed that fermentation improved the amino acid content in a single soy and mung bean cultivar [32]. Dajanta et al. carried out a similar analysis and characterization of the amino acid content of thua nao, a traditionally fermented food of northern Thailand [33]. Significant variation in the phenolics area was seen in different cultivars. A maximum amount of phenolics was detected in the modern elite cultivar C11. The relative percentage of phenolics in other cultivars varied between 7% and 62%, with the lowest amount in cultivar C2. Similar variations in the total phenolic content (6.67 µg −1 in Pureunkong to 72.33 µg −1 in Poongsannamulkong) were observed in seven cultivars of soybeans by Kim et al. [34].
However, the maximum amount of organic acids was detected in ancestral (C9) and modern elite (C4) cultivars, with others showing variation between 25 and 77%.
In contrast to the water extract, phenolics were the predominant metabolites in the methanol extract. It has been documented that methanol, ethanol, acetone, water, and their water mixtures, with or without acids, are the most widely used solvents for extracting phenolic compounds [35,36]. Boeing et al. reported that among the pure solvents, methanol is the most effective solvent for the extraction of antioxidant compounds [37,38].
Significant variations in phenolics content were observed in the present study between cultivars (33-95%). The maximum amount of phenolics was detected in wild cultivar C13, with the lowest amount in Asian cultivar C8. This was different from the water extract, where the modern elite C11 cultivar showed the maximum amount. Similarly, the total amount of amino acids varied between 40 and 80% among cultivars. As in the water extract, the maximum amount of amino acids in the methanol extract was obtained in the Asian cultivar (C8). However, the maximum amount of organic acids was detected in Asian (C2 and C5) cultivars, with others showing variation between 20 and 90%. The total amount of sugar varied between 30 and 90%, and Asian cultivar C8 produced the maximum amount of sugar compared to other cultivars. Since no distinct systematic variations in the area under the curve for the mass ions of different metabolites between cultivars were observed, multivariate analysis was done to differentiate between cultivars.
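The relative percentages quoted above can be illustrated with a minimal sketch. It assumes, as one plausible reading of the text, that the summed peak area of a metabolite class in each cultivar is expressed relative to the cultivar with the largest area. The function name and the peak areas below are hypothetical and are not measured values from this study.

```python
def relative_percent(areas):
    """Express each cultivar's summed peak area as a percentage of the largest one."""
    top = max(areas.values())
    return {cultivar: 100.0 * area / top for cultivar, area in areas.items()}


# Hypothetical summed peak areas (arbitrary units) for one metabolite class.
phenolics_area = {"C2": 0.7e9, "C8": 3.1e9, "C11": 1.0e10, "C13": 6.2e9}

relative = relative_percent(phenolics_area)
for cultivar, pct in sorted(relative.items(), key=lambda kv: -kv[1]):
    print(f"{cultivar}: {pct:.0f}%")
```

With this normalization the cultivar with the largest area is always reported as 100%, and the remaining cultivars fall below it.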

Classification of Wild and Cultivated Soybeans Using Principal Component Analysis (PCA) and Volcano Plots
Unsupervised analysis of the entire UHPLC-HRMS data using Progenesis QI resulted in the detection of several thousand ion features from each extract. The intensity of each ion was extracted and used for principal component analysis. PCA of the normalized intensity data of methanol and water extracts showed some differentiation of wild and cultivated soybeans, as shown in Figure 2A and Figure 2B. The variances captured by the two components (PC1 and PC2) were between 29 and 49%. A further supervised partial least squares discriminant analysis (PLS-DA) was performed, and the score plots are shown in Figure 3A and Figure 3B. Clear separations were observed between the wild and cultivated soybeans on the score plots; however, the separation between the cultivated soybeans was not obvious. For the metabolites from methanol extraction, the PLS-DA model resulted in the cross-validated predictive ability Q2(Y) of 39.9%. A value of 37.4% of the variance in X [R2(X)] was used to account for 21.1% of the variance of Y [R2(Y)]. For the metabolites from water extraction, the PLS-DA model resulted in the cross-validated predictive ability Q2(Y) of 64.8%. A value of 84.2% of the variance in X [R2(X)] was used to account for 84.8% of the variance of Y [R2(Y)]. This suggested that the model built on the metabolites from the water extraction had better predictive ability. The t-test (p-value) and fold change served as criteria for selecting the most discriminatory metabolites.
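For readers less familiar with these model statistics, R2(Y) and Q2(Y) can be written in their commonly used form as below. This is a general textbook definition given as an aid, not a description of the exact computation performed by the software used here. The symbols are assumptions introduced for illustration: y_i is the class response of sample i, \hat{y}_i is the fitted value from the full model, \hat{y}_{(i)} is the prediction for sample i from a model fitted with that sample's cross-validation group left out, and \bar{y} is the mean response.

```latex
R^2(Y) = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2},
\qquad
Q^2(Y) = 1 - \frac{\sum_i \left(y_i - \hat{y}_{(i)}\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2}
```

R2(Y) therefore measures goodness of fit, while Q2(Y) measures how well left-out samples are predicted.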
Two volcano plots were constructed to identify the metabolites from both methanol and water extracts that were differentially expressed in wild and cultivated soybeans (Figure 4A and Figure 4B). A fold-change threshold (≥4) and a t-test threshold (p ≤ 0.05) were used as cutoff values for the volcano plots, which retained around 150-400 metabolite ion features as the prominent ions responsible for the variation of metabolites. The data were categorized into three fractions: statistically insignificant (blue), upregulated (orangish-brown), and downregulated (grey). As seen with the filtered data set, which contained several hundred metabolites, a few compounds were either up- or downregulated between wild and cultivated soybeans. Careful analysis of the fragmentation ions and comparison with the literature data significantly reduced the number of putatively identified compounds that were downregulated in cultivated and wild soybeans.
Metabolites upregulated in the methanol extract of wild soybeans compared to the cultivated soybeans were a catechin analog, cynaroside, hydroxylated unsaturated fatty acid derivatives, and uridine diphosphate-N-acetylglucosamine. Similar observations were reported by Hyeon et al., who showed with PCA that amino acids, organic acids, and fatty acids were higher in cultivated black soybeans as compared to wild black soybeans [30]. However, a higher content of isoflavones and other flavonoid derivatives was determined in the cultivated soybeans compared to wild soybeans. Some of the metabolites identified in the water extract showed similar trends. Isoflavone analogs, soy saponin, flavonoid analogs, amino acids (glutamine and guanine), a lactic acid derivative, and 6-hydroxy caproic acid were determined in higher amounts in cultivated soybeans as compared to wild soybeans. Upregulated compounds in wild soybeans were tentatively identified as phloretin, hypoxanthine, glutaric acid, and tyrosine. These results will be of significant value to soybean breeders and biotechnology researchers in developing new varieties of value-added soybeans with improved quality traits.
Figure 4. Volcano plots of identified metabolites from ten cultivated and four wild soybeans extracted with methanol (A) and water (B). The numbers noted in the score plot are marked as * in Table 1 for the methanol extract and Table 2 for the water extract.
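The volcano-plot selection described in this section can be sketched as follows. The thresholds (fold change ≥4, p ≤ 0.05) come from the text. Everything else is an assumption made for illustration: the fold change is taken as wild over cultivated mean peak area, the p-values are supplied rather than computed, and the feature names and numbers are hypothetical.

```python
import math

FC_THRESHOLD = 4.0   # fold-change cutoff stated in the text
P_THRESHOLD = 0.05   # t-test cutoff stated in the text


def classify(mean_wild, mean_cultivated, p_value):
    """Return volcano-plot coordinates and a category for one ion feature.

    Assumption: fold change = wild / cultivated, so a positive log2 value
    means the feature is higher in wild soybeans.
    """
    log2_fc = math.log2(mean_wild / mean_cultivated)
    passes_fc = abs(log2_fc) >= math.log2(FC_THRESHOLD)
    passes_p = p_value <= P_THRESHOLD
    if passes_fc and passes_p:
        label = "upregulated" if log2_fc > 0 else "downregulated"
    else:
        label = "not significant"
    return log2_fc, -math.log10(p_value), label


# Hypothetical features: (mean area wild, mean area cultivated, t-test p-value).
features = {
    "feature_A": (8.0e7, 1.0e7, 0.001),
    "feature_B": (1.0e6, 9.0e6, 0.020),
    "feature_C": (5.0e6, 4.0e6, 0.300),
    "feature_D": (9.0e7, 1.0e7, 0.200),
}

results = {name: classify(*values) for name, values in features.items()}
for name, (x, y, label) in results.items():
    print(f"{name}: log2FC={x:+.2f}, -log10(p)={y:.2f}, {label}")
```

A feature must pass both cutoffs to be flagged, so a large fold change with a weak p-value (feature_D here) stays in the insignificant fraction.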

Solvents and Materials
LC-MS-grade solvents, including acetonitrile, methanol, water, and formic acid, were used for the extraction and chromatographic separation. These organic solvents were purchased from Fisher Scientific (Pittsburgh, PA, USA). Extractions were carried out in a 15 mL centrifuge tube obtained from Thermo Scientific (Waltham, MA, USA). Polyvinylidene difluoride (PVDF) syringe filters with a pore size of 0.45 µm were purchased from National Scientific Company (Duluth, GA, USA).

Samples
Fourteen soybean cultivars were collected for this study. Four of these cultivars are wild (C3, C12-C14), and the other ten cultivated cultivars are categorized into three groups, namely, Asian landraces (C2, C5, C6, and C8), ancestral (C7 and C9), and modern elite (C1, C4, C10, and C11). All soybean samples were obtained from the soybean germplasm collection (USDA, Urbana, IL, USA). The cultivar details, i.e., accession number, origin, and genotype information (wild soybean (G. soja), soybean bred for seed traits, and soybean landraces), were previously reported in our earlier publication [22]. In the present study, soybean seeds were ground in a commercial coffee grinder and stored in an ultralow temperature (<−60 °C) freezer prior to analysis.

Extraction of Metabolites
For each sample, 150 ± 0.05 mg of soybean seed powder was weighed into a 15 mL centrifuge tube and extracted with 5 mL of methanol (polarity index 5.1) in an ultrasonic bath (power 400 watts, Advanced Sonic Processing Systems, Oxford, CT, USA) for 15 min (twice). Samples were extracted with water (polarity index 10.2) in the same way. The extracts were centrifuged at 4000 rpm for 15 min and filtered using a 0.45 µm PVDF filter. The clean filtrate (500 µL) containing extracted metabolites was transferred to 2 mL HPLC vials and subjected to UHPLC-MS/MS analyses. All analyses were carried out in triplicate.

Data Acquisition
The Vanquish UHPLC system (Thermo Scientific), consisting of a binary pump, column compartment, autosampler, photodiode array (PDA) detector, and charged aerosol detector (CAD), coupled with an Exploris 240 mass spectrometer, was used to acquire high-resolution mass data in full-scan and data-dependent acquisition mode for all samples. An aliquot of each extract was analyzed using both positive and negative ionization modes. The metabolites were separated on a C18 Agilent column (Eclipse Plus, 4.5 × 50 mm, 1.8 µm, 1200 bar pressure limit) using the following gradient program: 10% B at 0 min, increasing gradually to 30% at 5 min, 60% at 10 min, and 95% at 15 min, held at 95% from 15 to 18 min, then reduced to 10% at 18.5 min. Water and acetonitrile acidified with 0.1% formic acid were used as mobile phases A and B, respectively. The flow rate and the injection volume were maintained at 0.5 mL/min and 10 µL, respectively.
The HRMS mass range was 100-2000 m/z, and the ESI conditions were as follows: sheath, auxiliary, and sweep gas at 50, 10, and 1 (arbitrary units), respectively, spray voltage at 3.4 kV, capillary temperature at 320 °C, and vaporizer temperature at 350 °C. The full scan mass spectra and three dd-MS2 events were acquired at a resolving power of 12,000. An isolation width of 1 amu, a maximum ion injection time of 100 ms, stepped collision energies of 30, 50, and 150, and an activation time of 10 ms were used for MSn activation. Xcalibur 4.4, including the FreeStyle 1.8 software package, was used to analyze the mass spectral data.
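The gradient program above can be written as a small lookup that returns the mobile phase B percentage at any time point. This is only a sketch: it assumes linear ramps between the stated breakpoints, and the names GRADIENT and percent_b are introduced here for illustration.

```python
# (time in min, % mobile phase B) breakpoints taken from the gradient program.
GRADIENT = [
    (0.0, 10.0),
    (5.0, 30.0),
    (10.0, 60.0),
    (15.0, 95.0),
    (18.0, 95.0),   # hold at 95% B from 15 to 18 min
    (18.5, 10.0),   # return to starting conditions
]


def percent_b(t):
    """Return %B at time t (min), assuming linear ramps between breakpoints."""
    if t <= GRADIENT[0][0]:
        return GRADIENT[0][1]
    for (t0, b0), (t1, b1) in zip(GRADIENT, GRADIENT[1:]):
        if t0 <= t <= t1:
            return b0 + (b1 - b0) * (t - t0) / (t1 - t0)
    # After the last breakpoint the column stays at the final composition.
    return GRADIENT[-1][1]


for t in (0.0, 2.5, 7.5, 16.0, 18.25, 20.0):
    print(f"t = {t:5.2f} min -> {percent_b(t):5.1f}% B")
```

Mobile phase A is then simply 100 minus the returned value at each time point.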

Identification of Compounds
Compound Discoverer (Version 3.3, Thermo Scientific) software was used for the putative identification of metabolites. This involved the application of several filters, namely background subtraction, MSn fragmentation information, ∆Mass (±5 ppm), minimum area, and FISh score, and screening of the data against multiple databases (in-house mass library for soybean, mzCloud, ChemSpider, Metabolika, and other available online databases). Furthermore, mass spectral data (molecular ion mass (m/z) and MS/MS fragmentation patterns) were compared with literature-reported data [22][23][24].
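The ∆Mass (±5 ppm) filter can be illustrated with the m/z values reported in this paper for the three free isoflavones. The sketch below computes theoretical [M + H]+ and [M − H]− values from the molecular formulas (daidzein C15H10O4, genistein C15H10O5, glycitein C16H12O5) and compares them with the observed ions. The helper names are hypothetical, and this is not the algorithm used internally by Compound Discoverer.

```python
# Monoisotopic masses (u) of the elements needed here, and the proton mass.
MONO = {"C": 12.0, "H": 1.00782503, "O": 15.99491462}
PROTON = 1.00727647


def neutral_mass(formula):
    """Monoisotopic mass of a neutral molecule given as {element: count}."""
    return sum(MONO[element] * count for element, count in formula.items())


def ppm_error(observed, theoretical):
    """Mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


# name: (formula, observed [M + H]+, observed [M - H]-) as reported in the text.
ISOFLAVONES = {
    "daidzein": ({"C": 15, "H": 10, "O": 4}, 255.0650, 253.0506),
    "genistein": ({"C": 15, "H": 10, "O": 5}, 271.0612, 269.0454),
    "glycitein": ({"C": 16, "H": 12, "O": 5}, 285.0758, 283.0611),
}

errors = {}
for name, (formula, obs_pos, obs_neg) in ISOFLAVONES.items():
    mass = neutral_mass(formula)
    errors[name] = (
        ppm_error(obs_pos, mass + PROTON),   # [M + H]+
        ppm_error(obs_neg, mass - PROTON),   # [M - H]-
    )
    pos, neg = errors[name]
    print(f"{name}: [M+H]+ {pos:+.2f} ppm, [M-H]- {neg:+.2f} ppm")
```

Under these assumptions, all six observed ions fall inside the ±5 ppm window.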

Data Processing
Raw LC-MS data were analyzed using the Compound Discoverer program to collect features for statistical analysis. The overall workflow of the program includes the detection of chromatographic peaks, extraction of the MS spectrum, deconvolution of the overlapping ions based on their isotope patterns, and integration of their respective peak areas. Acquired UHPLC-HRMS raw files were processed by using Nonlinear Progenesis QI (Durham, NC, USA) for peak detection, noise filtering, and peak alignment. Important deconvolution parameters were mass tolerance of 5 ppm, retention time tolerance of 0.2 min, peak rating threshold of 4, minimum peak intensity of 10⁹, chromatographic threshold S/N of 1.5, CV contribution of 10, and an area contribution of 3. The resulting areas of each sample in triplicate were exported to Microsoft Excel 365 (Microsoft, Redmond, WA, USA) for volcano plot construction.

Statistical Analysis
A data matrix was generated from Progenesis QI, including a variable index (paired m/z-retention time), sample names (observations), and peak intensities. The peak intensities in each sample were scaled by Pareto scaling before further multivariate analysis using SIMCA 13.0 (Sartorius Stedim Biotech, Umeå, Sweden). Key metabolites responsible for separating different soybean genotypes were further isolated by constructing volcano plots using the Microsoft Excel application. Based on the loadings scores and a threshold of 0.05 for the Student's t-test of individual samples, key metabolites were selected and identified. The log10 value of the peak area was used to compare the levels of metabolites between samples.
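Pareto scaling, as commonly defined, mean-centers each variable and divides it by the square root of its standard deviation, which reduces the dominance of high-intensity ions without giving noise the full weight that unit-variance scaling would. The following is a minimal sketch of that definition and not the SIMCA implementation. The function name and the small data matrix are hypothetical.

```python
import math
from statistics import mean, stdev


def pareto_scale(matrix):
    """Pareto-scale a data matrix (rows = samples, columns = ion features).

    Each column is mean-centered and divided by the square root of its
    sample standard deviation.
    """
    scaled_columns = []
    for column in zip(*matrix):
        mu = mean(column)
        sd = stdev(column)
        divisor = math.sqrt(sd) if sd > 0 else 1.0  # leave constant columns at zero
        scaled_columns.append([(value - mu) / divisor for value in column])
    # Transpose back to rows = samples.
    return [list(row) for row in zip(*scaled_columns)]


# Hypothetical peak intensities: three samples, two ion features.
intensities = [
    [1.0, 100.0],
    [3.0, 400.0],
    [5.0, 700.0],
]

scaled = pareto_scale(intensities)
for row in scaled:
    print([round(value, 3) for value in row])
```

After scaling, every column has zero mean, and the spread of the intense second feature is compressed relative to the first.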

Conclusions
Advances in technology, data collection, and analysis software allowed for easy differentiation of wild and cultivated soybeans using UHPLC coupled with HRMS, without any chemical derivatization, in conjunction with PCA. Many recent metabolomics publications in peer-reviewed journals use a single aqueous alcohol solvent mixture for sample extraction. As plants produce hundreds to thousands of metabolites, it is critical to investigate multiple solvent compositions of varying polarity to optimize the extraction of a wide array of metabolites. In this manuscript, we showed that the metabolites extracted and putatively identified in two different solvents with polarity indexes of 5.1 and 10.2 were significantly different, with some overlapping metabolites. A total of 98 metabolites were putatively identified as isoflavones, organic acids, lipids, sugars, amino acids, saponins, and other compounds. The PCA and PLS-DA of the HRMS data allowed easy classification of wild and cultivated soybeans. The major metabolites that allowed the differentiation of wild and cultivated soybeans were isoflavonoids, amino acids, and fatty acids. Several metabolites were up- or downregulated in wild soybeans as compared to cultivated ones. In general, metabolites upregulated in the wild soybeans were catechin analogs, cynaroside, hydroxylated unsaturated fatty acid derivatives, amino acids, and uridine diphosphate-N-acetylglucosamine. Downregulated metabolites were identified as isoflavonoids and other minor compounds. These marker metabolites may be linked to characteristic performance traits desired in soybean breeding for crop improvement. In conclusion, a metabolomics extraction workflow that uses solvents of varying polarity indexes can result in a significant increase in the number of metabolites extracted and identified.
This will allow researchers to extract and identify multiple biomarkers that can provide insights into metabolic pathways and also increase our understanding of how plants interact with varying climatic and growth conditions. It will also enable researchers to breed plants that are resilient to various abiotic and biotic stresses. In addition, the detailed metabolomics information will aid researchers in producing foods with better nutritional traits and yields. This will be needed to improve sustainable agriculture practices and alternatives to animal-based protein products that may help address the global food security challenge posed by an increasing global population and decreasing agricultural land acreage.