Bioinformatic Extraction of Functional Genetic Diversity from Heterogeneous Germplasm Collections for Crop Improvement

: E ﬃ cient utilization of genetic variation in plant germplasm collections is impeded by large collection size, uneven characterization of traits, and unpredictable apportionment of allelic diversity among heterogeneous accessions. Distributing compact subsets of the complete collection that contain maximum allelic diversity at functional loci of interest could streamline conventional and precision breeding. Using heterogeneous population samples from Arabidopsis , Populus and sorghum, we show that genomewide single nucleotide polymorphism (SNP) data permits the capture of 3–78 fold more haplotypic diversity in subsets than geographic or environmental data, which are commonly used surrogate predictors of genetic diversity. Using a large genomewide SNP data set from landrace sorghum, we demonstrate three bioinformatic approaches to extract functional genetic diversity. First, in a “candidate gene” approach, we assembled subsets that maximized haplotypic diversity at 135 putative lignin biosynthetic loci, relevant to biomass breeding programs. Secondly, we applied a keyword search against the Gene Ontology to identify 1040 regulatory loci and assembled subsets capturing genomewide regulatory gene diversity, a general source of phenotypic variation. Third, we developed a machine-learning approach to rank semantic similarity between Gene Ontology term deﬁnitions and the textual content of scientiﬁc publications on crop adaptation to climate, a complex breeding objective. We identiﬁed 505 sorghum loci whose deﬁned function is semantically-related to climate adaptation concepts. The assembled subsets could be used to address climatic pressures on sorghum production. To face impending agricultural challenges and foster rapid extraction and use of novel genetic diversity resident in heterogeneous germplasm collections, whole genome resequencing e ﬀ orts should be prioritized.


Introduction
Plant germplasm collections safeguarded in gene banks conserve the raw materials necessary to confront agricultural challenges. Agricultural challenges come in many forms, from the immediate-disease outbreaks [1,2]; to the persistent-greater yield from fewer inputs (e.g., in maize, [3][4][5]); to future unknowns-the adaptation of cropping systems to no-analog climates and pest assemblages [6,7]. Gene banks conserve DNA sequence variation packaged into reproductive propagules. This DNA sequence variation forms the material basis of the potential phenotypic variation available in a collection that can be used to address challenges.
In plant populations, whether natural or artificial (as in the case of gene bank accessions), segregating DNA sequence variation consists of haplotype blocks-contiguous spans of sequence that are inherited as a unit. Extraction of improved or novel traits from a collection depends on the mobilization of haplotype blocks covering a desired set of genes into breeding lines, and eventually, to elite cultivars. This process has been aided by use of "core collections", subsets of the broader collection designed to contain maximum genetic variation in a compact number of accessions [8]. These subsets decrease the number of crosses, and hence the scale of experiments, necessary to explore the potential phenotypic variation held in a collection.
Germplasm subsets have often been assembled to maximize variation in traits, using phenotypic measurements from field evaluations [9]. The degree to which phenotypic trait variation can be used to capture, and ultimately distribute, underlying genetic diversity is not known although it is unlikely to be a very effective proxy. Much functional genetic diversity is cryptic with respect to phenotype, observable only when recombined into a different genetic background [10][11][12][13]. Molecular marker data is a popular alternative for guiding subset assembly [14][15][16]. While shown to capture more allelic diversity than random sampling [17], typical marker data sets are sparsely distributed across the genome thus haplotype capture at particular loci of agronomic importance may fail [18]. In cases where trait evaluation data and/or molecular marker data are incomplete or unavailable, geographic variation (samples come from widely dispersed sites) and environmental variation (samples come from different environments) have been treated as surrogates for genetic variation and likewise maximized [19][20][21][22]. Reeves and Richards [23,24] showed that the use of geographic and environmental surrogates produces subsets with little added genomewide haplotypic diversity compared to subsets selected at random and, moreover, that the use of these surrogates results in biased capture of important functional genetic variation.
In order to deliver breeding subsets enriched in haplotypic diversity, we support the notion that germplasm collections should be curated and accessed with a central focus on DNA sequence variation [25]. In this study, using three exemplar data sets, we quantify the improvement in genomewide haplotype capture that can be gained by using dense single nucleotide polymorphism (SNP) data sets instead of geographic or environmental surrogate predictors of genetic diversity. Using landrace sorghum we demonstrate three bioinformatic procedures that might be used to assemble subsets enriched in haplotypic diversity at targeted sets of functional loci. This includes a novel machine-learning approach which, by leveraging the information encoded in written language, transduces scientific publications related to complex breeding objectives into pertinent functional genes, at which haplotypic diversity can be efficiently captured in subsets.

Material and Methods
Genomewide SNP data was acquired from published studies on wild European Arabidopsis thaliana (L.) Heynh., a model species, wild North American Populus trichocarpa Torr. & Gray, a source of wood fiber, and landrace African Sorghum bicolor (L.) Moench., a grain/feedstock commonly cultivated in arid areas [26][27][28]. Many published data sets of genomewide SNP variation sample a single individual per accession. Such data are not suitable for representing the genetic diversity present in heterogeneous germplasm collections of wild and landrace material. Therefore, populations/accessions used here were subsampled from the original studies. All contained > 4 individuals, equivalent to accepting a worst-case minor allele frequency (MAF) of 0.125. The Arabidopsis data set contained 40 populations with 441 individuals and 214,051 SNPs on 5 chromosomes; Populus, 56 populations, 251 individuals, 32,860 SNPs on 19 chromosomes; sorghum, 113 individuals, 22 accessions, 404,614 SNPs on 10 chromosomes. For Populus and sorghum, haplotype phase and missing genotypes were imputed using fastPHASE [29]. Provenance details and maps displaying the geographic distribution of the populations/accessions used here are contained in prior publications [23,24,[26][27][28].
Geographic coordinates (latitude/longitude) at site of origin of each population/accession were used to extract 19 environmental variables expressing precipitation and temperature regimes at the sampling site from the BIOCLIM set of data layers (http:/www.worldclim.org). Altitude was also included as an environmental variable. The continuous geographic and environmental variables were quantized into categorical states using the hierarchical method described in Reeves and Richards [24] in order to assemble subsets.
To simulate varying levels of linkage disequilibrium (LD) across the genome, SNP data were recoded into haplotype blocks using the software HAPLOTYPISTA (https://github.com/NCGRP/ haplotypista), which calls alleles according to the genotype induced by concatenating adjacent SNPs. By varying the number of concatenated SNPs from 2 to 50, we explored average physical block lengths from 0.555 ± 1.46 Kbp (kilobase pairs) (± 1 sd) to 27 Loci of interest were identified in heterogeneous landrace sorghum using three scenarios applicable to breeding programs. Scenario one, a candidate gene approach, used an expert-curated list of 135 genes belonging to 10 gene families involved in lignin biosynthesis, derived from Xu et al. [30]. Genomic intervals containing these genes were identified using the Sorghum v1.4 annotation and were found on all 10 sorghum chromosomes. Eighteen intervals contained no SNPs in our 22 accession data set so were not used, leaving 117. We call the set of 117 loci of interest for scenario one "lignin biosynthetic genes".
In the second scenario, we used a keyword search strategy against the Gene Ontology to identify genes with regulatory properties. A body of text ("corpus") containing the GO term ID, name, namespace, and definition lines from the human-readable ontology description file 'go-basic.obo' (release 2016- [12][13][14][15][16][17][18][19][20] was created for sorghum. In this corpus, a single GO term ID/name/namespace/defline combination is treated as a "document". The corpus contained 2697 documents, each corresponding to a GO term occurring in Sorghum v1.4, as determined from a custom gene association file (which excluded annotation qualifiers 'not', 'contributes_to', and 'colocalizes_with') that contained 58,610 GO annotations of 16,262 named sorghum genes. A keyword search for the root word 'regulat' returned 161 terms (151 biological process, 8 molecular function, 2 cellular component) meant to encompass broad-sense regulatory gene functional diversity in sorghum. The set of loci of interest for scenario two, called "regulatory genes", was made up of 1040 genomic intervals containing genes annotated to 161 GO terms.
In the third scenario we used natural language processing to transduce the written language of scientific publications into relevant genomic intervals. For proof of concept we chose the topic "crop adaptation to climate in the United States", a concept for which breeding objectives are complex. We used two highly-cited journal articles on the subject: "Nonlinear temperature effects indicate severe damages to U.S. crop yields under climate change" by Schlenker and Roberts [31], and "Greater sensitivity to drought accompanies maize yield increase in the U.S. Midwest" by Lobell et al. [32], plus a primer published as United States Department of Agriculture Technical Bulletin 1935 entitled "Climate change and agriculture in the United States: effects and adaptation" [33]. Documents were rendered as plain text and cleaned of low-relevance and unwanted information (e.g., author affiliations, citations). Latent semantic analysis [34,35], a natural language processing procedure, was performed using the machine-learning Python toolkit Scikit-learn [36]. Latent semantic analysis processes a text corpus into a system of vectors using singular value decomposition (SVD). A query, in this case the combined text of the scientific publications, is then fit to the vector system so that a distance between the query and each document in the corpus can be calculated based on similarity in word occurrence frequency.
Cellular component GO terms were removed from the corpus prior to vectorization using TfidfVectorizer(), configured with preprocessing directives to eliminate common English stop words ("it", "and", "the", etc.) and normalize the character set by stripping accents, punctuation, and other non-standard characters. Single word and adjacent two-word pairs were permitted as "tokens". Tokens were "tf-idf" weighted ("term frequency-inverse document frequency") which scales the importance of a token (here, a word or two word phrase) in relation to its frequency of occurrence in a document (here, a single GO term) and to the inverse of its frequency in the corpus (all GO terms).
A matrix containing the weighted frequency of words and word pairs for each GO term was then subject to singular value decomposition for reduction into a vector system, allowing up to 2500 component axes suchs that the distance between each individual GO term and query could be computed without the progressive reduction of corpus size usually necessary to achieve stable document-query ranking. TruncatedSVD() was used with ten randomized iterations, method fit_transform() created the vector system for the corpus, and method transform() mapped the query text onto that system. To express similarity in concepts, cosine distances between the query and each GO term were calculated using sklearn.metrics.pairwise_distances() then ranked. The top 50 GO terms most semantically related to the query text were identified (Table 1), resulting in a set containing 505 genes, here called "climate adaptation genes". The software M+ (https://github.com/NCGRP/Mplus) [18,23] was used to assemble optimized subsets of populations/accessions containing maximum diversity in reference data sets. M+ is a parallelized implementation of the M strategy, a commonly used subset optimization algorithm that outperforms stratified (population structure-based) sampling procedures for retaining allelic diversity [17]. The M strategy focuses only on retaining the maximum number of distinct alleles in a minimized subset relative to the complete collection. It does not attempt to maintain representative allele frequencies across a stratified sample as they occur in the complete collection. Both subset optimization philosophies are useful, but the former quantity, raw allelic diversity or high allele count, is the quantity of interest here. Reference data sets included geographic, environmental, and whole genome haplotypic diversity, as well as, for sorghum, the lignin biosynthetic gene, regulatory gene, and the climate adaptation gene sets. M+ returns a normalized metric of diversity captured at user-defined variables (here, loci or quantized geographic/environmental data) in optimized subsets (m opt ) as well as in subsets that are randomly assembled (m ran ). To assign a value indicating the capacity of a reference set of information to predict diversity at each SNP position in the genome, we calculated the deviation m opt − m ran for each genomic interval defined by a haplotype block, for all possible subset sizes. We have previously termed this value the "diversity retention index" (D RI , details [24]). D RI > 0 indicates that allelic diversity at target loci can be predicted (and therefore captured) better than random by using reference data. D RI ≤ 0 indicates that the reference data contain no useful information to predict allelic diversity at the target loci. Using this analytical construct we can estimate the relative enrichment of allelic diversity in subsets targeting the whole genome or specific functional gene sets of interest. By varying haplotype block length, we can also evaluate the effect of the extent of LD on those estimates. Annotated bioinformatic pipelines and other resources relevant to performing the analyses have been archived at https://github.com/NCGRP/agr1suppl.

Results
Genomewide SNP data broadly outperformed geographic and environmental data in capturing haplotypes across the genome ( Figure 1). Using genomewide SNP data allowed more haplotypic diversity to be captured across 99% of the Arabidopsis genome (98%, Populus; 89%, sorghum). In all three taxa, genomewide SNP data allowed capture of significantly more haplotypic diversity than geographic and environmental data (t-test, p < 2E−16; Figure 2). In Arabidopsis, maximizing geographic and environmental diversity resulted in the capture of less haplotypic diversity than would be expected in random subsets. In Populus, genomewide SNP data led to the capture of 3-9 fold more haplotypic diversity than geographic and environmental data. In sorghum, genomewide SNP data captured 19-78 fold more haplotypic diversity while geographic and environmental data types captured little more than random.
In sorghum, we evaluated the ability of reference data sets comprised of genomewide SNP data, geographic data, or environmental data to capture diversity at loci in the lignin biosynthetic gene, regulatory gene, and climate adaptation gene sets. Genomewide SNP data showed superior ability to capture haplotypes at all target gene sets relative to geographic and environmental data ( Figure 3). This was true whether the haplotypic diversity used for subset optimization was derived from the whole genome or solely from the haplotype blocks covering the target genes. Geographic and environmental data performed little better than random, regardless of the LD extent simulated. When the lignin biosynthetic gene set was used as reference there was a modest advantage to using short block lengths (<~10 Kbp), consistent with estimates of the rate of LD decay in sorghum [37] ( Figure 3a). A related pattern was observed for the regulatory genes, where, for long LD, subsets that maximized whole genome haplotypic diversity captured more regulatory gene haplotypes than subsets that maximized haplotypic diversity in the regulatory genes themselves. For short LD, regulatory gene haplotype capture was equivalent. Regardless of LD extent, climate adaptation gene haplotypes were better captured by maximizing diversity in the target gene set itself rather than the whole genome.
We also evaluated the ability of the target gene sets, used as reference, to capture haplotypes genomewide, again varying the maximum block length to simulate different levels of LD around the loci. This is a measure of genomewide haplotypic diversity that might be "lost" if subset optimization only considered the target gene sets. Target gene sets captured significantly fewer haplotypes genomewide than genomewide SNP data, but significantly more than geographic and environmental data (t-test, Holm-Bonferroni corrected p < 2E−33) (Figure 4). Geographic and environmental data captured slightly more haplotypic diversity genomewide than random subsets when LD was short, and similar levels to random when LD was long.   Figure 1. Relative improvement in genomewide haplotype capture achieved by using genomewide single nucleotide polymorphism (SNP) data instead of geographic or environmental data. Y-axis measures relative enrichment, where zero is the haplotypic diversity recovered in random subsets. Vertical lines mark chromosome boundaries. Points show haplotypic diversity captured at each SNP position by maximizing genomewide diversity (purple), geographic diversity (yellow), or environmental diversity (teal) in subsets. Subsets that maximize geographic or environmental diversity capture less haplotypic diversity than randomly assembled subsets across much of the genome.
Breeder access to these large, complex collections is dependent on the generation and distribution of manageable subsets that contain genetic diversity relevant to a breeding objective. Subsets drawn from germplasm collections for breeding purposes are often ad hoc assemblages based on experience with the collection, intuition, imperfect phenotypic data, and some expression of diversity in provenance suchs as geographic (source locality) or environmental (cultivation conditions) variation, intended as a surrogate for molecular genetic diversity. Given limited information, these are all valuable factors. Increasingly, breeders may possess a set of candidate genes that, effectively, defines the breeding objective, e.g., in cases like metabolic pathway engineering [43] or NBS-LRR gene-mediated disease resistance breeding [44]. Alternatively, they may have a general idea of categories of genes relevant to the breeding objective, e.g., drought resistance genes [45], secondary cell wall deposition genes [46], photosynthetic genes [47]. Or, they may face complex, multidimensional challenges where many genes are likely to be involved, e.g., yield increase [48,49], generalized stress resistance [50,51], or adaptation to new climates [52].
Agronomy 2020, 10, x 7 of 16 Figure 2. Average haplotype capture in subsets assembled using genomewide single nucleotide polymorphism (SNP), geographic, and environmental data. Y-axis measures relative enrichment, zero is haplotypic diversity of random subsets. Error bar shows one standard deviation from the mean. White bars, genomewide SNP data; black, geographic data; grey, environmental data. All pairwise comparisons within taxa differ significantly (t-test, p < 2E−16).
In sorghum, we evaluated the ability of reference data sets comprised of genomewide SNP data, geographic data, or environmental data to capture diversity at loci in the lignin biosynthetic gene, regulatory gene, and climate adaptation gene sets. Genomewide SNP data showed superior ability to capture haplotypes at all target gene sets relative to geographic and environmental data ( Figure 3). This was true whether the haplotypic diversity used for subset optimization was derived from the whole genome or solely from the haplotype blocks covering the target genes. Geographic and environmental data performed little better than random, regardless of the LD extent simulated. When the lignin biosynthetic gene set was used as reference there was a modest advantage to using short block lengths (<~10 Kbp), consistent with estimates of the rate of LD decay in sorghum [37] (Figure 3a). A related pattern was observed for the regulatory genes, where, for long LD, subsets that maximized whole genome haplotypic diversity captured more regulatory gene haplotypes than subsets that maximized haplotypic diversity in the regulatory genes themselves. For short LD, regulatory gene haplotype capture was equivalent. Regardless of LD extent, climate adaptation gene haplotypes were better captured by maximizing diversity in the target gene set itself rather than the whole genome.
We also evaluated the ability of the target gene sets, used as reference, to capture haplotypes genomewide, again varying the maximum block length to simulate different levels of LD around the loci. This is a measure of genomewide haplotypic diversity that might be "lost" if subset optimization only considered the target gene sets. Target gene sets captured significantly fewer haplotypes genomewide than genomewide SNP data, but significantly more than geographic and environmental data (t-test, Holm-Bonferroni corrected p < 2E−33) (Figure 4). Geographic and environmental data captured slightly more haplotypic diversity genomewide than random subsets Average haplotype capture in subsets assembled using genomewide single nucleotide polymorphism (SNP), geographic, and environmental data. Y-axis measures relative enrichment, zero is haplotypic diversity of random subsets. Error bar shows one standard deviation from the mean. White bars, genomewide SNP data; black, geographic data; grey, environmental data. All pairwise comparisons within taxa differ significantly (t-test, p < 2E−16).
The purpose of this study is to demonstrate the superiority of genomewide DNA sequence data for navigating complex germplasm collections to produce subsets containing genetic diversity appropriate to a breeding objective. Using genomewide SNP data sets from three heterogeneous species samples, we visualized haplotype capture at each SNP position across the genome (Figure 1). We found geographic and environmental surrogate predictors of genetic diversity to be profoundly limited in their ability to capture haplotypic diversity genomewide. Subsets optimized using genomewide SNP data should be expected to contain an order of magnitude or more genetic diversity than those produced using geographic or environmental data alone (Figure 2). We have made this observation in a general sense before [23,24] but here illustrate it with previously unobtained granularity, down to the level of the nucleotide position. Geography and environment remain useful for assembling subsets for ecological conservation (landscape restoration, reintroductions, etc.) [53], where the object is to mimic natural distributions in allele frequency, because geography and environment are good predictors of allele frequency differences across species ranges. However, they do not appear useful for producing subsets with maximum haplotypic diversity, as would be desirable for crop breeding purposes.  9 Figure 4. Genomewide haplotype capture in sorghum using genomewide SNPs (single nucleotide polymorphisms), three sets of target genes, geographic data, and environmental data to assemble subsets. X-axis, average physical length of haplotype blocks (Kbp). Y-axis, rescaled relative enrichment of haplotypic diversity genomewide, zero is random expectation. Points are coded by reference data source: triangle, whole genome; 'l', lignin biosynthetic genes; 'r', regulatory genes; 'c', climate adaptation genes; plus sign, geography; circle, environment.
Breeder access to these large, complex collections is dependent on the generation and distribution of manageable subsets that contain genetic diversity relevant to a breeding objective. Subsets drawn from germplasm collections for breeding purposes are often ad hoc assemblages based on experience with the collection, intuition, imperfect phenotypic data, and some expression of diversity in provenance such as geographic (source locality) or environmental (cultivation conditions) variation, intended as a surrogate for molecular genetic diversity. Given limited information, these are all valuable factors. Increasingly, breeders may possess a set of candidate genes that, effectively, defines the breeding objective, e.g., in cases like metabolic pathway engineering [43] or NBS-LRR gene-mediated disease resistance breeding [44]. Alternatively, they may have a general idea of categories of genes relevant to the breeding objective, e.g., drought resistance genes [45], secondary cell wall deposition genes [46], photosynthetic genes [47]. Or, they may face complex, multidimensional challenges where many genes are likely to be involved, e.g., yield increase [48,49], generalized stress resistance [50,51], or adaptation to new climates [52].  . Genomewide haplotype capture in sorghum using genomewide SNPs (single nucleotide polymorphisms), three sets of target genes, geographic data, and environmental data to assemble subsets. X-axis, average physical length of haplotype blocks (Kbp). Y-axis, rescaled relative enrichment of haplotypic diversity genomewide, zero is random expectation. Points are coded by reference data source: triangle, whole genome; 'l', lignin biosynthetic genes; 'r', regulatory genes; 'c', climate adaptation genes; plus sign, geography; circle, environment.
Agronomy 2020, 10, x 10 of 16 Figure 5. Uneven phenotyping in United States sorghum collection. Columns contain 2378 accessions with genomewide SNP data. Rows contain one trait measured in one study (787 rows). The same trait may occur multiple times in the matrix, measured in independent studies. White indicates a measurement, black, missing data. Horizontal white traces indicate traits that have been measured for many accessions in a single study. Vertical traces mark accessions that have been measured for many traits.
The purpose of this study is to demonstrate the superiority of genomewide DNA sequence data for navigating complex germplasm collections to produce subsets containing genetic diversity appropriate to a breeding objective. Using genomewide SNP data sets from three heterogeneous species samples, we visualized haplotype capture at each SNP position across the genome ( Figure  1). We found geographic and environmental surrogate predictors of genetic diversity to be profoundly limited in their ability to capture haplotypic diversity genomewide. Subsets optimized using genomewide SNP data should be expected to contain an order of magnitude or more genetic diversity than those produced using geographic or environmental data alone (Figure 2). We have made this observation in a general sense before [23,24] but here illustrate it with previously Figure 5. Uneven phenotyping in United States sorghum collection. Columns contain 2378 accessions with genomewide SNP data. Rows contain one trait measured in one study (787 rows). The same trait may occur multiple times in the matrix, measured in independent studies. White indicates a measurement, black, missing data. Horizontal white traces indicate traits that have been measured for many accessions in a single study. Vertical traces mark accessions that have been measured for many traits.
Some level of phenotypic characterization is available for many species held in gene banks thus phenotypic data is an important component of the informational complexity of a collection. In this study we have not quantified the utility of phenotypic data. Phenotypic data are often incomplete, inapplicable to a breeding objective, or imperfectly sampled. In the US sorghum collection, for example, 94 traits have been measured in 101 studies over 26 years at 12 locations (9 US, 2 Mexico, 1 Brazil).
In much of the collection (45396 accessions) at least one trait has been measured, but few traits have been measured for broad swaths of the collection and few accessions have been phenotyped for multiple traits ( Figure 5). In order to be useful for assembling subsets from the entire collection, many traits would need to be phenotyped across most accessions at the same site in the same growing season. This is a practical impossibility for large collections. Moreover, in order to be especially valuable for breeding, accessions would have to be phenotyped at multiple sites over multiple growing seasons so that genotype by environment interactions affecting trait expression could be mitigated. Phenotypic data is laborious to obtain and, ultimately, is not a reliable predictor of trait values in different genetic backgrounds [10,13]. Assuming comprehensive genomewide sequence data is held by the gene bank, phenotyping is a responsibility that could logically rest with the user. Within a breeding program, sample sizes are manageable, and scoring of traits and site selection can be carefully tailored to the breeding objective.
We considered three scenarios for accessing germplasm collections bioinformatically instead of using passport (e.g., geographic or environmental variation) and evaluation data. In the first, where the user begins with a modest-sized list of candidate genes (here, putative lignin biosynthetic genes), we showed that much more haplotypic diversity can be captured in a subset by using SNPs in the candidate loci for the optimization, rather than surrogate predictors or even genomewide SNPs (Figure 3a). Thus breeders with sets of candidate loci can easily mine gene banks for haplotypic diversity of direct relevance to them, provided the collection has been characterized with genomewide DNA sequence data.
The second and third scenarios relied on the Gene Ontology to relate breeding objectives with genomic locations. The Gene Ontology is a controlled vocabulary designed to specify what genes "do" (molecular function and biological process terms) and where they act (cellular component terms). By associating each gene with a set of GO terms, genome annotations assign functional concepts to genomic intervals. GO term definitions contain human-readable, descriptive text passages organized into a formalized set of machine-readable fields so can be leveraged, computationally, to bridge the gap between breeding objective, expressed as phenotype, and genomic locations, which hold haplotypic diversity. The theory behind expressing phenotypes as GO terms was developed in the biomedical literature [54,55].
In scenario two, we imagined a subset containing the maximum potential to alter phenotypes in a general sense (i.e., no particular trait or set of traits is targeted). Regulatory genes, by definition, alter gene expression and are well-established generators of morphological trait variation [56][57][58], thus these regions should be included in suchs a subset. Most sequence variation is neutral with respect to phenotype, residing in non-functional, non-genic regions, thus these regions should be excluded. By identifying Gene Ontology terms with a regulatory connotation then extracting the associated genomic regions in sorghum, we were able to assemble subsets that specifically maximized regulatory gene haplotypic diversity while excluding non-regulatory and non-genic regions from the optimization. Subsets enriched in regulatory gene diversity could contribute greater general phenotypic variation to a breeding program than those that maximize haplotypic diversity genomewide since the latter include much non-functional variation.
In the third scenario we posited a situation where the breeding objective is sufficiently complex that an intuition-based query of the Gene Ontology would be ill-advised. For this we used a natural language processing procedure, latent semantic analysis, to transduce concepts presented in scientific publications into GO terms, which map to specific regions of the genome that can be targeted, as proposed by Reeves and Richards [24]. Scientific knowledge exists in two places: the mind (as thoughts) and publications (as written words). In cases where problems contain too many variables for human thought processes to represent them adequately, machine learning procedures can help. Computers excel at multidimensional problems and, due to rapid advances in artificial intelligence, can now readily parse written words into basic "meaning" [59]. As proof of concept, we chose the subject "crop adaptation to climate in the United States", which represents a long-standing breeding objective (crop variety development for new growing regions) coupled with the contemporary challenge of mitigating climate change effects on agriculture (crop variety development for new growing conditions in the same regions). We processed three well-regarded scientific publications on the subject [31][32][33] into GO terms and the associated intervals in the sorghum genome. The 50 GO terms most semantically-related to the textual content of the publications are shown in Table 1. While we have no objective means at present to test their validity for producing subsets that maximize phenotypic variation in traits related to climate adaptation, the GO terms found by latent semantic analysis are intuitively reasonable, representing responses to changing moisture and temperature regimes as well as other plant stressors.
The three scenarios demonstrate that, when armed with genomewide SNP data, one can produce efficient breeding subsets from heterogeneous collections that specifically target genes of interest. More importantly, the procedure to produce suchs subsets can be adapted to any breeding objective and even automated, to the point of requiring very little prior scientific knowledge on the part of the collection curator or gene bank user (e.g., scenario three). Accordingly, we support the proposal that germplasm collections be curated and accessed with a central focus on DNA sequence diversity [25], rather than phenotypic variation or associated passport data. DNA sequence diversity forms the underlying physical basis of the potential phenotypic variation held in collections and, now that large scale resequencing is practical (e.g., [40,41]), should be used as the primary feature by which collections are interrogated. Genomewide sequence data sets provide the added opportunity to catalog allelic variants at loci of interest and enter them directly into crop improvement programs utilizing transgenics or genome editing, thereby circumventing traditional breeding and associated linkage drag lag time [60,61].
Our proposed methods for accessing functional genetic diversity in plant germplasm collections come with caveats, the most important of which is that the correlation between haplotypic diversity at genes identified by GO term, and phenotypic diversity at traits implied by GO term, is untested. It could be tested with large phenotype-genotype data sets from model organisms (e.g., [62]), but ultimately should be worked out in the field using the crop of interest because genetic background and environmental conditions are essential components of phenotypic expression. A technical caveat concerns the capture of low frequency variants. Genomewide resequencing projects typically define a MAF cutoff (often 5%) in order to exclude singletons and sequencing errors from the final data set. Low frequency SNP variants may not be directly observable in suchs data sets and thus rare haplotypes could be excluded from subset optimizations. Decreasing the MAF cutoff necessitates increasing sequence read depth, which increases costs. Additionally, the causal genetic variation underlying phenotypic variation may, for some traits, be attributable to copy number variation or indels (CNV) rather than nucleotide substitutions. By simulating haplotype blocks prior to subset assembly this concern is likely mitigated by virtue of linkage between SNPs and adjacent CNVs. Advances in pangenome analysis [63][64][65][66] may further ameliorate this concern because CNVs can be used in subset optimization the same as SNPs. The same caveat holds for non-coding regulatory sequence variation, suchs as that in promoter and enhancer sequences. This variation would be excluded, by definition, from subsets assembled to maximize variation in gene sets discovered by interrogating the Gene Ontology. Operationally, the possibility of "missing" any of this esssential variation would be partly mitigated by linkage with adjacent, included, genic regions, and could be explicitly tackled by extending the genomic interval contained in the gene definition to include putative cis-regulatory sequence when maximizing haplotypic diversity. Sequence variation in distant enhancer elements could still be overlooked.

Conclusions
The bioinformatic procedures described here support the proposal to use DNA sequence as the primary means to organize plant germplasm collections, and bioinformatics as the primary means of access [25]. Implicit in that support is an advocacy for systematic whole genome resequencing projects and software development (e.g., DivSeek, GOBII, InterMine initiatives) directed at these carefully curated and maintained public resources. Increased focus on the fundamental unit of conservation, the haplotype block, and increased utilization of artificial intelligence approaches to translate complex genomic information into manageable, relevant, and distributable genetic diversity will be necessary to address future agricultural challenges.