Genetic and Genomic Resources for Soybean Breeding Research

Soybean (Glycine max) is a legume species of significant economic and nutritional value. The yield of soybean continues to increase with the breeding of improved varieties, and this is likely to continue with the application of advanced genetic and genomic approaches for breeding. Genome technologies continue to advance rapidly, with an increasing number of high-quality genome assemblies becoming available. With accumulating data from marker arrays and whole-genome resequencing, studying variations between individuals and populations is becoming increasingly accessible. Furthermore, the recent development of soybean pangenomes has highlighted the significant structural variation between individuals, together with knowledge of what has been selected for or lost during domestication and breeding, information that can be applied for the breeding of improved cultivars. Because of this, resources such as genome assemblies, SNP datasets, pangenomes and associated databases are becoming increasingly important for research underlying soybean crop improvement.


Introduction
Cultivated soybean (Glycine max) is a major protein and oil crop and reached a worldwide production of 349 million tons in 2018, equivalent to a total export value of USD 59 billion (http://www.fao.org/faostat, accessed on 15 October 2021).
The global importance of soybean as a crop enabled the growing amount of soybean breeding research on varieties ranging from wild and semi-wild relatives to domesticated landraces and modern elites, including genome and transcriptome sequencing, functional assays, phenotype and trait discovery. The wide range of assemblies, pangenome and variant resources as well as databases support researchers in studying soybean.
The progress of soybean research resources in the last decade has been recently reviewed with a focus on gene discovery [5]. We expand this by summarizing resources that support researchers in the field of soybean breeding research.
In this review, we elucidate milestones in soybean genetics and genomics research ( Figure 1) and provide details on the currently available soybean genetic and genomic databases. We detail available marker technologies for soybean and summarize soybean whole-genome resequencing studies as gold standard for variation studies across populations. We provide the step-by-step development of the current high quality reference genomes and pangenomes and highlight the challenges of data interoperability, metadata 2 databases. We detail available marker technologies for soybean and summarize soybean whole-genome resequencing studies as gold standard for variation studies across populations. We provide the step-by-step development of the current high quality reference genomes and pangenomes and highlight the challenges of data interoperability, metadata annotation and scarcity of associated data, including data for proteomics, metabolomics and phenomics, which limit the application of these data for crop improvement. Finally, we propose approaches that may support more integrated data management and analysis; so, as databases continue to improve and expand, they can be applied for the improvement of this important crop.

SNP Marker Arrays
SNP marker arrays are a cost-effective option for capturing genetic variation across a population. These marker arrays report the allelic state of specific loci for individuals across the genome, designed to provide an overview of the genome, or target regions of interest, with applications both in breeding and research [6]. The first major genotyping array spanning the soybean genome, the Soy50KSNP array [7], allowed researchers to characterize 52,041 variant sites. This was applied to genotype the 18,480 domesticated and 1168 wild accessions in the USDA Soybean Germplasm Collection [8]. Genotyping using this array can be carried out in conjunction with trait association analysis, such as GWAS, to identify regions underlying agronomically important traits related to seed composition [9][10][11], flooding tolerance [12][13][14] and sudden death syndrome [15,16]. Denser marker genotyping arrays were subsequently developed, including the 180K AXIOM ® SoyaSNP array [17] and NJAU 355K SoySNP array [18], allowing for more in depth inquiry into the landscape of genomic diversity in soybean and providing insights into the history of soybean domestication [19][20][21].
Although dense SNP arrays are preferred when studying soybean evolution, studies suggest that, to obtain maximum efficiency in genomic prediction breeding models, only

SNP Marker Arrays
SNP marker arrays are a cost-effective option for capturing genetic variation across a population. These marker arrays report the allelic state of specific loci for individuals across the genome, designed to provide an overview of the genome, or target regions of interest, with applications both in breeding and research [6]. The first major genotyping array spanning the soybean genome, the Soy50KSNP array [7], allowed researchers to characterize 52,041 variant sites. This was applied to genotype the 18,480 domesticated and 1168 wild accessions in the USDA Soybean Germplasm Collection [8]. Genotyping using this array can be carried out in conjunction with trait association analysis, such as GWAS, to identify regions underlying agronomically important traits related to seed composition [9][10][11], flooding tolerance [12][13][14] and sudden death syndrome [15,16]. Denser marker genotyping arrays were subsequently developed, including the 180K AXIOM ® SoyaSNP array [17] and NJAU 355K SoySNP array [18], allowing for more in depth inquiry into the landscape of genomic diversity in soybean and providing insights into the history of soybean domestication [19][20][21].
Although dense SNP arrays are preferred when studying soybean evolution, studies suggest that, to obtain maximum efficiency in genomic prediction breeding models, only 1000-2000 markers are required [22,23]. As a result, there have been advancements towards targeting smaller, non-redundant sets of informative markers. The BARCSoySNP6K was developed for cost-effective recombination tracing in biparental populations [24,25], though it has also found utility in global population research [26,27].
Recent progress in bioinformatics has maximized the lower density genotype information gained from SNP marker arrays and genotyping-by-sequencing (GBS) through imputation using haplotypes identified in more detailed whole-genome resequencing (WGRS) data [28,29]. The GmHapMap was constructed using 1007 whole-genome resequenced individuals and enables the inference of allelic states with 96% accuracy at all SNP positions across the genome from only the 42,508 SNPs genotyped by the Soy50KSNP array [30].

Whole-Genome Resequencing
Currently, the gold standard method used to map genetic diversity in detail for breeding and genomic research is Whole-Genome Resequencing (WGRS) [31]. WGRS involves low-coverage sequencing individuals with short reads before being aligned to a genomic reference to identify nucleotides or regions that vary from the reference. Compared to SNP marker arrays, WGRS is often more expensive per individual, though it can provide high-density genome-wide allelic information for all loci in a reference [32]. Beyond small variants, such as SNPs, resequencing individuals can allow for the identification of structural diversity, such as copy number variation underlying soybean cyst nematode resistance [33].
The first major population-level WGRS project for soybean was published in 2010 for 17 wild and 14 domesticated soybean individuals, sequenced to an average depth of 5X [34]. This dataset was used in one of the first whole-genome investigations of structural variation, which revealed low levels of linkage disequilibrium decay compared to other plant species [34]. The same dataset was later used to identify a gene underpinning salt tolerance in wild soybean [35].
In 2015, WGRS increased substantially with the release of data for 302 wild and domesticated individuals for GWAS, characterizing selective signals related to domestication and improvement [36], as well as maternal lineages in the chloroplast genome [37]. Since 2015, the global soybean community steadily accumulated WGRS data through a series of projects, with an increasing focus on characterizing regional germplasm collections in Brazil [38], China [39,40], Canada [41], the USA [42], Japan [43] and Korea [29]. The largest soybean WGRS dataset to date contains 2898 wild and domesticated accessions, the majority originating from China, which was aligned to a Zhonghuang 13-based graph pangenome [44].
The growing availability of larger, more diverse WGRS datasets, including previously unstudied exotic individuals, holds tremendous promise for integrated research studying the underlying genomic basis of trait variability in soybean lineages.
High-quality genome assemblies for domesticated and wild soybean support researchers and breeders improving and adapting soybean for changing climate conditions, associated biotic and abiotic stresses, or market changes in soybean demand. Current, assemblies for domesticated soybean accessions capture the genetic diversity from the USA and China, but there is no high-quality assembly for the Brazilian germplasm. Genome assemblies for G. max are complemented by wild and perennial Glycine assemblies that allow researchers to identify changes in modern soybean due to domestication, as well as potentially beneficial genetic diversity that may have been lost.

Genome Assemblies
The first assembly for cultivated soybean, Glycine max var. Williams 82 (Wm82.a1), was published in 2010 [3] (Figure 2), with a size of 950 Mb and 46,430 gene models [3,45], which is more than eight times the size of the Arabidopsis thaliana genome and twice the size of many other legumes [46]. The assembly identified multiple rounds of Glycine-specific genome duplication that has led to 75% of genes becoming non-unique, and partially explains the large, repetitive G. max genome [3]. This assembly provided a foundation for functional genomics in soybean to accelerate crop trait dissection and support breeding programs. The Wm82.a1 assembly was ordered based on linkage maps using a limited number of markers and recombinant inbred lines, resulting in limited assembly quality in regions with low marker density [47]. In 2016, a new version of the G. max Wm82 assembly, Wm82.a2, which was published using two high-density linkage maps with a total assembly size of 978.5 Mb [47] (Figure 2). In 2019, the Wm82 assembly was further improved (Wm82.a4), closing 3600 gaps and adding another 5 Mb to the assembly size [48] ( Figure 2). The same study also released an assembly for the southern US accession Lee, with an assembly size of 985 Mb and a high structural similarity when compared with Wm82.a4 ( Figure 2). Both the Wm82.a4 and Lee assemblies represent much of the genetic diversity present in USA soybean cultivars, building a strong foundation for US soybean genetic research. explains the large, repetitive G. max genome [3]. This assembly provided a foundation for functional genomics in soybean to accelerate crop trait dissection and support breeding programs. The Wm82.a1 assembly was ordered based on linkage maps using a limited number of markers and recombinant inbred lines, resulting in limited assembly quality in regions with low marker density [47]. In 2016, a new version of the G. max Wm82 assembly, Wm82.a2, which was published using two high-density linkage maps with a total assembly size of 978.5 Mb [47] (Figure 2). In 2019, the Wm82 assembly was further improved (Wm82.a4), closing 3600 gaps and adding another 5 Mb to the assembly size [48] ( Figure 2). The same study also released an assembly for the southern US accession Lee, with an assembly size of 985 Mb and a high structural similarity when compared with Wm82.a4 ( Figure 2). Both the Wm82.a4 and Lee assemblies represent much of the genetic diversity present in USA soybean cultivars, building a strong foundation for US soybean genetic research. Outside the USA, a reference genome for the Chinese soybean cultivar Zhonghuang 13 was released in 2018. The assembly size was 1025 Mb with 52,021 gene models and 250,000 structural variations compared to Wm82.a2 [49] (Figure 2). This assembly was later improved using PacBio reads, optical mapping and Hi-C sequencing, and the total number of protein coding genes increased to 55,443 by integrating RNAseq data into the annotation [50] (Figure 2).
Following the draft assemblies of seven wild soybean accessions in 2014 [51], the first reference-grade assemblies of two wild soybean accession were published in 2019 ( Figure  2). The G. soja accession W05 was assembled with a size of 1013 Mb and 55,539 genes [52], identified an inversion in the seed color locus, a translocation between chromosome 11 and 13, and highlighted copy number variations for several gene clusters [52] (Figure 2). A second G. soja accession PI 483463 was also sequenced, with a 962 Mb assembly, demonstrating significant sequence diversity [48]. An assembly for Glycine latifolia accession PI 559298, a perennial relative, was released in 2018 [53] presenting high levels of genetic diversity and agronomically favorable traits, including sclerotinia stem rot and soybean Outside the USA, a reference genome for the Chinese soybean cultivar Zhonghuang 13 was released in 2018. The assembly size was 1025 Mb with 52,021 gene models and 250,000 structural variations compared to Wm82.a2 [49] (Figure 2). This assembly was later improved using PacBio reads, optical mapping and Hi-C sequencing, and the total number of protein coding genes increased to 55,443 by integrating RNAseq data into the annotation [50] (Figure 2).
Following the draft assemblies of seven wild soybean accessions in 2014 [51], the first reference-grade assemblies of two wild soybean accession were published in 2019 (Figure 2). The G. soja accession W05 was assembled with a size of 1013 Mb and 55,539 genes [52], identified an inversion in the seed color locus, a translocation between chromosome 11 and 13, and highlighted copy number variations for several gene clusters [52] (Figure 2). A second G. soja accession PI 483463 was also sequenced, with a 962 Mb assembly, demonstrating significant sequence diversity [48]. An assembly for Glycine latifolia accession PI 559298, a perennial relative, was released in 2018 [53] presenting high levels of genetic diversity and agronomically favorable traits, including sclerotinia stem rot and soybean rust resistance that are absent in G. max [54][55][56][57]. The assembly of 939 Mb and 54,475 genes included hundreds of candidate disease-resistance genes, including 367 LRR genes, less than the 467 LRR genes found in G. max [36,53].
Recently, a genome assembly of the popular Korean soybean cultivar Hwangkeum, known for its resistance to all the USA soybean mosaic virus strains, was released with an assembly size of 933.12 Mb and 58,550 genes [58] (Figure 2). While SNPs, indels and structural variants were identified when comparing Hwangkeum with Wm82.a4, no large genomic rearrangements were identified, which is in contrast to four large scale chromosomal rearrangements identified between Wm82.a4 and Zhonghuang 13 [49,50].
The global importance of soybean as a crop is reflected in the regularity of improvements to soybean genome assemblies. The reference assembly for Wm82 has been improved twice since its initial release in 2010, and together with the reference assemblies for Lee, Zhonghuang 13 and Wm82, the latter of which has been improved twice since its initial release in 2010, provide the foundation of modern soybean research.

Pangenomes
Comparative genomic studies have demonstrated that single reference genome assemblies do not represent the full genomic diversity of a species. To address this, pangenomes have been assembled that represent the gene content of a species rather than of a single individual [59][60][61]. Pangenomes have been assembled for several plant species, such as banana [62], sorghum [63], bread wheat [64], Brassica oleracea [60], Brassica napus [65], the Brassica genus [66], chickpea [67], tomato [68], sunflower [69], pigeon pea [70], cotton [71] and rice [72]. These studies have revealed extensive gene presence/absence variation and that some genes that are not present in all accessions may have important biological functions, such as biotic and abiotic stress tolerance.
The first soybean pangenome was published in 2014 and was one the first pangenomes developed in plants [51] (Figure 2). The study mapped whole-genome resequencing data for seven representative G. soja accessions to the Wm82.a1 reference and identified 3.63 to 4.72 million SNPs, 0.5 to 0.77 million indels and a total of 338 genes that were absent in the G. max reference. Variable genes were enriched for defense response, cell growth and photosynthesis [51].
Soybean pangenomics expanded in 2020 with the analysis of 2898 accessions, including the de novo assembly of 26 individuals representing distinct diversity clusters [44] ( Figure 2). These 26 accessions were combined into a graph-based pangenome using vg [73] with Zhonghuang 13 as the primary reference genome. Finally, data from the full set of 2898 accessions were mapped to the pangenome graph and structural variants identified. This process identified a total of 57,492 gene families, of which only 35.9% were present in all 27 accessions [44]. Variable gene families were more diverse and had a higher rate of positive selection compared to core genes, and they were also enriched for abiotic and biotic stress response annotation. The study identified 14.6 million SNPs and 12.7 million indels when comparing the pangenome with the Zhonghuang 13 reference [44]. The wealth of small and variant information collated in this dataset has been used to characterize structural variants associated with iron use efficiency and flowering time as well as inversions and gene fusion events associated with soybean domestication [44].
Two further pangenome studies were published in 2021 [74,75] (Figure 2). PanSoy was constructed using the GmHapMap dataset [30], processed with the EUPAN pipeline [76] based on the Wm82 reference, and resulted in a total pangenome size of 1086 Mb with 54,531 genes, including 1659 novel genes. Of these genes, 7% were variable and enriched for annotations associated with the regulation of immune and defense responses, signaling and plant development [74]. The other pangenome was constructed using a previously published iterative method [60] and based on the Lee soybean assembly. It represents the USDA soybean collection, including wild lines, landraces and modern cultivars. The resulting pangenome had an assembly size of 1213 Mb with 51,414 genes [75]. Of these, 13.2% were variable and enriched in annotations associated with response to biotic and abiotic stress, including defense response, response to abscisic acid and response to salt stress. In addition, the USDA soybean pangenome identified genes that changed in frequency when comparing individuals with different breeding histories [75]. These three pangenomes capture the majority of the genomic diversity present in G. max and G. soja. However, the overall genetic diversity in this gene pool still remains low and limits the crops' potential in yield and resilience [77].
The expansion of the known gene pool in soybean is the focus of the most recent study by Zhuang, et al. [78] (Figure 2), which de novo assembled five diploid perennial Australian Glycine species (2n = 40), G. falcata, G. stenophita, G. cyrtoloba, G. syndetika and G. tomentella D3 and the perennial Australian allopolyploid G. dolichocarpa (2n = 80) at the chromosome level. The assembly sizes of the 5 diploids range from 941 to 1374 Mb with 55,376 to 58,312 protein coding genes and the allopolyploid G. dolichocarpa had an assembly size of 1948 Mb and 113,697 genes. The assembled diploid perennial genomes and 26 selected annual soybean genomes were then used to construct a super-pangenome framework that annotated 109,827 genes in the pool of perennials with 29% perennial core genes and 129,006 genes in the annuals with 24.5% annual core genes. Of the perennial core genes, 56.2% overlapped with annual core genes, 27.2% with variable annual genes and 16.6% were perennial specific. A total of 82.3% of variable perennial genes were not found in the annual gene pool. The identification of perennial specific genes is the first step to expand soybean pangenomics across species boundaries and links genetic variation with phenotypes of agronomic importance.

Databases and Tools for Explorative Data Analysis
With the growing quantity and diversity of genetic and genomic information for soybean, there is a requirement for the integration of data to improve gene annotation and to discover associations between allelic variants and agronomic traits. There are currently several relevant soybean datasets. For example, SoyKB [79] and SoyBase [80] offer curated genomic and genetic datasets, including epigenetic maps, gene expression data, regulatory RNA data, genomic sequence variants and pangenome gene visualization. These databases are continuously updated to host soybean genome analysis results [74] and are employed by the community for biological analyses, including Gene Ontology enrichment [74], QTL mapping and gene identification [81], quantitative disease resistance estimation [82] and the identification of homologous genomic features in related species [83]. A list of online soybean databases is given in Table 1. Across the different databases, users can find tools to explore and visualize genetic maps, soybean mutant lines, gene families and characterize differential gene expression. The value of genetic and genomic data is limited without associated phenotypic data. Phenotypic data have allowed breeders to identify QTLs and SVs associated with soybean yield and performance under abiotic stresses [111]. Several phenotypic datasets are hosted in the databases described in Table 1. The use of information-dense phenotype datasets can improve the association of genetic markers with crop traits [112]. For example, a multienvironment trial using 393 individuals from the SoyNAM (www.soynam.org, accessed on 15 October 2021) population used high throughput drone images to estimate above ground biomass. The derived phenotype data were used to identify genetic loci associated with biomass production at different times during crop growth [113]. Another study used image data from 5555 soybean SoyNAM lines in a GWAS, uncovering QTLs on chromosome 19 associated with average canopy coverage and increased yield [110]. With the expansion of genetic and genomic datasets, combined with high throughput phenotypic analysis, we can expect to gain a greater understanding of how genomic diversity in this crop species underpins trait diversity, information that is valuable for applied crop improvement.
The increase in genomic and phenotypic datasets for soybean and the diversity of databases provides a challenge for integrative soybean analysis as datasets are often scattered across multiple repositories, making it hard for researchers to find all the relevant information that could be used for analysis. Although SoyBase and SoyKB offer central hubs to retrieve genotypic and genetic information across multiple varieties, other 'omics' datasets (e.g., proteomics, metabolomics and phenomics) are not so easily found. Many published datasets have relatively poor metadata, limiting detailed analysis. The Planteome and other plant ontology references serve as standards to assist semantic integration among different datasets [114,115]. For plant phenotyping, the MIAPPE guidelines have forms suggesting the minimal information that is necessary to describe in the metadata to enable other researchers to benefit from the data [116]. Adhering to data sharing guidelines and structures will enable researchers to explore previously published data more effectively, and leverage soybean genetic diversity for crop improvement.

Conclusions and Future Perspectives in Breeding
Finding novel sources for environmental adaptation is fundamental to support breeding approaches. Genome-environment association (GEA) in conjunction with GWAS have been used to predict drought [117,118] and heat tolerance [119,120] in closely related legumes, such as the common bean, which has been proposed as a diploid model for soybean [121]. Enabled by the availability of a wealth soybean marker datasets, GEA will also be an excellent option to study soybean environment adaptation in the future. Furthermore, the availability of genomic datasets and connected phenotypic and marker databases also builds a foundation for next-generation breeding technologies, such as genomic prediction [122], genome-wide scans of selection signatures [123], machine learning [124] and speed breeding [125]. The high-quality datasets available for soybean also enable the use of genomic-assisted backcrossing and replace marker-assisted backcrossing, which will accelerate future soybean breeding.
New technologies, such as long-read sequencing, have been used to generate modern high-quality reference genomes and to de novo assembly of more than 20 accessions in pangenomes. We believe that long-read sequencing is poised to replace WGRS as the gold standard for high-fidelity variation mapping across populations, with the construction of larger and larger de novo assembled pangenomes. Pangenomes are on the verge of expanding into the higher level taxon, which has been demonstrated by Zhuang, Wang, Li, Hu, Fan, Landis, Cannon, Grimwood, Schmutz, Jackson, Doyle, Zhang, Zhang and Ma [78], and will soon start to address questions in functional genomics to enable superpangenomics-guided breeding.
With a wealth of published soybean (pan-)genomes, genomics has firmly established itself as one of the basic tools of soybean plant breeders' toolkit. In this review, we gave an overview of the available data and germplasm resources for soybean researchers and breeders. The valuable data stored within these resources enables new approaches to breed soybean cultivars to meet the challenges posed by a growing world population in a warming climate.