LegumeSSRdb: A Comprehensive Microsatellite Marker Database of Legumes for Germplasm Characterization and Crop Improvement

Microsatellites, or simple sequence repeats (SSRs), are polymorphic loci that play a major role as molecular markers for genome analysis and plant breeding. The legume SSR database is a webserver which contains simple sequence repeats (SSRs) from genomes of 13 legume species. A total of 3,706,276 SSRs are present in the database, 698,509 of which are genic SSRs, and 3,007,772 are non-genic. This webserver is an integrated tool to perform end-to-end marker selection right from generating SSRs to designing and validating primers, visualizing the results and blasting the genomic sequences at one place without juggling between several resources. The user-friendly web interface allows users to browse SSRs based on the genomic region, chromosome, motif type, repeat motif sequence, frequency of motif, and advanced searches allow users to search based on chromosome location range and length of SSR. Users can give their desired flanking region around repeat and obtain the sequence, they can explore the genes in which the SSRs are present or the genes between which the SSRs are bound design custom primers, and perform in silico validation using PCR. An SSR prediction pipeline is implemented where the user can submit their genomic sequence to generate SSRs. This webserver will be frequently updated with more species, in time. We believe that legumeSSRdb would be a useful resource for marker-assisted selection and mapping quantitative trait loci (QTLs) to practice genomic selection and improve crop health. The database can be freely accessed at http://bioinfo.usu.edu/legumeSSRdb/.


Introduction
Legumes are the second most essential group of crops with around~800 genera and 20,000 known legume species in the world. They are among the few species which can convert nitrogen available in the air to plant-usable form and are capable of increasing the nitrogen content in the soil which, in turn, helps in nitrogen fertilization for other crops [1]. Legumes play a significant role in natural ecosystems, agriculture, and agroforestry. They play an important role in both animal and human food, providing approximately one-third of human nitrogen. An important nutritional aspect of legumes is the high concentration of protein, oil, and starch in their seeds, and they play a significant role in animal forages, and livestock as well as human consumption. Medicago truncatula and Lotus japonicas are two major model organisms that have been used to elucidate the genetic basis of legumerhizobial symbiosis, which is used to fix atmospheric nitrogen for plant use, whereas legumes such as Phaseolus vulgaris and Glycine max are major staple crops in various parts of the world.
With the increase in water-stressed areas in the world, drought problems are likely to increase in legume species [2]. It is important to study legumes and come up with solutions to cultivate more drought-tolerant legumes. Thus, it is important to understand the fundamental mechanisms of these species under different biotic and abiotic stresses. The advancement in molecular marker technology and next-generation sequencing technologies has increased the scope for crop improvement as it has become relatively easier for researchers to study any given species of interest on a genome-scale.
In conventional plant breeding, the genetic selection of plants is decided by the parents and influenced by different environmental conditions [3]. In traditional plant breeding, alleles are mixed over generations and new combinations are produced, which contribute to the selection process to achieve higher quality. Marker-assisted selection (MAS) has become common in many crop breeding programs in recent years [4][5][6][7]. The discovery of a DNAbased marker that is closely linked to the target trait is a prerequisite for using MAS. In many crops, the marker method of choice for quantitative trait loci (QTL) mapping has long been microsatellite markers (SSRs). Microsatellites are tandem repeats of 1-6-nucleotidelong DNA units that are flanked by identical genome sequences but occur more frequently in the non-genic region [8]. Per generation, the mutation rate of SSRs ranges between 10 −3 and 10 −6 [9], which increases with the length of the repeat unit [10]. They are extremely flexible, low-cost, highly insightful molecular markers, and based on PCR, consistent with high-frequency polymorphism [11]. These are one of the markers among various genetic markers such as RFLP (restriction fragment length polymorphism), RAPD (random amplification of polymorphic DNA), AFLP (amplified fragment length polymorphism), and SNP (single nucleotide polymorphism, used for germplasm characterization). SSRs have many applications such as genetic diversity assessment, gene mapping, marker-assisted selection [12], the study of population and phylogenetic relationships and bio-invasions, and disease control [11].
SSR screening using genomic libraries are time-consuming and expensive on a large scale; in recent years, the in silico approach of marker development has paved a path for researchers to make the SSR development viable and fast [13]. Therefore, a user-friendly database of all available genomic data of the legume species can be a valuable genomic resource for legume cultivation improvement and characterization, and bearing in mind the importance of SSRs for crop improvement in legumes. Here, we present legumeSSRdb, a comprehensive web resource integrated with a wide range of services such as SSR prediction, primer design for genotyping along with ePCR-based polymorphism discovery, JBrowse visualization, and many other enriching features.

Cross-Species Comparison of Legume Species SSRs
A total of 3,706,276 microsatellites were predicted from 13 legume species for the development of the legumeSSRdb web-resource. The highest numbers of SSRs were predicted in Arachis hypogaea, whereas Trifolium pratense had the fewest SSRs ( Table 1). The number of SSRs is strongly associated with the size of the genome; the larger the genome, the greater number of SSRs are predicted, except for the atypical case of Phaseolus vulgaris. Generally, species with large genome sizes tend to have low SSR frequency (SSRs/MB) [14]. However, there was no link found between the SSR density and genome size among the SSRs discovered in our analysis of 13 legume species. This is consistent with some of the recent findings that found no relation between the genome size and SSR density, and that the genome size differences may influence the degree of microsatellite repeats in the genome [15][16][17][18].

Characterization of the Perfect SSRs
SSRs repeating ≥ 15 times were termed as perfect SSRs. On average, around 30-40% of SSRs found were perfect. Out of these perfect SSRs, 25% were located in the coding regions of the genome and 75% were present in non-coding regions. Among the 13 species, Phaseolus vulgaris (66%) had the highest and Vigna unguiculata had the lowest percentage of perfect SSRs. Trifolium pratense has the highest percentage of perfect SSRs present in Table 1. Statistics of species-wise microsatellites based on genome size, number of base-pairs, number of SSRs per megabase-pairs, genomic location, and the number of repeat units.

Genome
Size

Characterization of SSRs by Motif Type
All the predicted SSR loci were categorized into six categories: monomers, dimers, trimers, tetramers, pentamers, and hexamers. Among all the species, around 85% of SSRs comprised monomers and dimers. Medicago truncatula had the highest percentage of monomeric repeat, whereas Lupinus albus had the lowest percentage of monomeric repeat; Lupinus albus, Lupinus angustifolius, and Phaseolus vulgaris had more dimeric repeat than monomeric repeat (Table 2, Supplementary Figure S1). There has been a high abundance of monomeric repeats in almost all genomes, which may be due to the inherent limitations of next-generation sequencing (NGS) methods used for data generation [17]. Likewise, dimeric repeat also recorded a higher abundance in other crops [19,20]. For trimeric repeat, Lupinus angustifolius had the highest percentage and Medicago truncatula had the lowest percentage. In the case of tetrameric repeat, Trifolium pretense had the highest percentage and Medicago truncatula had the lowest percentage. Lupinus albus had the highest percentage (16.4%) of pentameric repeat, whereas Medicago truncatula and Vigna unguiculata had the lowest percentage (0.1%). For hexameric repeat, Lupinus angustifolius had the highest percentage (9.5%) followed by Lupinus albus and Cicer arietinum (0.3%), whereas Glycine max, Medicago tranctula, Trifolium pratense, Phaseolus vulagaris, and Vigna anguaris has the lowest percentage (0.1%). In all SSR classes, it was observed that longer repeats were less abundant; a decreased trend in SSR frequency with an increased trend in their repeat number has been observed in other species [18,21].

Functional Annotations of Predicted SSRs
The predicted SSRs in each species were mapped to their corresponding annotation file to classify genic and non-genic SSRs. Furthermore, genic SSRs were classified into exons, 5 UTR and 3 UTR regions. For non-genic SSRs, their closest genes were also assigned. Around 20-25% of the genic SSRs were present in each species (Supplementary Figure S2). Trifolium pratense had the highest percentage (36%) of SSRs present in the genic region, whereas Cicer arietinum had the lowest percentage (13%) of genic SSRs.

Web Genomic Resource: legumeSSRdb
The legume SSR web genomic resource (legumeSSRdb) was developed using three-tier architecture. This is the first comprehensive legume SSR resource containing 13 species. The web resource comprises seven tabs: Home, About, Species, Tools, JBrowse, Help, and Contact. Predicted SSRs can be searched based on the region in the genome, chromosome number, motif type, and repeat motif sequence. In the advanced query, a user can choose more parameters such as the range of the number of repeat units (how many times a motif should repeat), range of length of the SSR, and chromosome location range (the specific location of the genome with a start and an end). The result page displays real-time visualization of the searched result with graphs and tables. The result table can be sorted based on motif start, motif end, size, motif type, gene location, etc. Users can download results in a text file with a download result button, the flanking region of the selected SSRs can be viewed with the 'get sequence' option, and the 'design primer' button can be used to design primers for selected SSRs. The primer page displays designed primers in the flanking region and results can be downloaded in text format. The designed primers can be searched for cross-species transferability using an e-PCR option. The tool tab provides three tools: SSR prediction, BLAST, and JBrowse. The tool miSATminer has been implemented with custom scripts to design the SSR markers for user input sequences, whereas users on the BLAST search page can execute similarity searches. The 13 legume species genome sequences can be viewed with gene and SSR coordinates on the chromosome using the JBrowse tool option. The Help tab contains a tutorial for using the database efficiently and frequently asked questions. A workflow of searching legumeSSRdb is illustrated in (Figure 1). users on the BLAST search page can execute similarity searches. The 13 legume species genome sequences can be viewed with gene and SSR coordinates on the chromosome using the JBrowse tool option. The Help tab contains a tutorial for using the database efficiently and frequently asked questions. A workflow of searching legumeSSRdb is illustrated in (Figure 1).

Discussion
LegumeSSRdb is the first comprehensive database of legumes represented by 13 species containing 3,706,276 in silico predicted SSR markers. This study clearly shows that SSR mining can be performed more effectively via the computational approach [22]. These markers are ubiquitously spread across the whole chromosome sets; therefore, they may be a better example in the form of a molecular marker for the study of variability analysis. This strategy has the advantage of location specificity over chromosome, and it can be used as a specific gene molecular marker [23]. Designed SSR primers may be used to identify the QTL/candidate gene, generate a linkage map, hybrid development, and characterization of germplasm. Many studies have been reported on the use of microsatellite markers for mapping various quantitative traits in plants [24][25][26][27][28]. The SSR markers present in the flanking region of the genes can be used for marker-assisted selection (MAS) and germplasm improvement [29]. SSR markers have been used in the characterization of genotypes for early leaf spot (ELS) resistance, yellow mosaic virus resistance genes, and to identify QTLs for flowering traits and other traits of interest [30,31]. Varieties with identical morphological features are highly difficult to discern from the phenotypic study. Previous research has used SSRs to address these difficulties for variety characterization, linkage mapping, trait improvement, molecular breeding, hybrid cultivar development, improved variety development phylogenetic, and taxonomic comparisons [32,33]. Several whole genomes of numerous plant species are available in the public domain and offer the potential to research the cross-species transferability in closely related plants; it can

Discussion
LegumeSSRdb is the first comprehensive database of legumes represented by 13 species containing 3,706,276 in silico predicted SSR markers. This study clearly shows that SSR mining can be performed more effectively via the computational approach [22]. These markers are ubiquitously spread across the whole chromosome sets; therefore, they may be a better example in the form of a molecular marker for the study of variability analysis. This strategy has the advantage of location specificity over chromosome, and it can be used as a specific gene molecular marker [23]. Designed SSR primers may be used to identify the QTL/candidate gene, generate a linkage map, hybrid development, and characterization of germplasm. Many studies have been reported on the use of microsatellite markers for mapping various quantitative traits in plants [24][25][26][27][28]. The SSR markers present in the flanking region of the genes can be used for marker-assisted selection (MAS) and germplasm improvement [29]. SSR markers have been used in the characterization of genotypes for early leaf spot (ELS) resistance, yellow mosaic virus resistance genes, and to identify QTLs for flowering traits and other traits of interest [30,31]. Varieties with identical morphological features are highly difficult to discern from the phenotypic study. Previous research has used SSRs to address these difficulties for variety characterization, linkage mapping, trait improvement, molecular breeding, hybrid cultivar development, improved variety development phylogenetic, and taxonomic comparisons [32,33]. Several whole genomes of numerous plant species are available in the public domain and offer the potential to research the cross-species transferability in closely related plants; it can aid in the cloning of candidate genes from different species and orthologous loci within different species.
Since SSRs have many significant applications, a need exists for a full-fledged database resource of microsatellite marker development for legume species. Although some web resources provide information related to legumes, they lack microsatellite markers, SSR prediction, e-PCR primer design, visualization, and other related information. For example, in PMDBase, there are 15 species of legumes present with marker data, but this tool lacks advanced microsatellite search criteria; however, legumeSSRdb provides advanced search criteria such as frequency, motif type, repeat type, chromosome range, etc. Other than legumeSSRdb, no database provides a tool for checking the cross-species transferability of markers. As shown, our database could serve as a complementary resource to further advance plant genomics and breeding research in the legume species. The comparison of legumeSSRdb with similar existing tools is illustrated in Table 3. Table 3. Comparison of legumeSSRdb with other related databases.

Features
legumeSSRdb CicArMiSatDB Legumeinfo LegumeIP PMDbase visualization of data was implemented using several JS chart libraries, such as ChartJS, Morris, Flot, C3 charts, etc. Primer3 was implemented for real-time primer-designing. Genomic sequence and annotation visualization was implemented through JBrowse. For the SSR prediction to a user-given query, the miSATminer script was implemented in the backend. NCBI local BLAST and e-PCR was also implemented for similarity search and cross-species transferability, respectively. The overall workflow of the web resource is presented in Figure 1.

Conclusions
LegumeSSRdb is the first SSR database that has been designed with cutting edge GUI features and integrates all the services required for performing end-to-end marker selection in legume species. LegumeSSRdb contains 3,706,276 putative microsatellites from 13 legume species. These markers can be used with cross-species transferability to cater to the need for molecular markers for legume species where the whole-genome sequence data are not available. This genomic web resource can be of great value to the global community. The webserver facilitates the ability to search, predict, analyze, and visualize the SSRs of 13 legume species. Features exists such as the ability to create custom primers based on user-defined amplicon size and in silico validation, the real-time graphical visualization of SSR results, exploring SSR sequences by adjusting the flanking region, and identifying related genes and exploring their functional annotation information which is unique to legumeSSRdb. This web server also allows users to submit a genomic sequence of their interest to predict SSRs by adjusting parameters and design primers. With highperformance cluster computing on the backend, legumeSSRdb provides quick and flawless performance. This can be used for chromosome-wise microsatellite locus mining and primer designing for genic and non-genic FDM-SSRs for rapid genotyping (FDM, functional domain markers). This can also be used to facilitate the detection of polymorphisms by e-PCR that is most economically needed in future re-sequencing ventures. This web resource is not limited to knowledge discovery research such as genetic linkage mapping, QTL identification, etc., but can also be used for marker-assisted breeding and germplasm improvement of legume species. This web resource will be extended to more legume species in the future. The webserver is freely available at (http://bioinfo.usu.edu/legumeSSRdb/).

Funding:
The authors acknowledge the support to this study from faculty start-up funds to R.K. from the Center for Integrated BioSystems/Department of Plants, Soils, and Climate, USU. This research was also supported by the Utah Agricultural Experiment Station (UAES), USU, and approved as journal paper number 9531. The funding body did not play any roles in the design of this study or collection, analysis, and interpretation of data or in writing of this manuscript.

Institutional Review Board Statement: Not Applicable.
Informed Consent Statement: Not Applicable.

Acknowledgments:
The authors thank the anonymous referees for suggestions and help in improving the research article.