Development of Genome-Wide SSR Markers from Angelica gigas Nakai Using Next Generation Sequencing

Angelica gigas Nakai is an important medicinal herb, widely utilized in Asian countries especially in Korea, Japan, and China. Although it is a vital medicinal herb, the lack of sequencing data and efficient molecular markers has limited the application of a genetic approach for horticultural improvements. Simple sequence repeats (SSRs) are universally accepted molecular markers for population structure study. In this study, we found over 130,000 SSRs, ranging from di- to deca-nucleotide motifs, using the genome sequence of Manchu variety (MV) of A. gigas, derived from next generation sequencing (NGS). From the putative SSR regions identified, a total of 16,496 primer sets were successfully designed. Among them, we selected 848 SSR markers that showed polymorphism from in silico analysis and contained tri- to hexa-nucleotide motifs. We tested 36 SSR primer sets for polymorphism in 16 A. gigas accessions. The average polymorphism information content (PIC) was 0.69; the average observed heterozygosity (HO) values, and the expected heterozygosity (HE) values were 0.53 and 0.73, respectively. These newly developed SSR markers would be useful tools for molecular genetics, genotype identification, genetic mapping, molecular breeding, and studying species relationships of the Angelica genus.

The root of A. gigas contains active compounds, such as decursin, and decursinol angelate [2], which is successfully used in treating gynecological and anemic health issues [3,4]. They are also used for inducing immune system, and for constipation relief [1,5]. Herbalists consider it equivalent to ginseng due to its remedial properties for several diseases including treating cancer [6], an anti-oxidant [7], neuro-protective, anti-flu, anti-hepatitis [1], and anti-bacterial activities [8]. The A. gigas root also contains important chemical compounds like pyranocoumarins, essential oils, and poly-acetylenes.
Due to the scarcity of efficient molecular markers, there is a dearth of information about the genetic relationships and genetic diversity among breeding populations of Angelica crops [9]. Thus, to address these issues, during the past 30 years many kinds of markers have been developed and used for crop genetics and breeding [10]. The improvement in DNA sequencing and genotyping in conjunction with the development of computer programs have helped the design of molecular markers such as simple sequence repeats (SSRs) in minor crops [11].
Molecular markers are widely used in various areas such as genetic-diversity characterization, gene-flow studies, quantitative-trait locus (QTL) mapping, and evolutionary studies [12]. Accumulated plant genome sequences have accelerated single nucleotide polymorphism (SNP) marker development. Numerous SNP markers have been successfully used to develop various crop cultivars [13]. However, SNP marker technology requires special equipment and it is costly. In contrast, SSR markers are very convenient and easy to use. SSR markers are tandemly repeated nucleotide sequence motifs [14]; that are dispersed throughout the genome and may exhibit locus-specific codominance and high polymorphism rates, and have a high level of transferability [15][16][17]. Furthermore, molecular markers can be used to distinguish cultivars or plants originating from different locations. SSR markers are effective tools that play an important role in plant cultivar identification and genetic diversity analysis [18]. The next generation sequencing (NGS) technologies can be used for fast and cost-effective SSR discovery in many crops [19].
The Angelica species used in conventional medicine varies by country according to specific regulations, i.e., A. gigas in Korea, A. sinensis in China, and A. acutiloba in Japan. Because of the similarity between the names among Angelica, they can be confused in the market [20]. SSR markers are especially useful to distinguish closely related genotypes. This is the reason that SSR markers are widely used in genetic studies and identification of closely related cultivars. By using these SSR markers, various problems of Angelica genus can be solved [21].
In this study, we developed polymorphic SSR markers from several A. gigas accessions using NGS and bioinformatics approaches. The SSR markers developed in this study could be used for genetics and the breeding of Angelica genus.

Plant Material and DNA Isolation
Sixteen widely cultivated A. gigas accessions were collected in Korea (Table S1). All of the collected roots or seeds were germinated and grown in the Chungbuk National University greenhouse. Young leaves or fresh seedlings were used for extraction of the genomic DNA (gDNA), which is required for library construction and sequencing. Fresh leaves were ground with liquid nitrogen and DNA was extracted using a DNeasy Plant Mini kit (Qiagen, Valencia, CA, USA) according to the manufacturer's instructions.

Genomic DNA Sequencing and Sequence Assembly
The libraries were constructed from five A. gigas accessions: Manchu variety (MV), Hwangje variety (HV), Gangwon local variety (GLV), Jecheon local variety (JLV), and Sancheong local variety (SLV). Sequencing size was one lane for MV and 1/3 lane for the other four varieties. Sequencing libraries for DNA samples were prepared by a TruSeq Nano DNA Sample Prep Kit, according to the manufacturer's instructions (Illumina, Inc., San Diego, CA, USA). The libraries were subjected to paired-end sequencing with a 101 bp read length using the Illumina HiSeq 2500 platform (Illumina). Errors of short reads for each sample were corrected by using the SOAPec part of SOAPdenovo2 version 2.04 [22]. After error correction, short reads corresponding to ≥13× of genome size, estimated on the basis of a k-mer frequency spectrum, were assembled by using SOAPdenovo2 with the adjustment of k-mer size of 53. Repetitive DNA sequences, such as microsatellites, transposable elements, and rDNAs, were screened in the assembled contigs by using RepeatMasker version 4.0.5 [23] and RepeatModeler version 1.0.8 [23]. From the resulting repeat-masked sequences, gene models were built by ab initio prediction coupled with transcript alignment using AUGUSTUS version 3.1 [24], with the training gene set derived from the genome of Platycodon grandiflorum (unpublished), which belongs to the subclass Asterids. Genes were searched against UniProt and NCBI non-redundant (NR) protein databases using BLASTX, with a cutoff E-value of 10 −10 . Protein domains in genes were searched by using InterProScan version 5.17-56.0 [25].

SSR Findings and Primer Designs
Simple sequence repeats were identified by using SSR Finder [26] with the following parameters: (1) microsatellites consisting of tandem repeats of 2 to 10 bp with a minimum of five repeats; and, (2) no variation (mutation) in repeat motifs was permitted. Primers were designed from flanking sequences of SSR microsatellite loci by using Primer3 version 2.2.3 [27] with the following parameters: primer length = 18-26 bp (Opt. 23 bp); GC % = 50-70% (Opt. 60%); temperature = 55-62 • C (Opt. 58 • C); and, product size range = 150-200 bp. Repetitive DNA, as well as duplicate sequences, were excluded and selected primers from the unique genomic regions. Forward primers were labeled with a virtual dye 6-FAM, NED, VIC, or PET (Applied Biosystems, Foster City, CA, USA).

In Silico SSR Polymorphism Screening
In silico SSR polymorphism screening was performed by using CLC Main Workbench version 6.8.4 (CLC bio, Aarhus, Denmark) BLAST tool. HV, GLV, JLV and SLV sequences were assembled by using CLC Genomics Workbench version 7.5 (CLC bio, Aarhus, Denmark) and prepared for BLAST data base. The SSR regions (450 bp each) identified from MV sequences were used as reference sequences in a BLAST search. BLAST analyses were performed by the following parameters: BLAST program = blastn: DNA sequence and database; Target = BLAST database (HV, GLV, JLV and SLV sequences); Number of threads = 1; Choose filter = Filter low complexity; Expect = 10.0; Word size = 3; Match/mismatch cost = match 1, mismatch −3; Gap costs = Existence 5, Extension 2; and, Max number of hit sequences = 250 ( Figure S1).

PCR Amplification and Genotyping
The PCR reaction mixture (20 µL) consists of 20 ng of gDNA, 1× HSTM Taq DNA polymerase buffer, 1.5 mM MgCl2, 0.2 mM of each dNTP, 0.2 M of each primer, and 1.25 units HSTM Taq DNA polymerase (Dongsheng, Göttingen, Germany). DNA amplification was performed by using the PCR system (BIO-RAD T-100 Thermal Cycler) according to the following: 5 min for initial denaturation at 95 • C, 34 cycles of 30 s at 94 • C, 30 s at 58-60.5 • C, 30 s at 72 • C, concluding with 1 cycle of 30 min at 72 • C. PCR products were separated in a 4% agarose gel to check the amplification and 0.2 L of PCR products were mixed with 9.8 L Hi-Di formamide (Applied Biosystems, Foster City, CA, USA), and 0.2 L of GeneScanTM 500 LIZ ® size standard (Applied Biosystems). The PCR product mixture was denatured at 95 • C for 5 min and kept on ice. Subsequently, the mixture was separated by capillary electrophoresis on the ABI 3730 DNA analyzer (Applied Biosystems) by using a 50 cm-capillary array with a DS-33 install standard as a matrix. We analyzed the amplified fragments by size with GeneMapper software version 4.0 (Applied Biosystems). The number of alleles (N A ), frequency of major alleles (M AF ), observed heterozygosity (H O ) values, expected heterozygosity (H E ) values, and polymorphic index content (PIC) values were measured by using PowerMarker software version 3.23 [28]. Coefficients of genetic similarity were calculated by using the molecular evolutionary genetics analysis (MEGA) program version 5.05 [29]. Genetic distance was computed by using the SharedAllele distance method with PowerMarker software version 3.23.

Characteristics of Genomic SSR in the NGS Assemblies of A. gigas
Libraries were constructed from the five A. gigas accessions ( Table 1) and subjected to paired-end sequencing with a 101 bp read length by using an Illumina HiSeq 2500 platform. The SOAPdenovo2 assembler (version 2.04) was used to assemble the A. gigas genome de novo, generating 395,007 scaffolds representing 804 Mb from the A. gigas (MV) genome. MV is a breed that was developed based on the root hypertrophy and slow-bolting characteristic from A. gigas. Thus, we selected MV as a representative A. gigas variety. The other four varieties were selected due to their diverse geographical origins. The four varieties were used for in silico polymorphism analysis. Analysis of k-mer frequency spectrum revealed a peak (k-mer depth: 11) in the 17 k-mer depth distribution of reads, suggesting a low heterozygosity across the genome ( Figure S2). Based on the k-mer frequency analysis, the genome size of MV was estimated to be approximately 2.67 Gb (No. of k-mer/depth of k-mer at peak). Conclusively, we found, 138,113 SSRs from the de novo assembly of the MV genome. The total number of 121,112 di-nucleotide, and 13,211 tri-nucleotide SSRs constituted 87.691% and 9.565% of SSRs, respectively, followed by tri-, tetra-, penta-and hexa-nucleotide motifs (Table S2).
The MV scaffolds including the SSRs were subjected to BLAST analysis against the assembled contigs of four other varieties (HV, GLV, JLV, and SLV). Polymorphic SSR markers showed two to five different repeat length scaffolds from the five A. gigas accessions. Consequently, a total of 848 polymorphic SSR markers were selected from the in silico analyzed tri-nucleotide to hexa-nucleotide motifs (a total of 2379 SSR primer sets).
We selected 48 polymorphic SSR markers, from 848 in silico analyzed primer pairs, which were showing four or more different repeat length scaffolds types, or more than 15 bp of nucleotide length difference (Table S2; Figure S1) and further used for genotyping of five A. gigas accessions (MV, HV, GLV, JLV, and SLV). Thirty-six SSR primer sets yielded balanced, intact, reproducible, and polymorphic amplicons in 4% agarose gels. The remaining 12 SSR primer sets were amplified more than two bands or showed unexpected amplicon sizes. The 36 SSR primer sets were deposited to GenBank (Table 2). Genomic regions of the 36 SSR loci were also analyzed on the basis of ab initio gene identification in the assembled sequences, and 16 (44.4%), 9 (25.0%), 9 (25.0%), and 2 (5.6%) were found from intergenic, intron, coding DNA sequence (CDS), and 5 untranslated region (5'UTR), respectively (Table S3).
Subsequently, 16 A. gigas accessions were used to detect the polymorphisms of the 36 SSR loci after one of the primers from each primer set was labeled. Polymorphism estimation indicated that the  (Table 3). Thirty-six SSR markers showed a high PIC value (PIC > 0.6) and, based on our results, we constructed the genetic relationship of 16 A. gigas accessions. Phylogenetic trees were constructed for the 16 A. gigas accessions by unweighted pair group method with arithmetic mean (UPGMA) analysis ( Figure S3).
The study of genetic diversity and genetic relationships of crops can assist breeders in the development of economically effective cultivars. SSR markers have been used in many areas, including linkage map development, QTL mapping, marker-assisted selection, cultivar fingerprinting, genetic diversity, gene flow, and evolutionary studies in the botanic sciences [30][31][32]. NGS technology is a modern technique used to discover large-scale genetic polymorphisms [32][33][34]. Since NGS technology allows cost-effective and higher throughput genome sequencing, it has been universally applied for many crop species, including Allium fistulosum L., Sesamum indicum L., and Vicia sativa subsp. Sativa [35,36]. Recently, a study on the development of 18 polymorphic microsatellite markers for A. sinensis using NGS technology has been reported although it was focused on species identification [37]. In this study, we developed SSR markers that could be used for genetics, breeding, and diversity analyses of Angelica species.
The selection of isolated SSR loci from the genome or transcriptome data would be useful to analyze data, project goals, and future plans [34]. Our mass production method of SSR markers using an in silico approach, and resulted in 36 new polymorphic SSR markers from A. gigas. The developed markers from this study are expected to be useful in crop genetics and breeding. Even though the remaining 12 SSR markers amplified multiple bands or yielded bands of unexpected size, they were also polymorphic in the sequence. Therefore, the 848 SSR marker candidates discovered through this research are expected to represent informative markers.
Even though we restricted the purpose of this study on the selection of polymorphic SSR markers for the genetic diversity analysis of the genetic resources, application of the primers to more diverse Angelica species resources can enlarge the usage of the developed markers. In addition, we hope that the A. gigas sequence data of this study can be used to develop various molecular markers, such as SNP and insertion-deletion (InDel) markers, in the future.      Figure S1: Different repeat length scaffold types of four or more, and SSR markers of more than 15 bp nucleotide length difference, Figure S2: The k-mer depth distribution of whole-genome sequencing reads of Angelica gigas, Figure S3: Dendrogram generated using UPGMA cluster analysis based on genetic diversity of 16 Angelica gigas Nakai accessions, Table S1: List of 16 Angelica gigas accessions and information of collection sites, Table S2: Details of Manchu variety accession repeat genome assemblies, Table S3: Number of designed SSR primer sets and polymorphic SSRs discovered in silico.