A Novel Software and Method for the Efficient Development of Polymorphic SSR Loci Based on Transcriptome Data

Traditional methods for developing polymorphic microsatellite loci without reference sequences are time-consuming and labor-intensive, and the polymorphisms of simple sequence repeat (SSR) loci developed from expressed sequence tag (EST) databases are generally poor. To address this issue, in this study, we developed a new software (PSSRdt) and established an effective method for directly obtaining polymorphism details of SSR loci by analyzing diverse transcriptome data. The new method includes three steps, raw data processing, PSSRdt application, and loci extraction and verification. To test the practicality of the method, we successfully obtained 1940 potential polymorphic SSRs from the transcript dataset combined with 44 pea aphid transcriptomes. Fifty-two SSR loci obtained by the new method were selected for validating the polymorphic characteristics by genotyping in pea aphid individuals. The results showed that over 92% of SSR loci were polymorphic and 73.1% of loci were highly polymorphic. Our new software and method provide an innovative approach to microsatellite development based on RNA-seq data, and open a new path for the rapid mining of numerous loci with polymorphism to add to the body of research on microsatellites.


Introduction
The increasing progress of next generation sequencing (NGS) has promoted the explosion of transcriptome data, providing large-scale essential data for the application of molecular markers [1][2][3]. Microsatellites, also called simple sequence repeats (SSRs) or short tandem repeats (STRs), are among the most popular markers, and have been widely used in the analysis of population genetics due to such advantages as wide distribution, high polymorphism, and satisfactory repeatability [4][5][6][7][8]. The applications of SSR markers can obtain abundant and reliable experimental data based on the hypermutation of SSRs. However, traditional methods of polymorphic SSR isolation and characterization without reference sequences are expensive, time-consuming and labor-intensive [9,10]. Many researchers have attempted to optimize and streamline microsatellite experiments [11][12][13][14][15]. For instance, the application of multiplex PCR can significantly reduce the time and cost of SSR genotyping [11], and the success rate of tests can be increased by referring to the propositions concerning the primer design of SSR loci [12]. In addition, to reduce the risk of failure, SSR loci over a certain repeat time were presumed to be polymorphic for follow-up research, resulting in large loci with fewer repeats being overlooked [16].

Development of a New Software
To establish a novel method for screening polymorphic SSR loci using multiple transcriptomes, we first developed one software (Polymorphic SSR digging tool, PSSRdt, publicly available at https: //github.com/PSSRdt/program). This program is based on Perl and available in the input files in FASTA format, which is compatible with multiple systems including Linux and Windows. The usage of PSSRdt is consistent with the regular Perl scripts and is also listed in the software manual. The program primarily depends on the data primitiveness of transcripts assembled using the de novo assembly method. PSSRdt first seeks out SSRs from the input file. The criteria of SSR detection refer to microsatellite searching program-MISA (available online at http://pgrc.ipk-gatersleben.de/misa/misa.html) (Table S1). Unlike some specialized software for SSR identification such as MISA or SSRLocator [42], this program does not need to distinguish whether SSRs are perfect (with single simple repeats) or imperfect [43]. SSR motifs screened with their flanking sequences are then recorded as the hash 'keys' and will be assessed as one SSR locus for the moment, the repeat number of which will be logged as their hash 'values'. The various SSR loci excavated from the input file combined by multiple transcriptomes can be classified and their repeat details will be quickly saved and listed. Thus, if only a few RNA-seq data are available, such as one to five, the digitals in 'values' will be not enough for analysis. In addition, the length of each flanking sequence is determined by the judgment of the users according to thespecific data assembled quality (usually over eight to improve the accuracy). PSSRdt needs to call the scripts in Bioperl (available online at https://bioperl.org/). Users should first ensure the setup success of Bioperl modules, where the common installation methods of the modules are listed in the software manual.

Screening and Verifying of SSR Obtained by PSSRdt
After running PSSRdt, users will obtain two output files containing the details of total detected SSRs and unverified polymorphic ones respectively (Figure 1 Step 2). Both files consist of three column contents. Each row represents the details of all detected SSRs in the assumed same SSR locus, which includes, in turn, the SSR motif with its flanking sequences of that locus, the sum of all SSRs in the locus and the repeat number of each SSR in that locus. The highly variable numbers of repeat motifs shown in the third column indicate the site with relatively high polymorphism. Therefore, users could select the loci with many different repeat numbers to improve the efficiency of subsequent tests. Based on SSRs with the flanking sequences shown in 'FindStr.result', users can quickly verify the correctness of polymorphism information and obtain the complete sequences of these loci using a string search function in any text editor. Meanwhile, users could check SSR loci carefully and extract the applicable sites in the same way. The most important items the users need to check are listed in Step 3 of Figure 1, including estimating whether the length of flanking sequences of SSR motifs meets criteria for primer design (the flanking sequences of Sample 1 could fill the primer design, while Sample 2 could not) and the consistency of all sequences in assumed same loci (the flanking sequences in Sample 4 are consistent with corresponding sequences in Sample 3, which are identified as the same SSR locus. Sample 5 and 3 are not the same locus).

Overall Flow of the Novel Method
A novel method was constructed to efficiently excavate potential polymorphic microsatellite loci using multiple transcriptomes, which was divided into three steps ( Figure 1). First, in Step 1, different transcriptomes of an organism are collected and downloaded. After quality control and data filtering of

Overall Flow of the Novel Method
A novel method was constructed to efficiently excavate potential polymorphic microsatellite loci using multiple transcriptomes, which was divided into three steps ( Figure 1). First, in Step 1, different transcriptomes of an organism are collected and downloaded. After quality control and data filtering of raw data, de novo assembly software is applied to the high-quality clean reads. All transcripts assembled in FASTA format are then merged into one dataset, which will be calculated by PSSRdt in Step 2. Two output files will be immediately generated, including the details of all excavated SSR loci (called 'FindStr.result.detail') and the potential polymorphic sites (called 'FindStr.result'). Users can select some potential polymorphic SSR loci from the 'FindStr.result' file. In Step 3, SSRs with complete flanking sequences can be simply and visually extracted using the details of these SSR loci by the text editors and the verification of sequence information accuracy can be performed simultaneously.

RNA-Seq Data Assembly
To intuitively present and test the method, the experiments were conducted with pea aphids as samples. The raw reads of pea aphid transcriptomes were filtered to generate clean reads by removing adapter sequences, low-quality reads (quality scores < 30), reads with unknown bases 'N' and < 30 bp reads. All raw reads were assembled into transcripts using Trinity [44], the short reads assembling program. We integrated all 44 transcriptomes assembled into one dataset after de novo assembly. The 'cat' command on Linux was used to realize this step. The dataset was then computed by PSSRdt (parameter: 10) as an input file. Sublime Text 3 was performed to check the SSR details generated by PSSRdt and extract 100-300 bp sequences containing SSRs (each flanking sequence ≥ 50 bp).

Samples and Primers
Pea aphid individuals were used for validation of the SSR polymorphism. These individuals are randomly taken from five populations originated from different locations and were fed by dactylethrae with broad bean plants in phytotron. The cultivation conditions were set to temperature 24 ± 1 • C, relative humidity 60% ± 10%, photoperiod L:D = 16 h:8 h.
We tested 52 SSR loci with high polymorphisms in the 'FindStr.result' file for validation by genotyping. SSR primers were designed by PRIMER3 [45] and the options for primer design refer to the following: (1) primer lengths ranging from 18 to 27 bp; (2) product sizes are 100-300 bp; (3) melting temperature (Tm) is 57 • C to 63 • C and the difference of Tm between forward and reverse primers < 2 • C; (4) GC content 40-60%, with an optimal value of 50%. If no primers were found with these options, the parameter range was adjusted: Tm 45-63 • C; the difference between forward and reverse primers < 5 • C; GC content 30-70%. The study referred to the microsatellite PCR fragment fluorescent labeling method [46]. Three PCR primers are needed to amplify each microsatellite locus by this method. The first primer was the 5 -end of the forward primers (F) adding M13 forward primer (5'-TGTAAAACGACGGCCAGT-3'); the second primer was the reverse primer (R) without any modification; and the third primer was a M13 forward primer labeled by 6-carboxy-fluorescine (FAM) at its 5'-end.

PCR Amplification and Statistical Analysis
We extracted genomic DNA from individual A. pisum sample using the Easy Pure Genomic DNA kit (TransGen Biotech Co., Ltd., Beijing, China). Five individuals were randomly selected from each of the five populations for DNA extraction. Because of the small body size, the DNA extracted from a single pea aphid sample can only be used for the verification of 15 pairs of primers. Therefore, after the DNA was used up, we took another five individuals from each of the population for the next 15 primers, amounting to 25 pea aphid individuals from each population were used for validation of the 52 SSR loci selected. The DNA extracted was tested by 1.0% agarose gel electrophoresis to estimate the quality. Each 25 µL PCR included 1. The PCR products of microsatellite DNA were detected by 2% agarose gel electrophoresis. The products containing target bands were sent to Sangon Biotech for microsatellite genotyping detection. The PCR products with fluorescent labels were taken for fluorescence detection using capillary electrophoresis method in an ABI3730XL DNA automated analyzer (Applied Biosystems, Foster City, CA, USA). The obtained peak electrophoregrams were converted into the fragment length of amplified products using GENEMAPER v4.0 software (Applied Biosystems, Foster City, CA, USA). Lastly, the number of alleles (Na), polymorphism information content (PIC) and other analyses for evaluating polymorphism of SSRs were conducted by PowerMarker v3.25 [47].

Transcriptome Data Assembly and Microsatellite Detection
To intuitively present the overall processes of the proposed method, all results of verification experiments on pea aphids are provided. A total of 44 representative pea aphid transcriptomes were downloaded from the NCBI SRA public database (Table 1). Nine institutions submitted the RNA-seq data. Using the data, we obtained 3,387,696 to 126,263,308 raw reads, and 2,829,317 to 108,646,204 high quality clean reads were then generated by data filtering for next de novo assembly (Table S2). Running Trinity, the various pea aphid transcriptomes were assembled into 7634 to 68,321 transcripts and the total sizes of sequences ranged from 2,076,016 bp to 42,402,758 bp (Table 1). SSR analysis of transcriptomes using MISA is also exhibited in Table 1. In addition, the number of trinucleotide repeat motifs in all transcriptomes exceeded the dinucleotide repeats but were less than the mononucleotides (Figure 2a). SSR motifs of AT/AT, AG/CT, and AAT/ATT were dominant in dinucleotide and trinucleotide microsatellites. The PCR products of microsatellite DNA were detected by 2% agarose gel electrophoresis. The products containing target bands were sent to Sangon Biotech for microsatellite genotyping detection. The PCR products with fluorescent labels were taken for fluorescence detection using capillary electrophoresis method in an ABI3730XL DNA automated analyzer (Applied Biosystems, Foster City, CA, USA). The obtained peak electrophoregrams were converted into the fragment length of amplified products using GENEMAPER v4.0 software (Applied Biosystems, Foster City, CA, USA). Lastly, the number of alleles (Na), polymorphism information content (PIC) and other analyses for evaluating polymorphism of SSRs were conducted by PowerMarker v3.25 [47].

Transcriptome Data Assembly and Microsatellite Detection
To intuitively present the overall processes of the proposed method, all results of verification experiments on pea aphids are provided. A total of 44 representative pea aphid transcriptomes were downloaded from the NCBI SRA public database (Table 1). Nine institutions submitted the RNA-seq data. Using the data, we obtained 3,387,696 to 126,263,308 raw reads, and 2,829,317 to 108,646,204 high quality clean reads were then generated by data filtering for next de novo assembly (Table S2). Running Trinity, the various pea aphid transcriptomes were assembled into 7634 to 68,321 transcripts and the total sizes of sequences ranged from 2,076,016 bp to 42,402,758 bp (Table 1). SSR analysis of transcriptomes using MISA is also exhibited in Table 1. In addition, the number of trinucleotide repeat motifs in all transcriptomes exceeded the dinucleotide repeats but were less than the mononucleotides (Figure 2a). SSR motifs of AT/AT, AG/CT, and AAT/ATT were dominant in dinucleotide and trinucleotide microsatellites.

PSSRdt Application
The dataset (3137 Mb, millions of base pairs) merged by 44 RNA-seq assembled data contained 1,089,298 transcripts. PSSRdt were executed on Intel i7-4770 1600 MHz with 4 Gb RAM, running on Windows 7. The transcript set in FASTA format was processed in 14 min and produced two documents, which included a total of 16,384 SSR loci (Supplementary File S1) and 1940 potential polymorphic loci (Supplementary File S2) respectively. The analysis of the 'FindStr.result' file in the Supplementary File 2 revealed that the mononucleotide repeats were predominant, reaching 1097 (56.55%). There were 710 (36.60%) more dinucleotide SSR motifs more than trinucleotide SSRs (133, 6.86%) (Figure 2b), which was not inconsistent with the quantity relationship shown in Figure 2a. The highest proportions in mononucleotide and dinucleotide repeats were A/T (1074, 55.33%) and AT/AT (444, 22.87%) respectively. Among trinucleotide SSRs, the AAT/ATT type was most abundant (108, 5.56%).

Primer Design and SSR Loci Validation
Primer sequences are showed in Table S3. The statistics of polymorphism for the 52 SSR loci are presented in Table 2. The number of pea aphids successfully genotyped ranged from 9 to 21 (average 15.211). The number of SSR alleles per locus was 1 to 12, with an average of 5.346. Only four SSR loci failed to show polymorphism among the samples of five geographical populations ( Table 2), indicating that over 92% of loci are polymorphic. The average PIC in 52 loci is 0.575. The PIC of 84.6% (44) and 73.1% (38) loci exceeded 0.25 (reasonably informative) and 0.5 (highly informative), respectively ( Table 2). The PIC values (0-0.25) in the seven types of motifs were under 30% apart from AAT/ATT and CCG/CGG. The percentage of most SSR motifs' PIC, which exceeded 0.5, was not less than 50% (Figure 3). There is no significant difference in the PIC ratio between dinucleotide and trinucleotide SSRs.

Discussion
With the rapid increase in popularity of NGS, an increasing number of transcriptomes are being uploaded on public databases [48,49], signifying that the method using RNA-seq data to excavate SSR polymorphism information will acquire more data support and can be applied to additional species. This method contained complete flows for analysis of transcriptome data, providing an effective path to reduce the workload of biologist. The classical biology experiments to excavate polymorphic SSR loci, such as magnetic beads enrichment and 5′ anchored PCR method, are not easy and inevitably require large workloads and high costs [50,51]. The new idea, that researchers directly obtain polymorphic information from RNA-seq data, may bring about significant progress in the study of SSRs.

Discussion
With the rapid increase in popularity of NGS, an increasing number of transcriptomes are being uploaded on public databases [48,49], signifying that the method using RNA-seq data to excavate SSR polymorphism information will acquire more data support and can be applied to additional species. This method contained complete flows for analysis of transcriptome data, providing an effective path to reduce the workload of biologist. The classical biology experiments to excavate polymorphic SSR loci, such as magnetic beads enrichment and 5 anchored PCR method, are not easy and inevitably require large workloads and high costs [50,51]. The new idea, that researchers directly obtain polymorphic information from RNA-seq data, may bring about significant progress in the study of SSRs.
For the sake of controlling cost and workload, researchers were previously prone to using SSRs with more repeats because of the probability of SSR loci with fewer repeats showing high polymorphism might be unsatisfactory, resulting in a number of SSRs being ignored [10]. The proposed method directly excavates the polymorphism details to avoid the loss. During the test on pea aphids, the repeated number of SSR motifs in outputs from PSSRdt was not emphasized. Among the 52 loci tested, the repeat number of nine dinucleotide SSR motifs was entirely less than 12, and eleven trinucleotide repeats were all below 10 in the transcript set, and most of those loci were polymorphic (8, 88.9%; 11, 100%). There were six (66.7%) and seven (63.6%) dinucleotide and trinucleotide SSRs that had a PIC over 0.5. It is believed that researchers could acquire more polymorphic SSR loci for the analysis of population genetics when they adopt this method on the basis of those particular results. In addition, multiple sequences at the same SSR loci can be extracted from the sets of diverse transcriptomes, which provides more complete sequences for the design of PCR primers and reduces the impact of assembly errors.
PSSRdt can rapidly complete microsatellite detection and produce two different files, thereby helping users manage different issues and simplify workload. However, if part of the flanking sequences at the same locus of SSRs is mutated, the microsatellite loci with mutations will not be matched with others. Thus, missing a few potential polymorphic loci is unavoidable. Besides, the principle of the program regarding the identification of SSRs is based on whether the number of tandem repeats above certain thresholds; therefore, it is unable to distinguish between perfect and imperfect SSRs. Fortunately, the minimum repeat time of the imperfect SSRs is very close to or not lower than the thresholds [52,53], thereby leaving the imperfect loci generally unburied.
Although, transcriptome assembly is a complex task and has certain requirements for server hardware. In this study, we downloaded 44 pea aphid transcriptomes submitted in SRA database of A. pisum, and analyzed the backgrounds of RNA-seq data, including the experimental objectives and methods, the attributes of the samples, and the submitted institutions. In fact, there is no need to download and assemble all transcriptomes of research objects when the amount of data is abundant. Many cDNA libraries for RNA-seq have multiple duplications, and researchers can choose part of the raw data to save time. In addition, many factors can generate a large influence on microsatellite alteration, such as long-term pesticide treatment and extreme temperature. Thus, using more representative data that samples through different types of treatments can improve the possibility of polymorphic loci mining.
Many new polymorphic SSR loci were obtained in pea aphids and the subsequent experiment verified the efficiency of the method. This approach can be used for more species with sufficient transcriptome data. Numerous new SSR loci with polymorphisms in various species will be found and some research concerning microsatellites may be extended with the data support, for instance, the distribution rules and structural features of SSRs with polymorphisms in functional regions of genes, the influences of external environment on SSR mutations, and characteristics of SSR alleles among different species [54][55][56]. Moreover, examining the vast new loci among diverse species might be valuable to researchers studying the laws of SSR mutations during biological evolution [57].

Conclusions
In this study, a novel software and method were presented to efficiently excavate polymorphic SSR loci from RNA-seq data and tested on A. pisum. This concise method includes three stages: raw data processing, program development and application, and loci extraction and verification. The method provides a clear range for polymorphic loci mining and the experiment success rate was high compared with the traditional methods using RNA-seq data. PSSRdt was especially designed for SSR detection, which was better than the indel analysis software for SSR studies. The novel method provides a new path for rapidly screening numerous polymorphic SSR loci and abundant data for further studies of SSRs.
Supplementary Materials: The following are available online at http://www.mdpi.com/2073-4425/10/11/917/s1, Table S1: The criteria for SSR detection, Table S2: Statistics of RNA-seq raw reads and clean reads after data filtering, Table S3: Characteristics of 52 microsatellite loci developed for A. pisum, File S1 and File S2 are the two output files generated by PSSRdt (findStr.result and findStr.result.detail).
Author Contributions: R.T. conceived and designed the experiments, and R.T. and M.C. wrote the manuscript; X.P., X.G., Y.H., and C.Z. participated in the related experiments.