Identification of Genic SSRs Provide a Perspective for Studying Environmental Adaptation in the Endemic Shrub Tetraena mongolica

Tetraena mongolica is a xerophytic shrub endemic to desert regions in Inner Mongolia. This species has evolved distinct survival strategies that allow it to adapt to hyper-drought and heterogeneous habitats. Simple sequence repeats (SSRs) may provide a molecular basis in plants for fast adaptation to environmental change. Thus, identifying SSRs and their possible effects on gene behavior has the potential to provide valuable information for studies of adaptation. In this study, we sequenced six individual transcriptomes of T. mongolica from heterogeneous habitats, focused on SSRs located in genes, and identified 811 polymorphic SSRs. Of the identified SSRs, 172, 470, and 76 were located in 5′ UTRs, CDSs, and 3′ UTRs in 591 transcripts; and AG/CT, AAC/GTT, and AT/AT were the most abundant repeats in each gene region. Functional annotation showed that many of the identified polymorphic SSRs were in genes that were enriched in several GO terms and KEGG pathways, suggesting the functional significance of these genes in the environmental adaptation process. The identification of polymorphic genic SSRs in our study lays a foundation for future studies investigating the contribution of SSRs to regulation of genes in natural populations of T. mongolica and their importance for adaptive evolution of this species.


Introduction
Tetraena mongolica (2n = 2x = 28) [1], a monotypic species of the Zygophyllaceae family, is a xerophytic shrub that originated from the Tethys Ocean, and is now endemic to the Eastern Alxa-Western Ordos desert region in Inner Mongolia, China [2,3]. The area is characterized by hyper-drought conditions, low annual temperatures, and high soil salinity [4]. To adapt to the semi-desert environment, T. mongolica has evolved distinct survival strategies, such as an extensive root system, thick cuticles on its stems, and succulent leaves. T. mongolica has important ecological roles as a windbreak and in the fixation of local ecosystem soils [3]. Moreover, high levels of triacylglycerol in its stems makes it combustible even in the fresh state, so it has been used as firewood and is named locally as "oil firewood" [5,6]. T. mongolica is a typical species that grows in this area and

Plant Materials
Leaves of T. mongolica were collected from six natural populations in June 2015 from the Eastern Alxa-Western Ordos area in Inner Mongolia, China. Twenty individuals from each population were selected at intervals of 30 m to avoid sampling progeny from the same maternal parent. Detailed geographical information about the sampling location is given in Table 1. After collection, the leaf samples were snap frozen in liquid nitrogen, then kept in a −80 • C refrigerator until they were transported to the laboratory.

RNA Preparation, cDNA Library Construction, and Sequencing
T. mongolica leaves (one sample per population) were used for RNA isolation. Total RNA was extracted using a Plant Plus RNA Kit (DP437, Tiangen, Beijing, China), according to the manufacturer's instructions. The extracted RNA was treated with deoxyribonuclease I (TaKaRa Bio Inc., Otsu, Shiga, Japan) for 30 min at 37 • C to remove residual DNA. RNA quality was monitored using an Agilent 2100 Bioanalyzer with a minimum RNA integrity number value of 8.0. Six cDNA libraries (D1-D6) were constructed and sequenced on a paired-end flow cell device using an Illumina HiSeq TM 4000 sequencer at Biomarker Biotechnology Co., Ltd (Beijing, China). The sequencing data were called and quality controlled through the Illumina data processing pipeline.

De novo Assembly and Sequence Clustering
The adaptor reads, low-quality reads, and reads in which the proportion of "N" was greater than 5% were filtered to obtain the clean reads. De novo transcriptome assembly was performed for each sample using Trinity software (v2.4.0). TGICL v2.1 (http://sourceforge.net/projects/tgicl/files/tgicl%20v2.1/) was used to cluster the homologous transcripts and form an unextendible unigene. Finally, to eliminate redundancy and splicing among the six assembled transcriptomes, we further merged the unigenes to obtain non-redundant unigenes (all-unigenes).

Identification of Potential Polymorphic Genic SSRs
CandiSSR was used to identify potential polymorphic genic SSRs among the six unigenes datasets of T. mongolica [39]. The parameters used in the pipeline were: flanking sequence length 100 bp, blast e-value cutoff 1.0 × e −10 , blast identity cutoff 95%, and blast coverage cutoff 95% [39,40]. The SSR search criteria were designed for di-, tri-, tetra-, penta-, and hexa-nucleotide repeats, with a minimum repeat number of six, five, five, four, and four, respectively, while SSRs with mono-nucleotide repeats and motif bigger than six nucleotides were not examined in this study. Primers for the identified SSRs were designed automatically in the pipeline using Primer3 [41].

Location Prediction and Functional Annotation of the Identified SSRs
We first determined the protein-coding sequences (CDSs) of the SSR-containing sequences. The sequences were aligned to protein databases in the priority order of NR, Swiss-Prot, KEGG, and COG using BLASTX (http://blast.ncbi.nlm.nih.gov/Blast.cgi), with a significance threshold e-value <1.0 × 10 −5 . The highest ranked proteins in blast results were used to predict the CDSs of the SSR-containing sequences. The results were then used to identify the distributions of the SSRs. For unigenes with full-length CDSs, the locations of the SSRs were analyzed based on the relative positions of the SSRs with respect to the start (ATG) and stop (TAA, TAG, TGA) codons. For unigenes with partial CDSs, we manually performed nucleotide blast analysis to identify full-length homologous genes and then predicted the locations of the SSRs according to their positions in the query sequences. To obtain functional information about the genic-SSR-containing unigenes, gene ontology (GO) functional categories were identified using the Blast2GO software v2.5.0 (http://www.blast2go.com/b2ghome) and pathways were assigned by sequence searches against the KEGG database using the BLASTX algorithm, with an e-value threshold of 1.0 × 10 −5 .

SSR Polymorphic Validation
Genomic DNA was extracted from T. mongolica leaves (eight individuals per population) using a Plant Genomic DNA kit (DP305, TianGen, Beijing, China). DNA quality was checked on 1% TAE agarose gels and NanoDrop 2000c (Gene Company Limit, Beijing, China). PCR amplifications were performed on the ABI2720 Thermal cycler in 25 µl reaction mixtures that included 1 µl of template DNA (50 ng/µL), 12.5 µL of Premix Taq (TaKaRa Biotechnology Co., Dalian, Liaoning Province, China), 0.5 µl (10 pM) of forward primer, 0.5 µl (10 pM) of reverse primer, and ddH 2 O. PCR amplifications were carried out as follows: 5 min at 94 • C, followed by 30 cycles of 30 s at 94 • C, 30 s at the primer-specific annealing temperature (Table 2), 30 s at 72 • C, and the final extension step at 72 • C for 10 min. For the successfully amplified primer pairs, the 5 end of each forward primer was tagged with one of three fluorescent dyes (6-carboxy-fluorescine, hexachloro-6-carboxy-fluorescine, and 6-carboxy-X-rhodamine) and used for amplifications with the same protocol. The labeled PCR products were analyzed on an ABI 3730xl DNA analyzer with a GeneScan 500 LIZ Size Standard (Applied Biosystems, Beijing, China). Allele sizes were called using GeneMarker v2.6.0 (SoftGenetics, State College, Pennsylvania, USA). Several genetic parameters, including mean number of alleles per locus (N A ) and observed and expected heterozygosity (H O and H E ), were estimated across all the identified loci using GenAlEx software (v6.502) [42]. Polymorphism information content (PIC) of each locus was measured by PowerMarker program (v3.0) [43]. CR, CN, GC, and Ug represent clean read, clean nucleotide, GC content, and unigenes, respectively. Q 20 represents the clean reads that had Phred-like quality scores at the Q 20 level (an error probability of 1%). ML represents the mean length of assembled sequences and N 50 indicates that 50% of the assembled bases were incorporated into sequences with a length of N 50 or longer.

Data Availability
The raw sequencing data have been deposited in the Sequence Read Archive under accession number PRJNA597711.

Sequencing Outputs and de novo Assembly
A total of 51.48 gigabases of clean data were obtained from the six cDNA libraries. As shown in Table 2, the clean reads in all samples had an average CG content of 44.24%, and the quality of these data were quite high, with an average Q 20 percentage nearly of 97.15%. After de novo assembly, 73,977 to 92,301 unigenes in the six datasets were generated, and the mean length and N 50 were 806 bp and 1536 bp, respectively. Finally, the assembled unigenes of the six transcriptomes were further clustered into a total of 119,603 all-unigene sequences, which had an average length of 1098 bp and average N 50 value of 1843 bp.

Location Prediction and Frequency Analysis of the Polymorphic SSRs
Among the polymorphic SSR-containing sequences, 620 were aligned with known proteins and 763 SSRs were embedded in these transcripts. Of these SSRs, 718 were located in the coding and untranslated regions of 591 unigenes. Among them, 172, 470, and 76 SSRs in 150, 372, and 69 unigenes were distributed in 5 UTRs, coding regions, and 3 UTRs, respectively (Table S2). Most of the tri-nucleotide repeats (90.12%) were located in CDSs, while most of the di-nucleotide repeats (91.33%) were located in UTRs. In the CDSs, AAC/GTT, AGC/CTG, AAG/CTT, ATC/ATG, AAG/CTT, ACC/GGT, and ATC/GAT were relatively abundant, together accounting for 62.55% of all the SSRs located in CDSs. AG/CT (62, 36.05%) was the most abundant repeat in 5 UTRs and AT/AT (42, 55.26%) was the most abundant repeat in 3 UTRs ( Figure 1). CR, CN, GC, and Ug represent clean read, clean nucleotide, GC content, and unigenes, respectively. Q20 represents the clean reads that had Phred-like quality scores at the Q20 level (an error probability of 1%). ML represents the mean length of assembled sequences and N50 indicates that 50% of the assembled bases were incorporated into sequences with a length of N50 or longer.

Location Prediction and Frequency Analysis of the Polymorphic SSRs
Among the polymorphic SSR-containing sequences, 620 were aligned with known proteins and 763 SSRs were embedded in these transcripts. Of these SSRs, 718 were located in the coding and untranslated regions of 591 unigenes. Among them, 172, 470, and 76 SSRs in 150, 372, and 69 unigenes were distributed in 5′ UTRs, coding regions, and 3′ UTRs, respectively (Table S2). Most of the trinucleotide repeats (90.12%) were located in CDSs, while most of the di-nucleotide repeats (91.33%) were located in UTRs. In the CDSs, AAC/GTT, AGC/CTG, AAG/CTT, ATC/ATG, AAG/CTT, ACC/GGT, and ATC/GAT were relatively abundant, together accounting for 62.55% of all the SSRs located in CDSs. AG/CT (62, 36.05%) was the most abundant repeat in 5′ UTRs and AT/AT (42, 55.26%) was the most abundant repeat in 3′ UTRs ( Figure 1).

Functional Annotation of the Polymorphic SSR-Containing Sequences
Based on the sequence similarity, the 424 unigenes containing 525 SSRs were mapped to 3561 terms in the GO database by blast. The terms under the three main GO categories were divided among 45 sub-categories. For 5′ UTRs that contained SSRs, the most abundant terms in each main categories were "metabolic process", "cell", and "binding", respectively. For CDSs that contained SSRs, "cellular process" was the most abundant term, followed by "metabolic process", "cell", "cell part", and "single-organism process". For the 3′ UTRs that contained SSRs, the top three terms were "metabolic process", "cellular process", and "single-organism process" (Figure 2, Table S3).

Functional Annotation of the Polymorphic SSR-Containing Sequences
Based on the sequence similarity, the 424 unigenes containing 525 SSRs were mapped to 3561 terms in the GO database by blast. The terms under the three main GO categories were divided among 45 sub-categories. For 5 UTRs that contained SSRs, the most abundant terms in each main categories were "metabolic process", "cell", and "binding", respectively. For CDSs that contained SSRs, "cellular process" was the most abundant term, followed by "metabolic process", "cell", "cell part", and "single-organism process". For the 3 UTRs that contained SSRs, the top three terms were "metabolic process", "cellular process", and "single-organism process" (Figure 2, Table S3). The three lines of bubbles represent SSRs that were located in the 5′ untranslated regions (UTRs), protein-coding sequences, and 3′ UTRs, respectively. GO terms that contained unigenes more than or equal to ten in one of the gene regions are shown in the figure. The relative number of unigenes assigned to each term is indicated by the size of each bubble. The grey hollow circle indicates no unigene was assigned to the relevant GO term.
Pathway analysis assigned 228 genic-SSR-containing sequences to 77 KEGG pathways ( Figure  3, Table S4). For the 5′ UTRs, CDSs, and 3′ UTRs that contained SSRs, "metabolic pathways" was the most common and abundant term, followed by "plant hormone signal transduction", "endocytosis", and "biosynthesis of secondary metabolites" in each gene region.  The three lines of bubbles represent SSRs that were located in the 5 untranslated regions (UTRs), protein-coding sequences, and 3 UTRs, respectively. GO terms that contained unigenes more than or equal to ten in one of the gene regions are shown in the figure. The relative number of unigenes assigned to each term is indicated by the size of each bubble. The grey hollow circle indicates no unigene was assigned to the relevant GO term.

Validation of the Polymorphic SSRs
Pathway analysis assigned 228 genic-SSR-containing sequences to 77 KEGG pathways ( Figure 3, Table S4). For the 5 UTRs, CDSs, and 3 UTRs that contained SSRs, "metabolic pathways" was the most common and abundant term, followed by "plant hormone signal transduction", "endocytosis", and "biosynthesis of secondary metabolites" in each gene region. The three lines of bubbles represent SSRs that were located in the 5′ untranslated regions (UTRs), protein-coding sequences, and 3′ UTRs, respectively. GO terms that contained unigenes more than or equal to ten in one of the gene regions are shown in the figure. The relative number of unigenes assigned to each term is indicated by the size of each bubble. The grey hollow circle indicates no unigene was assigned to the relevant GO term.

Validation of the Polymorphic SSRs
Primer pairs were successfully designed for 795 identified SSRs using their flanking sequences (Table S5). To validate the effectiveness of these primers, 20 of the identified polymorphic SSRs were selected randomly and validated by PCR. Seventeen SSRs were successfully amplified with polymorphic loci. The loci size polymorphism levels were further assessed by capillary electrophoresis (Figure 4). A total of 76 alleles were detected with 2 to 7 (average 4.471) alleles obtained per locus (Table 3).  (Table 3).
Primer pairs were successfully designed for 795 identified SSRs using their flanking sequences (Table S5). To validate the effectiveness of these primers, 20 of the identified polymorphic SSRs were selected randomly and validated by PCR. Seventeen SSRs were successfully amplified with polymorphic loci. The loci size polymorphism levels were further assessed by capillary electrophoresis (Figure 4). A total of 76 alleles were detected with 2 to 7 (average 4.471) alleles obtained per locus ( Table 3). The expected heterozygosity (HE), ranged from 0.274 to 0.774 and the observed heterozygosity (HO) from 0.167 to 0.729, with mean values of 0.587 and 0.418, respectively. The polymorphism information content (PIC) value ranged from 0.240 to 0.728, with an average value of 0.531 (Table 3).    Ug, Cl, Ct, PS, RM, AS, T a and PF represent unigene, cluster, contig, primer sequence, repeat motif, allele size, annealing temperature and putative function respectively; "-" represents no blast hits with known proteins deposited in the public databases.

Identifying and General Profiling of Genic SSRs in T. mongolica Transcriptomes
SSRs have been recognized as an efficient tool for linking phenotypic and genotypic variations. They have been used widely in plants, for linkage map development, quantitative trait loci mapping, marker-assisted selection, parentage analysis, cultivar fingerprinting, genetic diversity studies, gene flow, and evolutionary studies [25,37]. Although microsatellite markers are versatile, developing polymorphic and robust SSR markers has been the main limitation for their use, especially for species that lack genomic background information. In recent years, next-generation sequencing technologies and a variety of computer programs (e.g., MISA [45], SSRFinder [46], SSRIT [47]) have promoted the identification of SSRs in model and non-model organisms [37]. However, the downstream development and identification of usable and researchable polymorphic SSRs is still a manual and time-consuming process [39]. Recently, an integrated procedure called CandiSSR was developed and used to obtain polymorphic SSRs in silico using genomic or transcriptomic sequence data [40]. In the present study, using the bioinformatics pipeline, we identified 811 potential polymorphic SSRs in six sequenced transcriptomes of T. mongolica. By comparing our finding with some other SSR development projects performed using CandiSSR, we found a similar polymorphic genic SSRs generation rate among these different studies. For example, 497 polymorphic genic SSRs were identified in the comparison of the transcriptomes representing the northern and southern population of Parrotia subaequalis [40]; and 1663 polymorphic EST-SSRs were identified by comparing the transcriptomes of tea plant and 19 Camellia species [48]. The identified SSRs in these studies were relatively small. One reason was probably attributed to the search criterial used in CandiSSR (SSRs with mono-nucleotide repeats and motif bigger than six nucleotides were excluded). Another reason was that the pipeline was designed to identify polymorphic SSRs directly based on the transcriptome or genome sequences comparison, thus the SSRs number would be significantly lower when compared to those SSR mining projects that were performed using only one dataset. Experimental validation showed that 17 of 20 randomly selected SSRs had 2 to 7 (average 4.471) polymorphic alleles per locus. The average H E and H O were 0.587 and 0.418, respectively (Table 3 and Figure 4). Although these genic SSRs may have low levels of polymorphism compared with the levels in genome SSRs (mean number of alleles per locus was 26.0, and mean H E and H O were 0.868 and 0.840, respectively [24]) in the T. mongolica genome, the genic SSRs have some intrinsic advantages (e.g., high transferability and functional importance) that make them a valuable resource that can be used for genetic and evolutionary studies of the species or its phylogenetically closed species and genus, such as species Larrea tridentata and genus Viviania marifolia [8,9]. In addition, Zhi et al. (2018) indicated that the highly polymorphic nuclear microsatellites might be over-estimating the genetic diversity of the present demography of T. mongolica [24], thus we considered that a rational genetic variation reserve could further be evaluated in T. mongolica by combing genetic data obtained from either the nuclear or genic SSRs. However, analyses of gene regions rather than non-coding genome regions evenly revealed a high level of genetic diversity in this endangered species [5,24], suggesting that the genome of the species is highly heterozygous to some extent and saturated with plenty of genetic variations that are potentially important for environmental adaptation and ongoing survival of the species. For the other three unamplified SSRs, we have speculated several possible reasons that might explain failure of the validations. Firstly, the PCR conditions might not have been suitable enough for amplifying the three SSRs. Secondly, the specificity of the primers was low. Thirdly, the primer binding sites may cross the boundary between introns and exons of the SSR-containing genes. Overall, whilst some markers could not be validated for technical reasons, we nevertheless were able to conclude that a large number of the identified SSRs in the present study are probably polymorphic and suitable for further study in T. mongolica transcripts.
Strong evidence has shown that SSRs are non-randomly distributed across CDSs and UTRs [29]. In many species, exons contain many more tri-and hexa-nucleotide SSRs than di-and tetra-nucleotide SSRs [49,50]. This distribution bias may be explained by the low number of non-trimeric SSRs found in coding regions, possibly caused by frameshift mutations [51], that might be related to features of adaptation or responses to environment stress, as was previously found in Escherichia coli K12 [35]. In T. mongolica, the distribution analysis of transcribed sequences showed that > 90.12% of the identified tri-nucleotide SSRs were located in coding regions, whereas 91.33% of the di-nucleotide SSRs were distributed in UTRs (Figure 1). The SSRs in transcribed regions also showed bias to a specific nucleotide composition [29]. It was reported that the UTRs, in particular the 5 UTRs, were strongly biased toward AG/CT in Arabidopsis thaliana [49,52]. In this study, almost all the identified AG/CT repeats were distributed in 5 UTRs and most of the AT/AT repeats were located in 3 UTRs. The SSRs in coding regions also exhibited a strong bias toward specific triplet repeats, such as AAG, AAC, ATC, AGC, AGG, and ACG, and these triplet repeats are known to be the most common repeats in the exons of dicotyledonous plants [53]. As shown in Figure 1, our results are consistent with these observations, so the identified SSRs in T. mongolica may have similar functions to their counterparts.

Probable Functions of Genic SSRs in the Environmental Adaptation of T. mongolica
Habitat heterogeneity and environmental pressures cause various stress responses that involve changes in gene transcription and translation in multiple cellular processes and metabolic pathways, and plants have evolved effective and efficient strategies to adapt to environmental changes for their survival [54,55]. Although narrowly distributed in the Ordos desert region, the eco-physiology characteristics of T. mongolica populations in their heterogeneous habitats (e.g., upland, hill, tableland, high plain) are different. Previous studies have shown that T. mongolica populations have different water parameters (water potential in saturated point, water potential in turgor loss point, available water capacity), endogenous hormones (indole-3-acetic acid, abscisic acid, gibberellin, zeatin riboside), antioxidant activities (superoxide dismutase, peroxidases, catalases, glutathione reductase), and photosynthesis, which suggests that the populations have evolved distinct adaptative mechanisms under heterogeneous growth conditions [14][15][16]. In recent decades, anthropogenic disturbances such as mineral exploitation, road building, overgrazing, and farming have led to habitat isolation, thus further accelerating the differentiation and survival adaptation of T. mongolica [19,20]. Genic SSRs have been proposed as fast-evolving genomic elements that might be of importance for adaptive evolution [34,36]. Therefore, it is reasonable to assume that the genes that contain polymorphic SSRs are important drivers of a plant's response to environmental pressure and fluctuations. In the present study, we identified 646 transcripts that contained 811 polymorphic SSRs that may play significant roles in the local adaptation processes of T. mongolica. Of the SSR containing sequences, the majority (620) returned at least one blast hit with the known protein database at the e-value > 1.0 × 10 -5 , while the remaining 26 unigenes did not match known genes, suggesting that these unannotated unigenes might represent unique genetic resources of T. mongolica. Substantial data indicate that SSR expansion or contraction may affect gene expression and activity, resulting in concomitant physiological consequences and observable phenotypic changes [29,55]. Variations in 5 UTRs have been found to regulate gene expression by affecting transcription and translation. In coding regions, variations can lead to a gain or loss of gene function via frameshift mutations or expanded toxic mRNA, and in 3 UTRs variations can cause transcription slippage and produce expanded mRNA, which can disrupt splicing and disrupt other cellular functions [25,29]. Variations in CT repeats in the 5 UTR of the starch-synthesis genes in waxy rice (Oryza sativa) were correlated with the physicochemical properties of starch [56]. Changes in the number of TGA repeats in the cellulase synthase gene (PtoCesA10) in Populus tomentosa resulted in an insertion mutation in the coding region and the addition of aspartate to the polypeptide chain of the enzyme, which is uniquely associated with the lignin content [57]. Myotonic dystrophy type 1 is caused by CTG repeat expansions in the 3 UTR of DMPK [58,59]. In our analysis, 150, 372, and 69 unigenes were found to contain 172, 470, and 76 SSRs in 5 UTRs, coding regions, and 3 UTRs, respectively (Table S2). The GO functional annotations showed that the identified SSR-containing sequences were enrichened in "cellular process", "metabolic process", "cell", "cell part", and "single-organism process" terms under the three main GO categories, suggesting such groups of T. mongolica genes might first be regulated by SSRs mutations under the heterogeneous environments ( Figure 2, Table S3). The KEGG analysis showed that among the SSR-containing 5 UTRs and CDSs, 35 were distributed in "plant hormone signal transduction" (Figure 3, Table S4), implying that the regulation of endogenous hormones may play significant roles in the local adaptation of T. mongolica. This prediction is partially supported by the eco-physiology observations described above [14][15][16]. In future studies, we plan to further investigate how changes in the SSR repeats in natural T. mongolica populations impact gene functions and clarify the probable functions of these SSRs, which would provide hints for the study of environmental adaptation of the species and information to answer the question of whether genic SSRs play significant roles in adaptive evolution.
Supplementary Materials: The following are available online at http://www.mdpi.com/2073-4425/11/3/322/s1: Table S1: Summary of the identified polymorphic SSRs by CandiSSR. Table S2: Summary of the location information of the identified polymorphic SSRs. Table S3: GO functional annotation of the polymorphic SSR containing sequences. Table S4: KEGG functional classification of the SSR containing sequences. Table S5: Primers for the polymorphic SSRs identified by CandiSSR.