De Novo Assembly and Characterization of Bud, Leaf and Flowers Transcriptome from Juglans Regia L. for the Identification and Characterization of New Est-ssrs

Persian walnut (Juglans regia L.), valued for both its nut and wood, is an ecologically important temperate tree species native to the mountainous regions of central Asia. Despite its importance, there are still few transcriptomic resources in public databases for J. regia, limiting gene discovery and breeding. Here, more than 49.9 million sequencing reads were generated using Illumina sequencing technology in the characterization of the transcriptome of four J. regia organs (bud, leaf, female flowers, and male flowers). De novo assembly yielded 117,229 unigenes with an N50 of 1955 bp. Based on sequence similarity searches against known proteins, a total of 20,413 (17.41%) genes were identified and annotated. A set of 27,584 unigenes with SSR (simple sequence repeats) motifs were identified as potential molecular markers, and a sample of 77 of these EST-SSRs (express sequence tags) were further evaluated to validate their amplification and assess their polymorphism. Next, we developed 39 polymorphic microsatellite markers to screen 88 Persian walnut individuals collected from 11 populations. These markers and transcriptomic resources will be useful for future studies of population genetic structure, evolutionary ecology, and breeding of Persian walnut and other Juglans species.


Introduction
Juglans regia L., a diploid (2n = 32) walnut species, is known as Persian, English, or common walnut.It is native to the mountainous regions of central Asia [1][2][3].It is an ecologically important tree species valued for both its nuts and wood since ancient times [4][5][6].Walnut is cultivated commercially in nearly every nation with a temperate climate.World production of whole walnut (in-shell) was around 1.5 × 10 6 tons in 2008 [7].Despite its huge value, genomic resources for Persian walnut are limited.
The most abundant genetic resource for Persian walnut is microsatellites (simple sequence repeat, SSRs), which can be neutral or genic (expressed sequence tags, EST-SSRs).Recently, 185 polymorphic genomic, non-genic SSRs from J. regia were published and 398 EST-SSRs were identified by Zhang et al. through data mining, of which 41 were shown to be polymorphic [8,9].In general, EST-SSRs are more conserved than noncoding sequences; therefore, EST-SSR markers have a relatively high transferability to closely related species [10,11].A total of 21,294 EST sequences of J. regia have been deposited in the NCBI (National Center for Biotechnology Information) GenBank database.This represents an estimated 99.6% of all Juglans ESTs (21,375, as of October 2015).Zhang et al. identified 805 loci containing EST-SSRs, although only 13 EST-SSRs (2.5%) were tested extensively [9,12].Previous methods for developing SSRs from genomic DNA required costly and time-consuming approaches involving cDNA library construction, cloning, and labor intensive Sanger sequencing.
Next generation sequencing (NGS) of transcriptomes has proved an attractive alternative to whole genome sequencing [13][14][15].The transcriptome provides information on gene expression, gene regulation, and amino acid content of proteins.Therefore, transcriptome analysis is essential to interpret the functional elements of the genome and to provide insight into the proteins present in cells and tissues [10,16,17].Moreover, with traditional methods, sequencing of randomly selected cDNA clones often resulted in insufficient coverage of less-abundant transcripts, which potentially have essential functions [18,19].Transcriptome data generated by high-throughput sequencing has been an excellent resource for SSR marker development [20] and gene discovery [21,22].
In this study, we utilized Illumina paired-end sequencing to characterize the pooled transcriptome of buds, leaves, female flowers, and male flowers of Persian walnut.The resulting sequence data was used to develop EST-derived SSR markers.This study involved the: (1) characterization of the frequency and distribution of putative SSRs obtained from J. regia transcriptome and analysis of polymorphism in the EST-SSR markers derived from expressed sequences, and (2) exploration of the population structure of 88 individuals from 11 Chinese populations using the EST-SSR markers.These markers will be useful for genetic mapping, population genetic studies, evolutionary ecology, and breeding of Juglans species.

Sample Collections, DNA Extraction and RNA Extraction
For transcriptome sequencing, fresh leaves, buds, female flowers, and male flowers were collected on 28 April 2014 from a single, mature, healthy-appearing J. regia tree growing in the Qingling Mountains of western China and immediately frozen in liquid nitrogen prior to storage at −80 • C. Total RNA was extracted using OMEGA Bio-Tek's Plant RNA Kit (Norcross, GA, USA).RNA degradation and contamination was monitored on 1% agarose gels.RNA purity was assayed using the Nano Photometer ® spectro photometer (IMPLEN, Westlake Village, CA, USA) and RNA concentration was measured using the Qubit ® 2.0 Flurometer (Life Technologies, Carlsbad, CA, USA).RNA integrity was assessed using the Agilent Bioanalyzer 2100 system (Agilent Technologies, Santa Clara, CA, USA).We pooled equivalent amounts of all RNA from fresh leaves, buds, female flowers, and male flowers.
To verify the polymorphism of EST-SSRs sequenced for subsequent population genetic studies, we extracted DNA from 88 leaf samples collected from 11 locations in 2013 from J. regia trees in natural populations and "populations" of cultivars (MT, GZ, BS, YC, and CL) in China (Table 1).Each sampled wild tree was an autochthonous, healthy adult from a mountain forest orin some cases from a roadside in a primary forest.All sampled trees were growing at least 1000 m from any orchard, cultivated land, or human dwelling.Sampled trees were separated by at least 50 m."Populations" of cultivated trees were collected from farm land, villages, or near a house.Fresh leaves were collected and dried with silica gel.Genomic DNA was extracted following the methods of Doyle and Doyle [23] and Zhao and Woeste [24] and was resuspended in 50 µL of water, diluted to 10 ng/µL and then stored at −20 • C. RNA-seq libraries were generated using NEBNext Ultra™ RNA Library Prep Kit for Illumina (NEB, Beverly, MA, USA) following manufacturer's recommendations, and index codes were added to attribute sequences to each sample.Briefly, mRNA was purified from 3 µg total RNA using poly-T oligo-attached magnetic beads.Remaining overhangs were converted into blunt ends via exonuclease/polymerase activities of DNA polymerase and RNase H.After adenylation of 3' ends of DNA fragments, the NEBNext Adaptor with hairpin loop structure was ligated to prepare for hybridization.To enrich for 150-200 bp cDNA fragments, the library was first purified using the AMPure XP system (Beckman Coulter, Beverly, MA, USA).Afterward, 3 µL USER Enzyme (NEB, Beverly, MA, USA) was used with size-selected adaptor-ligated cDNA at 37 • C for 15 min followed by 5 min at 95 • C. Next, PCR was performed using the Phusion High-Fidelity DNA polymerase, universal PCR primers and an index primer.Finally, PCR products were purified (AMPure XP system) and library quality was assessed on the DNA high sensitivity chips using the Agilent Bioanalyzer 2100 system (Agilent, Santa Clara, UT, USA).

Transcriptome Assembling and Gene Annotation
Illumina HiSeq2000 sequencing was performed by Novogene Bioinformatics Technology Co., Ltd., Beijing, China [25].De novo transcriptome assembly was accomplished using Trinity [26] with default parameters.The Blast2GO version 2.5 program [27] was first used to analyze GO annotation of the assembled unigenes.Afterwards, GO functional classifications of the unigenes were performed using the WEGO version 1.0 software [28].Unigenes of the transcriptome were annotated based on data from the NCBI non-redundant protein sequences (Nr) database, and NCBI non-redundant nucleotide sequences (Nt) database, Clusters of Orthologous Group of proteins (KOG/COG) database, KEGG ortholog (KO) database, a manually annotated and reviewed protein sequence (Swiss-Prot protein) database, Gene Ontology (GO) database, and protein family (Pfam) database.The COGs protein database phylogenetically classifies the complete complement of protein sencoded in a genome.Each COG is a group of three or more proteins that are inferred to be orthologs.To further analyze the transcriptome of J. regia, all of the unigenes were submitted to the KEGG pathway database.The KEGG pathway database is a knowledge base for the systematic analysis of gene functions [29].KOG, Nr, Nt, and SwissProt database used NCBI Blast version 2.2.28+ [27,30].Afterwards, GO functional classifications of the unigenes were performed using the WEGO software [28].All BLAST searches were performed with an e-value of 1e −5 .Pfam protein domain prediction was performed using HMMER version 3 software [31].GO annotations using Blast2GO version 2.5 were performed using the cutoff e-value of 1e −6 [27,32].

Discovery of EST-SSRs, Primer Design, Amplification Conditions, and Marker Validation
Microsatellites were identified using Micro Satellite identification tool (MISA) [33] and sequences with ≥5 uninterrupted motifs were randomly selected for primer design by Primer 3 [34].For primer design, 77 unigenes were randomly selected from 16,699 sequences (unigenes) containing microsatellites that were not single nucleotide repeats.Primers were designed so that the predicted product size was 150-280 bp based on cDNA sequences and assuming no introns.Primer design parameters were set as follows: length range = 18-23 nucleotides with 21 as optimum, optimum annealing temperature = 55 • C and GC content 40%-60% with 50% as optimum.The PCR was programmed for 3 min at 94

Population Genetics Data Analysis
Genetic diversity per locus and population were evaluated based on the following descriptive summary statistics: number of alleles (N A ), observed (H O ) and expected (H E ) heterozygosity using the program GenAlEx version 6.5 [35].GENEPOP version 4.2 [36] and Arlequin version 3.5 [37] were used to test the Hardy-Weinberg equilibrium (HWE) and linkage disequilibrium (LD) for all loci.The significance of deviations from HWE and extent of LD was assessed with 1500 permutations using the program GENEPOP version 4.2 [36].The program CERVUS version 3.0 [38] was used to calculate polymorphic information content (PIC) and the program MICRO-CHECKER version 2.2.3 [39] was used to detect null alleles.Genetic differentiation among the five wild populations (F ST ) was tested using the program GENEPOP version 4.2 [36].The significance of variation in the F ST observed between the two populations was determined by permutation tests (10,000) using Arlequin version 3.5 [37].The software STRUCTURE version 2.3.4 [40] was used to derive a most likely number of ancestral populations represented by the samples and to determine the probability of assignment for each sample.We assumed independent allele frequencies with a burn-in length of 100,000 iterations, program run length of 1,000,000 iterations, and ten replicates per run for K = 2-8 clusters with the admixture model [35].The program STRUCTURE HARVESTER [41] was used to calculate the optimal value of K using the delta K criterion [41], the inferred clusters were drawn as colored box-plots using program DISTRUCT version 1.1 [42].The overall pattern of genetic variation among cultivated and wild trees was determined by principal coordinates analysis (PCoA) using GenALEx version 6.5 [35].The software IBD [43] was used to perform Mantel tests comparing matrices of geographic distance and genetic distance based on the isolation by distance web service (IBDWS) method [44].The UPGMA (unweighted pair-group method with arithmetic averaging) analysis based on Nei's genetic distance [45] was performed using GENEPOP version 4.2 [36].

Data Deposit
The transcriptome was submitted to the National Center for Biotechnology Information, the accession number was SRR3499221 for raw reads.

Sequence Assembly
To increase the likelihood of recovering rare transcripts, to obtain a broad sample of the transcriptome, and to observe potential tissue-specific splice variants, we used normalized RNA pools from leaves, buds, female and male flowers for sequencing.A total of 4.98 G high quality reads were used to assemble the J. regia transcriptome de novo based on the expression of genes from four plant organs.The raw transcript data included 49,929,297 reads, 250,222 transcripts, and 117,229 unigenes.As a result, 250,222 transcripts were obtained with an average length of 503 bp (N90) and a N50 of 1955 bp.The length of the unigenes varied from 201 bp to 17,048 bp, with an average of 725 bp and N50 value of 1226 bp (Figure 1).Transcripts over 500 bp accounted for about 62.4% of the total (Figure S1).

Gene Annotation of J. regia Transcriptomes
In total, 45,029 of 117, 229 unigenes were annotated to 55 functional sub-categories distributed under three main categories including biological process, cellular component, and molecular function (Figure 2).Nine functional sub-categories included few unigenes (Figure 2b).Within the biological process category, "cellular process" and "metabolic process" were the top two GO classes among 19 sub-categories (Figure 2a) while the smallest sub-categories were "growth", "rhythmic process", and "cell death" (Figure 2b).Within the cellular component category, "cell", "cell part", and "organelle" were the most common among the 19 sub-categories (Figure 3a; five categories shown in Figure 2b); "synapse part" and "synapse" were among the least represented.Within the molecular function category, the most highly represented sub-categories were "binding" and "catalytic activity" (Figure 2a), whereas the least represented were "receptor regulator activity" and "metallochaperone activity" (Figure 2b).

Gene Annotation of J. regia Transcriptomes
In total, 45,029 of 117, 229 unigenes were annotated to 55 functional sub-categories distributed under three main categories including biological process, cellular component, and molecular function (Figure 2).Nine functional sub-categories included few unigenes (Figure 2b).Within the biological process category, "cellular process" and "metabolic process" were the top two GO classes among 19 sub-categories (Figure 2a) while the smallest sub-categories were "growth", "rhythmic process", and "cell death" (Figure 2b).Within the cellular component category, "cell", "cell part", and "organelle" were the most common among the 19 sub-categories (Figure 3a; five categories shown in Figure 2b); "synapse part" and "synapse" were among the least represented.Within the molecular function category, the most highly represented sub-categories were "binding" and "catalytic activity" (Figure 2a), whereas the least represented were "receptor regulator activity" and "metallochaperone activity" (Figure 2b).

Functional Classification by the Orthologous Groups (COG)
All unigenes were aligned to the COG database to predict and classify possible functions.Out of 27,435 Nr hits, 11,983 sequences were assigned to COG classifications (Figure 3).Among the 26 COG categories, the cluster for general function (3432; 17.0%) represented the largest group, followed by transcription (1789; 8.9%), replication, recombination and repair (1665; 8.3%).Post-translational modification, protein turnover and chaperones (1577; 7.8%), signal transduction mechanisms (1487; 7.4%), carbohydrate transport and metabolism (1200; 6.0%) and translation, ribosomal structure and biogenesis (1161; 5.8%) were the largest sub-categories, whereas, only a few unigenes were assigned to nuclear structure and extracellular structure.In addition, 619 unigenes were assigned to secondary metabolite biosynthesis, transport, and catabolism (Figure 3).unigenes were assigned to nuclear structure and extracellular structure.In addition, 619 unigenes were assigned to secondary metabolite biosynthesis, transport, and catabolism (Figure 3).

Functional Classification by the KEGG Pathway
To further analyze the transcriptome of J. regia, all of the unigenes were analyzed in the KEGG pathway database.The KEGG pathway database is a knowledge-based site for the systematic analysis of gene functions in terms of networks of genes and molecules in cells and their variants specific to particular organisms.In total, 19,526 of 117,229 unigenes had significant matches in the database were assigned to 32 KEGG pathways in five main categories (Figure 4).Among these five main categories, translation was the largest (1738; 8.9%), followed by carbohydrate metabolism (1660; 8.5%), signal transduction (1434; 7.3%), folding, sorting and degradation (1402; 7.2%), and overview (1134; 5.8%) (Figure 4).

Functional Classification by the KEGG Pathway
To further analyze the transcriptome of J. regia, all of the unigenes were analyzed in the KEGG pathway database.The KEGG pathway database is a knowledge-based site for the systematic analysis of gene functions in terms of networks of genes and molecules in cells and their variants specific to particular organisms.In total, 19,526 of 117,229 unigenes had significant matches in the database were assigned to 32 KEGG pathways in five main categories (Figure 4).Among these five main categories, translation was the largest (1738; 8.9%), followed by carbohydrate metabolism (1660; 8.5%), signal transduction (1434; 7.3%), folding, sorting and degradation (1402; 7.2%), and overview (1134; 5.8%) (Figure 4).
We designed 13,947 pairs of SSR primers from 27,584 SSR sequences (Table S3).In order to verify the design of primers and to determine how many of the 13,947 SSR-containing unigenes could be amplified as scorable SSR markers, primers were designed to amplify a sample of 77 representative SSR-containing unigenes (Table S4).These 77 unigenes were chosen to include mono-nucleotide, dinucleotide, trinucleotide, tetranucleotide, pentanucleotide, and hexanucleotide repeats (Table 2; Figure 5d; Table S4).Of these 77 EST-SSRs, 39 were amplified bands with high specificity from walnut DNA (Table 2; Figure S2).Using the NCBI nucleotide database BLAST, we found that 23 of the 39 sequences matched previously submitted sequences with high similarity (e-value = 0).These 23 sequences were associated with a wide variety of functional genes: 17 of 23 (73.9%) were linked to disease-resistance, insect and pest resistance, or immunity, two were related
We designed 13,947 pairs of SSR primers from 27,584 SSR sequences (Table S3).In order to verify the design of primers and to determine how many of the 13,947 SSR-containing unigenes could be amplified as scorable SSR markers, primers were designed to amplify a sample of 77 representative SSR-containing unigenes (Table S4).These 77 unigenes were chosen to include mono-nucleotide, dinucleotide, trinucleotide, tetranucleotide, pentanucleotide, and hexanucleotide repeats (Table 2; Figure 5d; Table S4).Of these 77 EST-SSRs, 39 were amplified bands with high specificity from walnut DNA (Table 2; Figure S2).Using the NCBI nucleotide database BLAST, we found that 23 of the 39 sequences matched previously submitted sequences with high similarity (e-value = 0).These 23 sequences were associated with a wide variety of functional genes: 17 of 23 (73.9%) were Forests 2016, 7, 247 11 of 17 linked to disease-resistance, insect and pest resistance, or immunity, two were related to metabolism (JR2018 and JR3608), JR2600 was associated with salt tolerance, and JR6714 was associated with environment stress (Table S4).The remaining 39 of 77 primer pairs were excluded from further analysis due to lack of specificity or weak amplification.All 39 EST-SSRs that amplified specific products were also polymorphic (Figure S2) when used to analyze DNA from 88 Persian walnuts in 11 Chinese populations (Table 2, for sequences, see Table S5).Alleles per locus (N A ) ranged from two to seven with a mean of 3.64.Using MICRO-CHECKER 2.2.3 (Van et al.; 2004), we did not detect null alleles at any locus.The observed heterozygosity (H O ) and expected heterozygosity (H E ) varied from 0.016 to 0.929 (χ = 0.404) and from 0.052 to 0.715 (χ = 0.491), respectively.Polymorphic information content (PIC) ranged from 0.150 to 0.695, with a mean of 0.433.Loci that showed significant departure from Hardy-Weinberg equilibrium (HWE) were JR0082; JR1739; JR2465; JR2600; JR3434; JR4616; JR6226; JR6439; JR6742; JR6926; JR7171; JR7363; JR7544; JR8815 (Table 2).Annotations of these loci were shown in Table S4. to metabolism (JR2018 and JR3608), JR2600 was associated with salt tolerance, and JR6714 was associated with environment stress (Table S4).The remaining 39 of 77 primer pairs were excluded from further analysis due to lack of specificity or weak amplification.All 39 EST-SSRs that amplified specific products were also polymorphic (Figure S2) when used to analyze DNA from 88 Persian walnuts in 11 Chinese populations (Table 2, for sequences, see Table S5).Alleles per locus (NA) ranged from two to seven with a mean of 3.64.Using MICRO-CHECKER 2.2.3 (Van et al.; 2004), we did not detect null alleles at any locus.The observed heterozygosity (HO) and expected heterozygosity (HE) varied from 0.016 to 0.929 (  = 0.404) and from 0.052 to 0.715 (  = 0.491), respectively.Polymorphic information content (PIC) ranged from 0.150 to 0.695, with a mean of 0.433.Loci that showed significant departure from Hardy-Weinberg equilibrium (HWE) were JR0082; JR1739; JR2465; JR2600; JR3434; JR4616; JR6226; JR6439; JR6742; JR6926; JR7171; JR7363; JR7544; JR8815 (Table 2).Annotations of these loci were shown in Table S4.

Assessment of Genetic Diversity and Population Structure of J. regia in China Using 39 EST-SSRs
Analysis using STRUCTURE software revealed that the J. regia trees we sampled from 11 sites represented five populations.Using ΔK as the criterion, K = 5 showed the highest likelihood (Figure 6).Samples from two sites in southern China, GZ (cultivar) and YW (wild), comprised cluster I (green color block).Cluster II (blue color block) was comprised of wild population SC (Sichuan province) and some members sampled at (wild) site YW.Samples from three (wild) sites located in western and northwestern China comprised cluster III (yellow color block), XZ (Tibet), GS (Gansu province), and XJ (Sinkiang).Cultivated trees from two locations (MT and LS) comprised cluster IV  Analysis using STRUCTURE software revealed that the J. regia trees we sampled from 11 sites represented five populations.Using ∆K as the criterion, K = 5 showed the highest likelihood (Figure 6).Samples from two sites in southern China, GZ (cultivar) and YW (wild), comprised cluster I (green color block).Cluster II (blue color block) was comprised of wild population SC (Sichuan province) and some members sampled at (wild) site YW.Samples from three (wild) sites located in western and northwestern China comprised cluster III (yellow color block), XZ (Tibet), GS (Gansu province), and XJ (Sinkiang).Cultivated trees from two locations (MT and LS) comprised cluster IV (red color block of Figure 6).Cluster V (purple color group) was comprised of cultivated trees sampled from three locations (BS, CL, and YC) (Figure 6).
(red color block of Figure 6).Cluster V (purple color group) was comprised of cultivated trees sampled from three locations (BS, CL, and YC) (Figure 6).The first two coordinates in the principal coordinate analyses (PCoA) (Supplemental Figure S3, accounted for 80.8% of the observed variance).PCoA partitioned the samples into groups similar to those identified by the Bayesian software STRUCTURE (Figure 6; Figure S3).Based on PCoA, samples from the 11 sampled locations were divided into four groups: LS and MT as a group, BS, YC, and CL as a group, XZ, XJ, and GS as a group, and unlike the results from the Bayesian analysis, SC was assigned to a group with GZ and YW (Figure S3).

Discussion
Transcriptome sequencing is an effective method to obtain EST sequences [46] for developing molecular markers and identifying novel genes.Transcriptome sequencing also provides raw data for data mining studies, including the identification of SSRs [9,12,47].The transcriptome data we generated included over 26,000 sequences that contained SSRs, so 22.3% of all unigenes contained a microsatellite.Excluding mononucleotide repeats, dinucleotide repeats were the most frequent SSR motif type (35.8%), consistent with results previously reported for J. regia, cabbage (Brassica oleracea L. var.capitata L.), sweet potato (Ipomoea batatas L.), and white poplar (Populus tomentosa Carr.) [9,13,14,48,49], but different from coconut tree (Cocos nucifera L.), field pea (Pisum sativum L.) and fava bean (Vicia faba L.), species in which tri-nucleotides were the most abundant EST-SSR markers [14,49,50].The first two coordinates in the principal coordinate analyses (PCoA) (Supplemental Figure S3, accounted for 80.8% of the observed variance).PCoA partitioned the samples into groups similar to those identified by the Bayesian software STRUCTURE (Figure 6; Figure S3).Based on PCoA, samples from the 11 sampled locations were divided into four groups: LS and MT as a group, BS, YC, and CL as a group, XZ, XJ, and GS as a group, and unlike the results from the Bayesian analysis, SC was assigned to a group with GZ and YW (Figure S3).
It is well-known that EST-SSR markers are useful for the assessment of genetic diversity, the development of genetic maps, comparative genomics, and marker assisted selection in J. regia [51,52], and because EST-SSRs often have conserved primer sites, they are usually readily transferable to closely related species [53,54].EST-SSRs typically amplify more successfully than non-genic SSRs, and because they reside in genes, they are expected to reflect artificial and natural selection [11], but whether EST-SSRs are more sensitive than allozymes to selection is not certain [51].We designed primers and tested amplification for 77 unigenes containing SSRs.A total of 39 amplified with high specificity; of these, 23 (~68%) were polymorphic and highly similar to sequences previously submitted to NCBI (e-value = 0) (Supplemental Table S4).BLAST searches showed 30 of 77 unigenes (42.85%) had no significant match to known proteins.Many of the ESTs may indeed have been non-coding transcribed regions, which could explain the large number of unigenes that contained SSRs.Some shorter sequences from our transcriptome data that contained SSRs may have lacked a characterized protein domain, or may have contained a known protein domain but did not show a BLAST match because the query sequence was too short, resulting in a false-negative search result.The polymorphism information content (PIC) values of the EST-SSR markers ranged from 0.390 to 0.870 (mean = 0.681 ± 0.18), similar to PIC values reported by Zhang et al. [54], which ranged from 0.47 to 0.88.The number of polymorphic alleles ranged from 3 to 10, with a mean of 5.87; the number of alleles reported by Zhang et al. [54] was 2-4, and 2-25 in Zhang et al. [9].
Based on data from 39 EST-SSRs, STRUTURE, PCoA, and UPGMA produced similar genetic clusters of common walnut samples (Figure 6; Figure S3).The PCoA analysis pooled the blue (SC) and green (GZ and YW) populations that STRUCTURE separated (Figure 6; Figure S3).It is possible that YW represents admixture between SC and GZ, but the details of population structure in this region of China will require additional sampling.
Despite their importance, relatively little is known about the genetic diversity and structure of wild populations of common walnut in China.Most studies of J. regia in China have focused on cultivar development [55], and the ancient history of the crop [3].The genetic diversity of common walnut cultivars and seedlings used for breeding in China was described as rich and complex [56,57].In the Qinling Mountains of central China, the genetic variation of J. regia was mainly within populations, with low genetic differentiation among sampled sites based on ITS (internal transcribed spacer) sequences [58].
In our study, genetically similar Persian walnut samples (based on EST-SSR genotypes) were geographically clustered with the exception of sample locations BS, CL, and YC, which comprised STRUCTURE group V (Figure 6, Supplemental Figure S3).This result corresponded with a previous study of the genetic diversity and structure of nine common walnut populations in central and southwestern China that showed their genetic structure was in agreement with their geographic distribution [59].The non-geographically clustered genotypic group V (mentioned above) was comprised of cultivated trees, likely reflecting propagation by humans of types selected for commercial and horticultural properties [60][61][62].
The pattern of genetic diversity and structure we observed for common walnut in China is probably a consequence of a complex interaction of evolutionary forces such as adaptation/ecotype differentiation and human dispersal.Because J. regia is an important cultivated species and wild trees are not isolated from cultivated trees, gene flow between wild and cultivated populations is likely high.Cultivated walnuts have been moved over long distances for several millennia [3]; the resulting interactions between cultivated genotypes and wild trees presumably reduces genetic differentiation locally and on larger scales as well.Samples from cold and arid regions of China were genetically distinct (XJ, XZ, and GS), more likely reflecting adaptation than isolation because our analysis of isolation by distance (IBD) showed that the correlations between genetic distance and geographic distance were not significant.

Conclusions
This study provides the comprehensive, Illumina-based transcriptome sequence data used for the development of EST-SSRs in common walnut (J.regia).We generated more than 49.9 million paired-end reads comprising 117,229 unigenes with an average length of 725 bp from four different tissues of a single individual using de novo assembly.We identified 27,584 unigenes with SSR motifs as potential molecular markers.We tested 77 primer pairs in detail and found that 39 were polymorphic.These were used to screen 88 common walnut individuals collected from 11 populations.Our results further demonstrated that there is high allelic variation in Chinese J. regia.The transcriptome and markers we characterized provide additional tools for research on population genetics, evolutionary ecology, and breeding of Juglans, Juglandaceae, and other non-model species.

Supplementary Materials:
The following are available online at www.mdpi.com/1999-4907/7/10/247/s1. Figure S1: Length distribution of assembled transcripts and unigenes, Figure S2: PCR products and polymorphic characteristics of three EST-SSR markers across 48 Juglans regia samples, Figure S3: Principal coordinate analyses (PCoA) of 11 Chinese Persian walnut (Juglans regia) populations resolved into four genotype groups based on 39 microsatellite loci, Figure S4: Bayesian inference of the number of clusters (K), Table S1: The EST-SSR frequency type of Juglans regia, Table S2: The number of repeat motif in EST-SSRs, Table S3: The total of 13,947 pairs of SSR primers for Juglans regia, Table S4: BLAST search results for 77 SSR-containing ESTs from a pool of RNA from four walnut tissues, Table S5: Sequences for 39 microsatellite loci of Juglans regia.

Figure 1 .
Figure 1.The transcript (a) and unigene (b) length distribution of Juglans regia.

Figure 2 .
Figure 2. Gene Ontology classifications of assembled unigenes.The results are summarized in three main categories: Biological process, Cellular component, and Molecular function.(a) In total, 259,423 unigenes with BLAST matches to known proteins were assigned to gene ontology; (b) In total, 179 unigenes with BLAST matches to known proteins were assigned to gene ontology which are not list in Figure 2a.

Figure 2 .
Figure 2. Gene Ontology classifications of assembled unigenes.The results are summarized in three main categories: Biological process, Cellular component, and Molecular function.(a) In total, 259,423 unigenes with BLAST matches to known proteins were assigned to gene ontology; (b) In total, 179 unigenes with BLAST matches to known proteins were assigned to gene ontology which are not list in Figure 2a.

Figure 3 .
Figure 3. Histogram presentation of clusters of orthologous groups (COG) classification.All unigenes were aligned to COG database to predict and classify possible functions.Out of 27,435 Nr hits, 11,983 sequences were assigned to 26 COG classifications.(A) RNA processing and modification; (B) chromatin structure and dynamics; (C) energy production and conversion; (D) cell cycle control, cell division, chromosome partitioning; (E) amino acid transport and metabolism; (F) nucleotide transport and metabolism; (G) carbohydrate transport and metabolism; (H) coenzyme transport and metabolism; (I) lipid transport and metabolism; (J) transition, ribosomal structure and biogenesis; (K) transcription; (L) replication, recombination and repair; (M) cell wall/membrane/envelope biogenesis; (N) cell motility; (O) posttranslational modification, protein turnover, chaperones; (P) inorganic ion transport and metabolism; (Q) secondary metabolites biosynthesis, transport and catabolism; (R) general function prediction only; (S)function unknown; (T) signal transduction mechanisms; (U) intracellular trafficking, secretion, and vesicular transport; (V) defense mechanisms; (W) extracellular structures; (X) unnamed protein; (Y) nuclear structure; (Z) cytoskeleton.

Figure 3 .
Figure 3. Histogram presentation of clusters of orthologous groups (COG) classification.All unigenes were aligned to COG database to predict and classify possible functions.Out of 27,435 Nr hits, 11,983 sequences were assigned to 26 COG classifications.(A) RNA processing and modification; (B) chromatin structure and dynamics; (C) energy production and conversion; (D) cell cycle control, cell division, chromosome partitioning; (E) amino acid transport and metabolism; (F) nucleotide transport and metabolism; (G) carbohydrate transport and metabolism; (H) coenzyme transport and metabolism; (I) lipid transport and metabolism; (J) transition, ribosomal structure and biogenesis; (K) transcription; (L) replication, recombination and repair; (M) cell wall/membrane/envelope biogenesis; (N) cell motility; (O) posttranslational modification, protein turnover, chaperones; (P) inorganic ion transport and metabolism; (Q) secondary metabolites biosynthesis, transport and catabolism; (R) general function prediction only; (S)function unknown; (T) signal transduction mechanisms; (U) intracellular trafficking, secretion, and vesicular transport; (V) defense mechanisms; (W) extracellular structures; (X) unnamed protein; (Y) nuclear structure; (Z) cytoskeleton.

Figure 4 .
Figure 4. Pathway assignment based on the Kyoto Encyclopedia of Genes and Genomes (KEGG).(A) Classification based on cellular process categories; (B) classification based on environmental information processing categories; (C) classification based on genetic information processing categories; (D) classification based on metabolism categories; (E) classification based on organismal systems categories.

Figure 4 .
Figure 4. Pathway assignment based on the Kyoto Encyclopedia of Genes and Genomes (KEGG).(A) Classification based on cellular process categories; (B) classification based on environmental information processing categories; (C) classification based on genetic information processing categories; (D) classification based on metabolism categories; (E) classification based on organismal systems categories.

Figure 5 .
Figure 5. Characterization of simple sequence repeats (SSRs) in the common walnut (Juglans regia) transcriptome.(a) Distribution of different SSR repeat motif types; (b) number of different repeat motif; (c) frequency distribution of major SSRs based on main motif type; (d) Distribution of 77 SSR motifs in the J. regia transcriptome.

Figure 5 .
Figure 5. Characterization of simple sequence repeats (SSRs) in the common walnut (Juglans regia) transcriptome.(a) Distribution of different SSR repeat motif types; (b) number of different repeat motif; (c) frequency distribution of major SSRs based on main motif type; (d) Distribution of 77 SSR motifs in the J. regia transcriptome.

3. 6 .
Assessment of Genetic Diversity and Population Structure of J. regia in China Using 39 EST-SSRs

Figure 6 .
Figure 6.(a) Geographical distribution and cluster analysis of 11 J. regia populations using 39 EST-SSR markers in China.Pie charts represent total percentage of each of the five genotypic clusters found among all samples at each sampled site; (b) UPGMA cluster analysis of 11 populations of Chinese Persian walnut using 39 SSRs; (c) Results of the Bayesian model-based clustering STRUCTURE analysis of 88 individuals of Persian walnut (K = 5) (Supplemental Figure S4).

Figure 6 .
Figure 6.(a) Geographical distribution and cluster analysis of 11 J. regia populations using 39 EST-SSR markers in China.Pie charts represent total percentage of each of the five genotypic clusters found among all samples at each sampled site; (b) UPGMA cluster analysis of 11 populations of Chinese Persian walnut using 39 SSRs; (c) Results of the Bayesian model-based clustering STRUCTURE analysis of 88 individuals of Persian walnut (K = 5) (Supplemental Figure S4).

Table 1 .
Sources of samples of Juglans regia used for genotyping based on EST-SSRs.
• C followed by 35 cycles of 15 s at 93 • C, 1 min at annealing temperature (Tm) (Table2), 30 s at 72 • C and extension for 10 min at 72 • C. PCR reactions contained 5 µL 2 × PCR Waltham, MA, USA) in 10 µL reaction volumes (5 µL 2 × PCR Master Mix, 0.2 µL each primer, 1 µL BSA, 1 µL of 10 ng/µL DNA).Genomic DNA from 88 samples of J. regia was used for PCR amplification and analysis of polymorphism.All 88 genotypes were tested at all 77 loci for polymorphisms.PCR products were resolved on 8% polyacrylamide gels and visualized by silver staining.Fragment sizes of each locus were estimated using Quantity One version 4.62 Software (Bio-Rad Laboratories, Drive Hercules, CA, USA) and compared to a 50 bp ladder size standard.