Single-Molecule Long-Read Sequencing of Avocado Generates Microsatellite Markers for Analyzing the Genetic Diversity in Avocado Germplasm

Abstract: Avocado (Persea americana Mill.) is an important fruit crop commercially grown in tropical and subtropical regions. Despite the importance of avocado, relatively little genomic information is available for this fruit species. In this study, we functionally annotated the full-length avocado transcriptome sequence based on single-molecule real-time sequencing technology, and predicted the coding sequences (CDSs), transcription factors (TFs), and long non-coding RNA (lncRNA) sequences. Moreover, 76,777 simple sequence repeat (SSR) loci detected among the 42,096 SSR-containing transcript sequences were used to develop 149,733 expressed sequence tag (EST)-SSR markers. A subset of 100 EST-SSR markers was randomly chosen for an analysis that detected 15 polymorphic EST-SSR markers, with an average polymorphism information content of 0.45. These 15 markers were able to clearly and effectively characterize 46 avocado accessions based on geographical origin. In summary, our study is the first to generate a full-length transcriptome sequence and develop and analyze a set of EST-SSR markers in avocado. The application of third-generation sequencing techniques for developing SSR markers is a potentially powerful tool for genetic studies. The results of this study represent useful genetic and transcriptome information to support future research on avocado.


Introduction
Avocado (Persea americana Mill.), which belongs to the family Lauraceae of the order Laurales, is native to Mexico and Central and South America, and is one of the most economically important subtropical/tropical fruit crops worldwide [1]. Taxonomic treatments differ considerably in terms of the circumscription and definition of infraspecific avocado entities [2][3][4][5]. Additionally, researchers have long considered that geographical isolation has likely resulted in the following three ecological races of avocado: Mexican (P. americana var. drymifolia), Guatemalan (P. americana var. guatemalensis), and West Indian (P. americana var. americana) [1]. The Mexican race adapted to a Mediterranean climate, whereas the Guatemalan race originated in a tropical highland climate, and the West Indian race adapted to humid tropical lowland conditions [1].
Avocado is rich in lipids, sugars, proteins, minerals, vitamins, and other active ingredients [6][7][8]. Moreover, avocado production has increased worldwide [1]. One factor contributing to the increases in production and consumption is the expansion of avocado products into new global markets where avocado was previously unknown or scarce, including China, which is an emerging market for the production and consumption of avocado [1,9]. After avocado was first introduced and cultivated in China in the late 1950s, selective breeding by some national scientific research bodies and other state farms has resulted in the development of more than 10 superior avocado accessions [9,10].
Additionally, natural crosses among avocado accessions have generated new hybrids on state and private farms, and some native accessions are increasingly produced in somewhat remote areas with distinct local environmental conditions [9,10]. Avocado is broadly grown and exploited in some provinces in southern China, including Hainan, Guangxi, Yunnan, and Taiwan [9,10]. The climatic conditions in these provinces are subtropical to tropical, which are ideal conditions for the cultivation of avocado [9,10].
The avocado germplasm should be precisely characterized to maximize its utility to breeders worldwide [1]. Specifically, a molecular characterization is required for analyses of the genetic relationships among avocado germplasm. Over the past two decades, studies involving various types of molecular markers have examined the genetic relationships among avocado germplasm [11][12][13][14][15][16][17][18][19][20]. Of the many available DNA markers, simple sequence repeats (SSRs) are commonly used for investigating plant genetics and breeding because they are widely distributed and abundant in plant genomes. They are also genetically codominant, highly reproducible, multi-allelic, and well suited to high-throughput genotyping [21][22][23][24][25]. Expressed sequence tag (EST)-derived markers in the genomic coding regions have an advantage over genomic DNA-derived markers, and can be efficiently amplified to reveal conserved sequences among related species [26]. There has recently been increasing interest in developing EST-SSR markers via high-throughput transcriptome sequencing. Thus, there has been rapid progress in the development of EST-SSR markers based on transcriptome data produced with second-generation sequencing technology for Lilium brownii var. viridulum Baker [27], Crataegus pinnatifida Bunge [28], Acer miaotaiense P. C. Tsoong [29], and Rosa hybrida hort. ex Lavalle [30]. Among the third-generation sequencing platforms, PacBio RS II, which is regarded as the first commercialized third-generation sequencer, is based on single-molecule real-time (SMRT) technology [31]. The PacBio RS II system can produce much longer reads than second-generation sequencing platforms, and has been applied to effectively capture full-length transcript sequences for EST-derived marker development [32]. However, there are few reports regarding the application of EST-SSR markers developed with SMRT technology for crop breeding.
In a previous study, we generated the first full-length transcriptome sequence of avocado based on SMRT technology, and the short reads obtained by second-generation transcriptome sequencing in that study were used to correct the transcripts obtained with SMRT technology [42]. In this study, we functionally annotated the sequences and completed SSR mining experiments based on the SMRT data for avocado mesocarp. We also predicted the coding sequences (CDSs), transcription factors (TFs), and long non-coding RNA (lncRNA) sequences. Furthermore, we identified a set of EST-SSR markers, and assessed their utility for determining the genetic diversity among 46 selected avocado accessions from various locations in southern China. The generated data enabled the broad and distinct visualization of the genetic diversity in the analyzed avocado germplasm. The results of this study represent useful genetic and transcriptome information to support future research on avocado.
Table S1. Genomic DNA was extracted from fresh leaves as described by Ge [43].

PacBio cDNA Library Construction and Sequencing
Poly-T oligo-attached magnetic beads were used to purify the mRNA from the total RNA extracted from 15 mesocarp (pulp) samples collected at each analyzed developmental stage. The mRNA from all five developmental stages was combined to serve as the template to synthesize cDNA with the SMARTer PCR cDNA Synthesis Kit (Clontech, Mountain View, CA, USA). After a PCR amplification, quality control check, and purification, full-length cDNA fragments were acquired according to the BluePippin Size Selection System protocol, ultimately resulting in the construction of a cDNA library (1-6 kb). Selected full-length cDNA sequences were ligated to the SMRT bell hairpin loop. The concentration of the cDNA library was then determined with the Qubit 2.0 fluorometer, whereas the quality of the cDNA library was assessed with the 2100 Bioanalyzer (Agilent). Finally, one SMRT cell was sequenced with the PacBio RSII system (Pacific Biosciences, Menlo Park, CA, USA).

Illumina cDNA Library Construction and Sequencing
Oligo-(dT) magnetic beads were used to purify the mRNA from the total RNA extracted from 15 mesocarp (pulp) samples, comprising three biological replicates for each of five developmental stages. Each sample underwent an RNA-sequencing analysis. The fragmentation step was completed with divalent cations in heated 5× NEBNext First Strand Synthesis Reaction Buffer. First-strand cDNA was synthesized with a series of random hexamer primers and reverse transcriptase, after which the second-strand cDNA was generated with DNA polymerase I and RNase H. The cDNA libraries were constructed by ligating the cDNA fragments to sequencing adapters and amplifying the fragments by PCR. The libraries were then sequenced with the Illumina HiSeq 2000 platform (Nanxin Bioinformatics Technology Co., Ltd., Guangzhou, China).

Quality Filtering and Correction of PacBio Long-Reads
Raw reads were processed into error-corrected reads of insert (ROIs) using an isoform sequencing pipeline, with minimum full pass = 0.00 and minimum predicted accuracy = 0.80. Next, full-length, non-chimeric transcripts were detected by searching for the poly-A tail signal and the 5′ and 3′ cDNA primer sequences in the ROIs. Iterative clustering for error correction was used to obtain high-quality consensus isoforms, which were then polished with Quiver (version 1.0). The low-quality full-length transcript isoforms were corrected based on Illumina short reads with the default settings of the Proovread program. Redundancy among the high-quality and corrected low-quality transcript isoforms was removed with the CD-HIT software.

Mining of EST-SSR Markers
The MISA (version 1.0) program was used to locate SSRs with the following default settings: a minimum of 10 repeats for mononucleotides, 6 repeats for dinucleotides, and 5 repeats for tri- to hexanucleotides.
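The repeat thresholds described above can be illustrated with a minimal sketch. This is not the MISA implementation: it is a simplified regular-expression scan for perfect repeats only, the threshold table `MIN_REPEATS` is our reading of the settings stated in the text, and the function names `is_primitive` and `find_ssrs` are hypothetical. Compound and interrupted SSRs are not handled.

```python
import re

# Minimum repeat counts per motif length (assumed from the settings in the text).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return not any(n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n))


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs (0-based start)."""
    seq = seq.upper()
    hits = []
    for k, min_rep in sorted(MIN_REPEATS.items()):
        # A k-mer followed by at least (min_rep - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        pos = 0
        while pos < len(seq):
            m = pattern.match(seq, pos)  # anchored at pos
            if m and is_primitive(m.group(1)):
                hits.append((m.start(), m.group(1), len(m.group(0)) // k))
                pos = m.end()  # continue after the reported tract
            else:
                pos += 1
    return sorted(hits)


if __name__ == "__main__":
    demo = "GGC" + "AT" * 6 + "CCG" + "A" * 10 + "TTC" + "ACG" * 5 + "T"
    for start, motif, count in find_ssrs(demo):
        print(start, motif, count)
```

Skipping non-primitive motifs keeps a mononucleotide run from also being reported as a dinucleotide SSR, which is the main design choice in this sketch.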

Analyses of Detected Coding Sequences, Transcription Factors, and Long Non-Coding RNA Features
The open reading frames (ORFs) detected with the TransDecoder (version 3.0.0) program were designated as putative CDSs if they satisfied the following criteria: (1) An ORF was detected in a transcript sequence; (2) the log-likelihood score was >0, and was similar to what was calculated with the GeneID software; (3) the score was higher when the ORF was in the first reading frame than when the ORF was in the other five reading frames; (4) if a candidate ORF was within another candidate ORF, the longer one was reported. However, a single transcript could be associated with multiple ORFs (because of operons and chimeras); and (5) the putative encoded peptide matched a Pfam domain.
Transcription factor gene families were identified based on categorically defined TF families and criteria from the KO, KOG, GO, Swiss-Prot, Pfam, Nr, and Nt databases. Specifically, the default parameters of the iTAK (version 1.2) program were used. The methods used to identify and classify TFs were previously described by Perez-Rodriguez [46].
The following four computational tools were combined to sort non-protein-coding RNA candidates from putative protein-coding RNAs among the transcripts: the Coding Potential Calculator (CPC), Coding-Non-Coding Index (CNCI), Coding Potential Assessment Tool (CPAT), and Pfam database. Transcripts longer than 200 nt, with more than two exons, were selected as lncRNA candidates and were further screened with CPC/CNCI/CPAT/Pfam, which distinguished the protein-coding genes from the non-coding genes.
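The screening logic described above can be sketched as a set operation. The sketch assumes that a transcript is retained only if all four tools agree that it is non-coding (an intersection), which is our interpretation of "combined"; the function name, the input layout, and the transcript identifiers are hypothetical and are not taken from the original pipeline.

```python
def select_lncrna_candidates(transcripts, noncoding_calls, min_len=200, min_exons=2):
    """Return transcript IDs passing the length/exon filter and called non-coding by every tool.

    transcripts: dict mapping transcript ID -> (length_nt, exon_count)
    noncoding_calls: dict mapping tool name -> set of IDs that the tool judged non-coding
    """
    # Structural filter: longer than min_len nt and more than min_exons exons.
    eligible = {
        tid
        for tid, (length, exons) in transcripts.items()
        if length > min_len and exons > min_exons
    }
    # Consensus filter (assumption): non-coding according to all tools.
    consensus = set.intersection(*noncoding_calls.values())
    return eligible & consensus


if __name__ == "__main__":
    transcripts = {"t1": (1500, 3), "t2": (180, 3), "t3": (900, 2), "t4": (2500, 4)}
    calls = {
        "CPC": {"t1", "t2", "t3", "t4"},
        "CNCI": {"t1", "t3", "t4"},
        "CPAT": {"t1", "t2", "t4"},
        "Pfam": {"t1", "t3"},
    }
    print(sorted(select_lncrna_candidates(transcripts, calls)))
```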

Assignment of the Native Avocado Accessions with an Unknown Race
To validate the origins of the 33 native accessions with an unknown race, primers for six race-specific single nucleotide polymorphism (SNP) loci, listed in Table S2, were used for KASP genotyping [47]. The primer mix, which was prepared and used as described by KBioscience (http://www.kbioscience.co.uk), comprised 46 µL dH2O, 30 µL common primer (100 µM), and 12 µL each tailed primer (100 µM). The SNPs were amplified by PCR in a thermal cycler with a 5-µL solution consisting of 1× KASP Master mix, 10 ng genomic DNA, and the SNP-specific KASP assay mix. The same PCR amplification conditions were used for each SNP assay: 94 °C for 15 min; 10 touchdown cycles of 94 °C for 20 s and 58-61 °C for 60 s (decreasing by 0.8 °C per cycle); and 35 cycles of 94 °C for 20 s and 57 °C for 60 s. The resulting data were analyzed with the Roche LightCycler 480 (version 1.50.39) program.

Identification of EST-SSR Markers
To screen the EST-SSR loci, primers based on the sequences flanking the selected microsatellite loci were designed with the Primer3 program; the expected PCR products ranged from 100 to 300 bp. All assigned marker names included Pa-eSSR to indicate their association with P. americana and EST-SSRs. A subset of 100 EST-SSR primer pairs was randomly selected for validation by PCR amplification under the same conditions as those described by Ge [43]. The PCR products were analyzed with the 96-capillary 3730xl DNA Analyzer (Applied Biosystems, Foster City, CA, USA). The detection system included 8.9 µL HIDI (Applied Biosystems), 0.1 µL LIZ (Applied Biosystems), and 1 µL PCR products (1:10 dilution). A lack of amplification was considered indicative of a null allele.

Data Analysis
The number of observed alleles (Na), effective number of alleles (Ne), observed heterozygosity (Ho), expected heterozygosity (He), and polymorphism information content (PIC) of each EST-SSR were assessed with the POPGENE (version 1.32) program [48]. A cluster analysis was performed with PowerMarker (version 3.25) [49]. The cophenetic correlation coefficient was computed for the dendrogram after the construction of a cophenetic matrix to measure the goodness of fit between the original similarity matrix and the dendrogram. Bootstrap support values were obtained from 1000 replicates. A neighbor-joining tree was constructed based on shared alleles, and visualized with the MEGA6.0 software [50].
The homology with the other species was relatively low (1.14%-2.54% of transcripts; Figure 3). To further predict and classify the functions of the annotated transcripts, we analyzed their matching GO terms, eggNOG classifications, and KEGG pathway assignments. A total of 45,134 transcripts were assigned to 51 subcategories of the three main GO functional categories, with 106,390 assignments to biological processes, 45,931 to cellular components, and 69,120 to molecular functions (a transcript can be assigned to more than one subcategory; Figure 4a, Table S4). Next, 70,205 transcripts were functionally classified into 25 eggNOG categories (Figure 4b, Table S5). Among these categories, the most heavily represented group was posttranslational modification, protein turnover, chaperones (6410 transcripts, 8.94%), followed by signal transduction mechanisms (4189 transcripts, 5.84%) and transcription (3868 transcripts, 5.39%). Only 20 and 6 transcripts belonged to the cell motility and nuclear structure categories, respectively. Finally, 33,310 transcripts were assigned to 129 KEGG pathways (Table S6). The most represented pathways were related to carbon metabolism (1678 transcripts), protein processing in endoplasmic reticulum (1649 transcripts), and biosynthesis of amino acids (1503 transcripts).
Agronomy 2019, 9, 512

Predictions of ORFs, TFs, and lncRNAs
A total of 73,946 ORFs were predicted, 61,523 of which were complete CDSs. The number and length distribution of proteins encoded by the CDS regions are presented in Figure 5 and Additional file 1. A total of 7969 putative avocado TFs distributed in 203 families were identified (Table S7). The most abundant TF categories included RLK-Pelle_DLSV (241) and C3H (240). Additionally, the CPC, CNCI, CPAT, and Pfam database were combined to distinguish lncRNA candidates from putative protein-coding RNAs among the unannotated transcripts. Analyses with the CPC, CNCI, CPAT, and Pfam database revealed 7869, 6444, 16,464, and 15,579 transcripts, respectively, that were longer than 200 nt with more than two exons and were considered lncRNA candidates. A total of 3596 lncRNA transcripts were predicted (Figure 6).

Development of Polymorphic EST-SSR Markers, Analysis of Genetic Diversity, and KASP Genotyping
Using Primer3, we developed 149,733 EST-SSR markers from the 49,911 SSR loci (Table S9). To verify the amplification of the EST-SSR markers, a subset of 100 EST-SSR markers was randomly chosen and tested with seven accessions from various regions in southern China (Table S10). The primers for 30 of the tested markers generated polymorphic amplification products, whereas 37 primer pairs amplified nonpolymorphic products and 33 did not produce clear amplicons. The 30 polymorphic EST-SSR markers, which included 15 di-, 5 tri-, 5 tetra-, 2 penta-, and 3 hexanucleotide motif-based markers, were further verified with 46 avocado accessions. Finally, 15 polymorphic EST-SSR markers, with missing allele frequencies <10% for all 46 avocado accessions, were selected for subsequent analyses of genetic diversity (Table S11). A total of 71 alleles were detected at the 15 polymorphic EST-SSR loci in the 46 avocado accessions. Eight of these alleles were considered to be accession-specific and the other 63 alleles were found in multiple accessions. The 15 polymorphic EST-SSRs were applied to evaluate diversity parameters (Table 3). The Na per SSR locus varied from 2 to 10, with a mean of 4.73. The Ne varied from 1.04 to 4.39, with an average of 2.31, and Ho ranged from 0.04 to 0.93, with an average of 0.49. The He ranged from 0.04 to 0.77, with an average of 0.50, and PIC values ranged from 0.04 to 0.74, with an average of 0.45. Six race-specific KASP markers were used to determine the race of 33 avocado accessions with an unknown race. The KASP genotyping results demonstrated that all 33 avocado accessions were Guatemalan × West Indian hybrids based on the corresponding genotype of each race (Table S2).
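The diversity parameters reported above (Na, Ne, Ho, He, and PIC) can be illustrated for a single locus with the following minimal sketch. It uses the textbook definitions Ne = 1/Σp², He = 1 − Σp², and PIC = 1 − Σp² − Σ 2p_i²p_j² (summed over allele pairs i < j); whether POPGENE applies a sample-size correction to He is not stated in the text, so the uncorrected form here is an assumption. The function name `locus_stats` and the example genotypes are hypothetical.

```python
from collections import Counter
from itertools import combinations


def locus_stats(genotypes):
    """Diversity statistics for one locus.

    genotypes: list of (allele_a, allele_b) tuples for diploid individuals;
    individuals with missing calls are assumed to have been removed beforehand.
    """
    alleles = [allele for genotype in genotypes for allele in genotype]
    counts = Counter(alleles)
    freqs = [count / len(alleles) for count in counts.values()]
    sum_p2 = sum(p * p for p in freqs)  # expected homozygosity
    pair_term = sum(2 * p * p * q * q for p, q in combinations(freqs, 2))
    return {
        "Na": len(counts),                                          # observed alleles
        "Ne": 1 / sum_p2,                                           # effective alleles
        "Ho": sum(a != b for a, b in genotypes) / len(genotypes),   # observed heterozygosity
        "He": 1 - sum_p2,                                           # expected heterozygosity (uncorrected)
        "PIC": 1 - sum_p2 - pair_term,                              # polymorphism information content
    }


if __name__ == "__main__":
    # Hypothetical fragment sizes (bp) for four accessions at one locus.
    print(locus_stats([(150, 150), (150, 154), (154, 154), (150, 154)]))
```

As a consistency check on the reported values, 71 alleles over 15 loci gives 71/15 ≈ 4.73 alleles per locus, which matches the mean Na stated above.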

Analyses of Genetic Relationships Based on Polymorphic EST-SSRs from SMRT Sequencing Data
A cluster analysis grouped the 46 accessions into two major sections (Figure 7). The dendrogram revealed a clear separation between the native avocado accessions from Hainan province and those from Guangxi and Yunnan provinces. In cluster I, 19 Guatemalan × West Indian hybrids were clustered into two sub-sections. Sub-cluster I-I consisted of 13 native Guatemalan × West Indian hybrids from Guangxi province. Sub-cluster I-II contained two native Guatemalan × West Indian hybrids from Yunnan province. Cluster II comprised 27 Guatemalan × West Indian hybrids from Hainan province. Among these hybrids, 15 and 6 were obtained from the CATAS and DLSF, respectively.
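The clustering above relies on a distance computed from shared alleles. A minimal sketch of one common definition is given here, namely one minus the proportion of shared alleles averaged over the loci scored in both accessions. That PowerMarker's shared-allele distance reduces to this form for diploid individuals is our assumption, and the function name and genotype values are hypothetical.

```python
from collections import Counter


def shared_allele_distance(ind_a, ind_b):
    """Distance = 1 - (shared alleles) / (2 * loci compared) for two diploid individuals.

    ind_a, ind_b: lists of genotypes (allele pairs), one entry per locus in the
    same order; None marks a missing call, and such loci are skipped.
    """
    shared = 0
    compared = 0
    for genotype_a, genotype_b in zip(ind_a, ind_b):
        if genotype_a is None or genotype_b is None:
            continue
        # Multiset intersection counts alleles present in both genotypes.
        shared += sum((Counter(genotype_a) & Counter(genotype_b)).values())
        compared += 1
    if compared == 0:
        raise ValueError("no locus was scored in both individuals")
    return 1 - shared / (2 * compared)


if __name__ == "__main__":
    acc_1 = [(150, 154), (200, 200), (310, 314)]
    acc_2 = [(150, 150), (200, 204), None]
    print(shared_allele_distance(acc_1, acc_2))
```

A pairwise matrix of such distances over all accessions is the kind of input a neighbor-joining tree or a PCoA would then take.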
Figure 8 presents the distribution of the 46 avocado accessions for the first two principal coordinates of a principal coordinate analysis (PCoA). On the basis of the first coordinate, which accounted for 21.71% of the total variation, the accessions were generally distributed in two groups. The native avocado accessions from Hainan and Yunnan provinces were largely grouped separately from the native avocado accessions from Guangxi province. The second coordinate accounted for 10.06% of the total variation. Overall, the native avocado accessions were generally grouped according to their geographical origins.
Figure 7. Genetic relationships among the 46 analyzed avocado accessions based on the shared alleles for the 15 EST-SSR markers. GVTC, native avocado accessions from Guangxi Vocational and Technical College; MMSF, native avocado accessions from Mengmao State Farm; CATAS, native avocado accessions from the Chinese Academy of Tropical Agricultural Sciences; and DLSF, native avocado accessions from Daling State Farm. The native avocado accessions labeled with an asterisk originated from other regions.

Discussion
Transcriptome sequencing is a useful technique for obtaining a large number of transcripts for organisms lacking a reference sequence, at least partly because it is inexpensive and can be completed rapidly [51][52][53]. To date, several short-read next-generation sequencing (NGS) transcriptome databases have been developed for avocado mesocarp samples [54,55] and avocado mixed tissue samples [18,56]. However, the limited number and length of the transcript sequences derived from these short-read NGS studies have hampered their application in genetics and molecular biology research [41]. One of the advances in sequencing technology has been the development of the long-read SMRT sequencing technique, which enables researchers to obtain a substantial number of full-length sequences from a cDNA library [32]. In the current study, we applied the PacBio SMRT system to generate and analyze the full-length transcriptome of avocado mixed mesocarp samples collected at various developmental stages. The 25.79 Gb SMRT data produced in this study provide the first comprehensive insights into the avocado mesocarp, which is the most economically valuable organ of this fruit species, and might serve as the genetic basis for future research on avocado. Interestingly, the full-length transcriptome sequence described herein is also the first such sequence for a plant species from the family Lauraceae.
In this study, 93.82% (71,627 of 76,345) of the nonredundant transcripts were annotated based on similarities with sequences in public databases. Thus, a greater proportion of transcripts were annotated in this study than in previous investigations involving NGS data for various avocado races (49.00%) [18] and for avocado mesocarp samples (57.50%) [55]. We determined that the mean length of the avocado nonredundant transcripts was 2330 bp, implying that our sequences were long enough to represent full-length transcripts. Additionally, this mean length was within the range of the mean lengths obtained for other species, including Z. bungeanum (3414 bp) [32], T. pratense (2789 bp) [34], M. sativa (1706 bp) [37], Z. planispinum (1781 bp) [38], C. sinensis (1781 bp) [40], and Arabidopsis pumila (2194 bp) [57]. Moreover, the 76,345 nonredundant transcripts derived from the 25.79 Gb clean PacBio SMRT data produced in this study may facilitate future research on the physiology, biochemistry, and molecular genetics of avocado and related species.
A previous study indicated that lncRNAs may be important for gene regulation in eukaryotic cells, especially during some key biological processes [58]. However, the number of lncRNAs encoded in genomes as well as their characteristics remain largely unknown [59]. Predicting and functionally annotating lncRNAs is valuable but challenging because lncRNAs are not orthologous and there is a lack of homologous sequences between closely related species [38]. Unfortunately, very few of the lncRNA functions have been elucidated [60,61]. Hence, the lncRNA information for one species is not suitable for predicting the lncRNAs in another species. In this study, 3596 avocado transcript sequences (accounting for 4.71% of the total number of nonredundant transcripts) were putatively predicted as lncRNAs. This almost completely uncharacterized gene pool may include genes associated with agronomically relevant traits related to the most economically valuable organ (mesocarp).
The accurate identification of avocado germplasm races is needed to ensure that germplasm collections are optimally used by plant breeders and farmers worldwide [1]. The traditional assignment of avocado races based on morphological traits is imprecise because of environmental effects and a limited number of applicable characteristics [17]. Molecular-based characterizations are more consistent and valid for assigning avocado genotypes. We previously confirmed the universality of six race-specific KASP markers [47]. These markers were used in the current study to identify avocado accessions with an unknown race, with implications for the application of available avocado germplasms for breeding and resource conservation. Interestingly, the KASP genotyping results revealed that all of the native avocado accessions included in this study are Guatemalan × West Indian hybrids. The reason for this observation might be related to the introduction of avocado cultivars and the climates of the sample collection regions. First, the major avocado cultivars grown commercially are typically hybrids among the three races (i.e., mainly Guatemalan × West Indian and Guatemalan × Mexican hybrids) [1]. Since the late 1950s, Guatemalan × West Indian and Guatemalan × Mexican hybrids have been brought into China from other countries for cultivation in southern China [9]. Second, the native avocado accessions included in the present study are mainly from three geographical regions, namely Nanning located in the central and southern region of Guangxi province, Danzhou and Baisha located in the central and western region of Hainan province, and Ruili located in the western region of Yunnan province. The central and southern region of Guangxi province and the central and western region of Hainan province are characterized by a warm and humid oceanic climate and a relatively low altitude. Although Ruili is located in the western region of Yunnan province and far from the ocean, it still has a subtropical monsoon climate. The climates of these three regions resemble that of the areas in which the West Indian race originated, and are favorable for the growth of Guatemalan × West Indian hybrids. Therefore, Guatemalan × West Indian hybrids may have gradually become the dominant native avocado accessions because of artificial selection or via naturally occurring crosses.
The 100 EST-SSR markers randomly selected for validation in the present study had an amplification rate of 67%, and 30 were determined to be polymorphic. This polymorphism level is generally consistent with that of our previous study [18]. In subsequent analyses of the genetic diversity of these polymorphic EST-SSR markers among 46 avocado accessions, 15 markers produced 4.73 alleles per locus, which was fewer than the 6.13 alleles per locus of Ge [18], the 11.40 alleles per SSR locus of Gross-German and Viruel [17], the 18.8 alleles per SSR locus of Schnell [16], and the 9.75 alleles per SSR locus of Alcaraz and Hormaza [62]. Additionally, a PIC value > 0.5 is generally considered to represent a high polymorphism rate [63]. In this study, 7 of 15 polymorphic EST-SSRs had a PIC value < 0.5. This result may have been because the 46 avocado accessions in this study all belong to the same group (Guatemalan × West Indian hybrids), with relatively low genetic diversity.
In this study, a cluster analysis and a PCoA grouped the native avocado accessions according to where they originated. However, some of the native avocado accessions derived from different regions were included in the same sub-cluster. For example, Renong No. 13 from Hainan province clustered with the native accessions from Guangxi province. One factor leading to this mixed clustering is the fact that avocado germplasm resources have been exchanged among researchers and breeders since the late 1980s. The CATAS, which is a national scientific research unit, was commissioned to popularize superior avocado accessions among breeders at adjacent state farms or at other national scientific research units. Some superior native accessions from the CATAS may be the male or female parent of other native accessions from various state farms or other national scientific research units, which is consistent with our study results. Furthermore, a cluster analysis grouped two native avocado accessions from Yunnan province with the native avocado accessions from Guangxi province. In contrast, our PCoA indicated that these two native avocado accessions from Yunnan province belong to the same group as the native avocado accessions from Hainan province. We speculate that the small number of native avocado accessions from Yunnan province (i.e., two) may have led to these contradictory results from the two statistical analyses. At many avocado plantations in Yunnan province, the local avocado accessions have been replaced by "Hass," which is the most economically valuable avocado cultivar, ultimately making it difficult to collect local avocado accessions. Thus, the challenge of maximizing the economic benefits of cultivating specific avocado cultivars while ensuring that avocado genetic resources are conserved will need to be addressed.

Conclusions
We annotated SMRT sequencing data based on the COG, GO, KEGG, KOG, Pfam, Swiss-Prot, eggNOG, and Nr databases. Among 71,627 transcripts, 45,134, 52,125, and 33,310 were annotated according to GO, eggNOG, and KEGG classifications, respectively. We detected 76,777 SSR loci in 42,096 transcript sequences and used them to develop 149,733 EST-SSR markers. From a randomly selected subset comprising 100 EST-SSR markers, we finally identified 15 polymorphic EST-SSR markers that detected 71 alleles, with 2-10 alleles per locus. A cluster analysis and a PCoA separated the 46 avocado accessions according to their geographical origins. These 15 newly developed EST-SSR markers may be useful for future analyses of avocado accessions and may contribute to the improved management of avocado resources for germplasm conservation and breeding programs.