1. Introduction
Avocado belongs to the family Lauraceae of the order Laurales, which includes some of the oldest flowering plants in the fossil record and which was already widespread in the Early Cretaceous [
1]. According to Chanderbali [
2], Laurales (avocado and relatives) belong to a key clade, the magnoliids, containing most basal angiosperms in the widely accepted angiosperm phylogeny. In fact, avocado could become an established genetic model plant for the clarification of angiosperm evolution [
2,
3]. Avocado, a species native to Mesoamerica, comprises three ecotypes. The two “subtropical” ecotypes (Guatemalan and Mexican) are now widely grown in warm to cool subtropical and Mediterranean climates in many countries and regions, while the “tropical” lowland (West Indian) ecotype is cultivated in tropical countries and warm subtropical regions [
4]. These three ecotypes are distinguished and identified according to their genetic and morphological differences [
1]. Avocado flowers exhibit a unique “protogynous dichogamous” opening behavior, which favors cross-pollination, and sterility barriers do not exist between or among the three ecotypes [
1]. Many commercial avocado accessions are thus often natural or artificial hybrids [
5]. To our knowledge, the only available transcriptome data are those of the Hass cultivar, classified as a Guatemalan × Mexican hybrid, and of an unknown Mexican accession [
6,
7,
8], which severely restricts genetic and breeding studies.
Avocado is among the most economically important subtropical/tropical fruit crops in the world, and production is increasing worldwide, with considerable growth in Mexico, the USA, Indonesia, Chile, Spain, Israel, Colombia, South Africa, and Australia [
1]. The global area of avocado cultivation has reached 563,916 hectares, with an annual yield of almost ten tons per hectare in 2016 [
9]. Global avocado consumption increased rapidly from 3,426,294 tons in 2008 to 5,567,044 tons in 2016 [
9]. One factor contributing to these increases in production and consumption is the expansion of avocado products into new markets in parts of the world where avocado was previously unknown or scarcely available [
1]. Certain constituents of avocado, including lipids, sugars, proteins, minerals, vitamins, and other nutrients and active ingredients, provide nutritional and health benefits [
10,
11,
12].
Precise identification of avocado germplasm is needed to make germplasm collections useful for plant breeders and farmers throughout the world [
1]. Molecular characterization is essential for assessing the level of genetic diversity in avocado germplasm. Over the past two decades, several genetic diversity studies of avocado germplasm have been conducted using a variety of molecular marker types, including RAPDs (random amplified polymorphic DNA) [
13], VNTRs (variable number of tandem repeats) [
14], RFLPs (restriction fragment length polymorphisms) [
15,
16], SSRs (simple sequence repeats) [
5,
17,
18], and SNPs (single nucleotide polymorphisms) [
19,
20,
21].
Of the many DNA markers that have been developed, SSRs, which consist of repeated nucleotide motifs of between one and six bases, are widely preferred in plant genetics and breeding because they are widely distributed and abundant in plant genomes, and they are genetically codominant, highly reproducible, multi-allelic, and well suited for high-throughput genotyping [
22,
23,
24,
25]. Transcriptome sequencing, which is based on next-generation sequencing technologies, is a high-throughput technique that facilitates the acquisition of a large number of unigene sequences for expressed sequence tag (EST)-derived marker development [
26,
27,
28]. Because they are derived from coding regions of the genome, EST-derived markers have an advantage over genomic DNA-derived markers: they can be efficiently amplified and reveal conserved sequences among related species [
29]. Rapid progress in the development of EST-SSR and EST-SNP loci based on transcriptome data has been made in sorghum [
30], Indian mulberry [
31],
Lycium barbarum [
32],
Bletilla striata [
33],
Lilium [
34], and Chinese hawthorn [
35].
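As an illustration of the SSR definition given above (repeated motifs of one to six bases), a minimal Python sketch for locating perfect SSRs in a unigene sequence is given here. The minimum repeat thresholds and the function name `find_ssrs` are assumptions chosen for illustration only; they are not the settings or the software used in this study.

```python
import re

# Hypothetical minimum repeat counts per motif length (illustrative assumption,
# not the thresholds used in the paper).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for k in range(1, 7):
        # A k-base motif followed by at least MIN_REPEATS[k] - 1 exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, MIN_REPEATS[k] - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "AGAG"),
            # since those belong to the shorter motif class.
            if any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // k))
    return sorted(hits)


if __name__ == "__main__":
    # Toy sequence containing a dinucleotide, a mononucleotide, and a trinucleotide SSR.
    toy = "TTGC" + "AG" * 8 + "CCGT" + "A" * 12 + "GGC" + "CTT" * 5 + "GA"
    for start, motif, count in find_ssrs(toy):
        print(start, motif, count)
```

Start positions are 0-based. Dedicated SSR search tools additionally handle compound and imperfect repeats, which this sketch ignores.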
Transcriptome sequencing based on next-generation sequencing technology was performed in this study using six avocado cultivars derived from different ecotypes. We carried out de novo assembly and gene annotation, identified a set of EST-SSR markers, and assessed their utility for genetic diversity analysis of 28 selected avocado accessions. The datasets and results reported here can serve as a public resource for the identification, classification, and utilization of avocado germplasm resources.
4. Discussion
Avocado is a member of the family Lauraceae of the order Laurales [
1]. Lauraceae is composed of 50 genera comprising 2500 to 3000 species of mostly trees and some shrubs, but only the genome of
Cinnamomum micranthum has been sequenced. The lack of genomic information has hampered critical research on augmenting marker-assisted breeding programs for avocado [
21]. Hence, the development of an effective marker system to assess genetic diversity in avocado collections would facilitate the maintenance of germplasm and cultivar improvement. Transcriptome sequencing and de novo assembly have proven to be important tools for gene discovery in many organisms and an effective method for molecular marker development [
23,
24,
25,
35]. De novo assembly of the avocado cv. “Hass” transcriptome during mesocarp development has been conducted, identifying tissue-specific regulation and biosynthesis of triacylglycerol (TAG) [
7,
8]. Moreover, the transcriptomes of aerial buds, leaves, flowers, stems, seeds, and roots from an unknown Mexican accession were determined using different sequencing platforms, revealing strong differences in gene expression patterns between organs [
6]. To date, however, little attention has been paid to transcriptome assemblies generated from different avocado ecotypes or to the development of EST-SSRs from transcriptome sequencing. The transcriptome assembly generated in our study, which includes 4.30–6.68 Gb of sequence data per cultivar from six avocado cultivars derived from different ecotypes, provides a large number of expressed unigenes that can contribute to downstream analyses in genetic studies and breeding improvement programs. Avocado is a highly variable species and has different ecotypes representing distinct evolutionary lineages [
1]. The six avocado cultivars subjected to transcriptome sequencing in the present study were classified as Mexican, Guatemalan, and West Indian races and Guatemalan × Mexican and Guatemalan × West Indian hybrids. The transcriptomes revealed strong differences in unigene expression between ecotypes. More differentially expressed unigenes were found between Duke 7 (Mexican) and Simmonds (West Indian) than between Duke 7 (Mexican) and Reed (Guatemalan), which suggests that Duke 7 and Reed may be more closely related genetically than Duke 7 and Simmonds. This is consistent with our previous study [
21]. Our transcriptome data may thus facilitate studies on genetic diversity across avocado ecotypes. In addition, the numbers of differentially expressed unigenes were generally higher in comparisons between Duke 7 and each of the other five avocado cultivars. Besides the ecotype factor, the cultivar/rootstock factor could also contribute to more differentially expressed unigenes. Duke 7, classified as a rootstock, is resistant to
Phytophthora cinnamomi, while the other five cultivars are known for their high fruit quality [
1].
Because it is inexpensive and rapid, transcriptome sequencing is useful for obtaining a large number of unigene sequences from organisms lacking a reference sequence [
46,
47,
48]. The N50 and mean lengths of avocado unigenes in our study were 1283 bp and 922 bp, respectively, which suggests that our sequence assembly was accurate and effective. These values were higher than those obtained for other species, such as sweet potato (765 and 481 bp) [
49],
Bletilla striata (957 and 612 bp) [
33],
Mucuna pruriens (987 and 626 bp) [
50],
Calanthe masuca (1196 and 704 bp) [
51],
Calanthe sinica (1086 and 625 bp) [
51], and
Onobrychis viciifolia (1224 and 709 bp) [
52]. The 154,310 unigenes derived from 29–45 million clean reads produced by Illumina sequencing in this study will facilitate further research on the physiology, biochemistry, and molecular genetics of avocado and related species.
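For readers less familiar with the two assembly statistics compared above, the following Python sketch shows how N50 and mean length are conventionally computed from a list of unigene lengths. It uses the standard definition of N50 (the length of the shortest sequence in the set of longest sequences that together cover at least half of the total assembled bases); the toy lengths are invented for illustration and are not data from this study.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half of all bases."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def mean_length(lengths):
    """Arithmetic mean of the sequence lengths."""
    return sum(lengths) / len(lengths)


if __name__ == "__main__":
    # Invented unigene lengths (bp), for illustration only.
    toy_lengths = [100, 200, 300, 400, 500]
    print("N50:", n50(toy_lengths))
    print("mean:", mean_length(toy_lengths))
```

Because N50 is weighted by sequence length, it is typically larger than the mean length, as is the case for the values reported here (1283 bp versus 922 bp).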
In this study, an average of 1.28 SSR loci were detected per SSR-containing unigene sequence, and the distribution density was one SSR locus per 4.41 kb; both values fall within the range reported for other species. For example, the average numbers of SSR loci per sequence and kb per SSR locus are, respectively, 1.35 and 4.02 in centipedegrass [
53], 1.19 and 4.35 in
Bletilla striata [
33], and 1.26 and 5.33 in
Mucuna pruriens [
54]. SSR motif type and abundance are the main characteristics of microsatellites. Similar to the results of previous research [
33], the most abundant motif types in this study were mononucleotides (34,104, 61.38%), followed by dinucleotides (13,720, 24.69%) and trinucleotides (7161, 12.89%). Within our polymorphic SSR set, (A/T)n and (AG/CT)n were the most prevalent in their respective repeat classes, similar to findings in other species [
33,
55,
56]. The bias towards (A/T)n is likely due to remnants of mRNA poly-A tails [
53]. Prior research has also suggested that (AG/CT)n repeat motifs are generally present in the 5′ untranslated regions and may be involved in transcription and regulation [
54,
56]. Alternatively, AG/CT motifs occur in the CUC and UCU codons, which encode Leu and Ser, respectively, two of the most abundant amino acids in proteins [
33]. The 100 EST-SSR markers randomly selected for validation in this study had an amplification rate of 57%, and 31 were markedly polymorphic. These amplification and polymorphism rates are lower than those of a previous report [
5]. In a subsequent genetic diversity analysis of these polymorphic EST-SSR markers among the 28 avocado accessions, 16 markers produced three to 12 alleles each (6.13 alleles per locus on average), which was lower than the 9.75 alleles per SSR locus reported by Alcaraz and Hormaza [
57], the 11.40 alleles per SSR locus reported by Gross-German and Viruel [
5], and the 18.8 alleles per SSR locus reported by Schnell [
18]. This could be because the expressed sequences from which EST-SSRs are derived are highly conserved. Nevertheless, in this study, we report an efficient protocol for the development of EST-SSR markers for avocado cultivars from RNA-seq data. In addition, 32 accession-specific alleles derived from 18 accessions were detected and could be used for molecular identification of the corresponding accessions. A polymorphism information content (PIC) above 0.5 is generally considered to indicate high polymorphism [
58]. In the present study, 13 out of 16 polymorphic EST-SSRs were highly polymorphic; the exceptions were Pa-eSSR-16 (PIC = 0.27), Pa-eSSR-10 (PIC = 0.33), and Pa-eSSR-3 (PIC = 0.46). These 13 EST-SSRs were highly informative and had high resolving power. SSRs are notorious for relatively high frequencies of null alleles; a high prevalence of null alleles (up to 15% for some loci) can bias allele frequencies, reduce observed heterozygosity, seriously inflate apparent levels of inbreeding, and therefore mislead genetic diversity analyses [
59]. In this study, seven null alleles were detected: one from Pa-eSSR-3, two from Pa-eSSR-6, one from Pa-eSSR-9, one from Pa-eSSR-11, one from Pa-eSSR-12, and one from Pa-eSSR-14. The percentage of null alleles at each locus is unlikely to affect the accuracy of the genetic diversity analysis of avocado. These six primer pairs failed to amplify any product in one or two genotypes. This can be explained by the fact that ESTs, which are derived from cDNAs, lack introns: unrecognized intron splice sites could disrupt priming sites, resulting in failed amplification; alternatively, large introns could fall between the primers, resulting in a product that is too large or, in extreme cases, in failed amplification [
60].
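To make the PIC threshold discussed above concrete, the following Python sketch computes PIC from the allele calls at a single locus using the commonly used definition PIC = 1 − Σ p_i² − Σ_{i<j} 2 p_i² p_j², where p_i is the frequency of allele i. Whether exactly this formulation was used in this study is an assumption, and the function names and toy allele calls are hypothetical.

```python
from collections import Counter
from itertools import combinations


def allele_frequencies(alleles):
    """Relative frequency of each distinct allele in a list of allele calls."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return [count / total for count in counts.values()]


def pic(freqs):
    """Polymorphism information content: 1 - sum(p_i^2) - sum_{i<j} 2 p_i^2 p_j^2."""
    homozygosity = sum(p * p for p in freqs)
    correction = sum(2 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1 - homozygosity - correction


def is_highly_polymorphic(pic_value, threshold=0.5):
    """Apply the PIC > 0.5 criterion cited in the text."""
    return pic_value > threshold


if __name__ == "__main__":
    # Hypothetical allele sizes (bp) scored at one locus; not data from this study.
    calls = [150, 150, 152, 152, 154, 154, 156, 156]
    value = pic(allele_frequencies(calls))
    print(round(value, 4), is_highly_polymorphic(value))
```

With this definition, a locus with only two equally frequent alleles cannot exceed a PIC of 0.375, so a PIC above 0.5 requires at least three alleles.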
The rough separation of the 28 avocado accessions into a Mexican and Guatemalan genotype-related population and a West Indian genotype-related population by STRUCTURE is reasonable and in agreement with previous reports [
18,
21]. These results suggested that Mexican, Guatemalan, and Guatemalan × Mexican hybrid accessions were most closely related to one another, while West Indian and Guatemalan × West Indian hybrid accessions were closer to each other. At
K = 3, STRUCTURE model-based inference could not clearly distinguish the 28 avocado accessions according to the three ecotypes and two interracial hybrid groups. This could be because interracial hybrids outnumbered the pure ecotypes; Guatemalan × West Indian hybrids alone accounted for half of the 28 avocado accessions. Similarly, Alcaraz and Hormaza [
59] analyzed the genetic diversity of 78 avocado cultivars using 16 gSSRs, and only 18 pure ecotypes could be observed in that avocado collection. Their results indicated that the dendrogram generated from UPGMA cluster analysis could be roughly divided into two main clusters with no bootstrap support and with accessions of different origin intermixed. Gross-German and Viruel [
5] suggested that interracial hybrid character could alter the diversity distribution and blur the boundaries among avocado ecotypes. However, these 16 polymorphic EST-SSRs could clearly and effectively distinguish the 28 avocado accessions based on the genetic distance and STRUCTURE results. These novel EST-SSR markers should therefore be helpful for future research on germplasm conservation and breeding programs for avocado. In addition, the analysis of genetic diversity is a prerequisite for the exploration and utilization of avocado germplasm.
5. Conclusions
In summary, we obtained 32.44 Gb of sequence data representing six avocado cultivars: one Guatemalan, one Mexican, one West Indian, one Guatemalan × Mexican hybrid, and two Guatemalan × West Indian hybrids. A total of 154,310 unigenes with an average length of 922 bp were detected and annotated in the Nr, Nt, Swiss-Prot, Pfam, GO, and KOG databases. Among these unigenes, 49,811, 16,309, and 25,162 were assigned to GO, KOG, and KEGG classifications, respectively. We detected 55,558 SSR loci in 43,270 unigene sequences and used them to develop 74,580 EST-SSR markers. From a randomly selected subset of 100 EST-SSR markers, we detected 16 polymorphic EST-SSR markers harboring a total of 98 alleles, ranging from three to 12 per locus. STRUCTURE analysis separated the 28 avocado accessions into two groups. These 16 newly developed EST-SSR markers should serve as a significant resource for the assessment of avocado accessions and may contribute to the better management of avocado resources for germplasm conservation and breeding programs.
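As a quick consistency check, the summary ratios quoted in the Discussion can be recomputed from the counts reported in this paper. The short Python sketch below uses only figures stated in the text; the variable names are illustrative.

```python
# Counts as reported in the text.
ssr_loci = 55_558            # SSR loci detected
ssr_unigenes = 43_270        # unigenes containing at least one SSR
motif_counts = {"mononucleotide": 34_104, "dinucleotide": 13_720, "trinucleotide": 7_161}
total_alleles = 98           # alleles across the polymorphic markers
polymorphic_markers = 16     # markers used in the diversity analysis

# Average number of SSR loci per SSR-containing unigene (reported as 1.28).
ssr_per_unigene = ssr_loci / ssr_unigenes

# Share of each motif class among all SSR loci, in percent
# (reported as 61.38%, 24.69%, and 12.89%).
motif_share = {name: 100 * count / ssr_loci for name, count in motif_counts.items()}

# Mean number of alleles per locus (exactly 6.125; reported as 6.13).
alleles_per_locus = total_alleles / polymorphic_markers

if __name__ == "__main__":
    print(f"SSR loci per SSR-containing unigene: {ssr_per_unigene:.4f}")
    for name, share in motif_share.items():
        print(f"{name}: {share:.3f}%")
    print(f"alleles per locus: {alleles_per_locus}")
```

The recomputed values agree with the reported ones after rounding to two decimal places.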